CN111860130A

CN111860130A - Audio-based gesture recognition method, device, terminal device and storage medium

Info

Publication number: CN111860130A
Application number: CN202010505950.5A
Authority: CN
Inventors: 张进; 马鸿
Original assignee: Southern University of Science and Technology
Current assignee: Southern University of Science and Technology
Priority date: 2020-06-05
Filing date: 2020-06-05
Publication date: 2020-10-30
Anticipated expiration: 2040-06-05
Also published as: CN111860130B

Abstract

The present application is applicable to the technical field of human-computer interaction, and provides an audio-based gesture recognition method, device, terminal device and storage medium. The audio-based gesture recognition method includes: acquiring a target audio signal, the target audio signal is an audio signal received after a preset original audio signal is modulated and propagated through a target gesture made by a user; Channel estimation is performed on the audio signal and the target audio signal to obtain characteristic data of the channel estimation; the characteristic data of the channel estimation is identified to obtain the recognition result of the target gesture. The present application extracts the feature data of the target gesture on the original audio signal and the target audio signal through channel estimation, so as to obtain accurate channel estimation feature data, so as to improve the accuracy of the gesture recognition result.

Description

Audio-based gesture recognition method, device, terminal device and storage medium

技术领域technical field

本申请属于人机交互技术领域，尤其涉及一种基于音频的手势识别方法、装置、终端设备和存储介质。The present application belongs to the technical field of human-computer interaction, and in particular, relates to an audio-based gesture recognition method, device, terminal device and storage medium.

背景技术Background technique

随着智能设备的普及，智能设备所配备的传感器也越来越多，这使得利用现有的商业设备中内嵌的传感器进行手势识别变得越来越方便。With the popularity of smart devices, more and more sensors are equipped with smart devices, which makes it more and more convenient to use sensors embedded in existing commercial devices to perform gesture recognition.

现有的基于音频的手势识别方法，通常是基于连续波的多普勒效应进行手势识别。采用这种方式虽然能克服基于视觉、可穿戴设备的惯性传感器进行手势识别所带来的使用场景受限问题，但由于采用的连续波信号的分辨率较低，导致手势识别的准确率较低。The existing audio-based gesture recognition methods are usually based on the continuous wave Doppler effect for gesture recognition. Although this method can overcome the limited use scene caused by gesture recognition based on inertial sensors of vision and wearable devices, the accuracy of gesture recognition is low due to the low resolution of the continuous wave signal used. .

发明内容SUMMARY OF THE INVENTION

有鉴于此，本申请实施例提供了一种基于音频的手势识别方法、装置、终端设备和存储介质，通过信道估计对原始音频信号和目标音频信号进行目标手势的特征数据提取，获得精确的信道估计的特征数据，以提高手势识别结果的准确率。In view of this, the embodiments of the present application provide an audio-based gesture recognition method, device, terminal device, and storage medium, which extract feature data of target gestures on the original audio signal and the target audio signal through channel estimation to obtain accurate channel Estimated feature data to improve the accuracy of gesture recognition results.

第一方面，本申请实施例提供了一种基于音频的手势识别方法，包括：In a first aspect, an embodiment of the present application provides an audio-based gesture recognition method, including:

获取目标音频信号，所述目标音频信号为预设的原始音频信号在调制后，传播经过用户做出的目标手势后接收到的音频信号；Acquiring a target audio signal, the target audio signal is the audio signal received after the preset original audio signal is modulated and propagated through the target gesture made by the user;

基于所述原始音频信号和所述目标音频信号进行信道估计，得到信道估计的特征数据；Perform channel estimation based on the original audio signal and the target audio signal to obtain characteristic data of the channel estimation;

对所述信道估计的特征数据进行识别，得到所述目标手势的识别结果。Identify the feature data of the channel estimation to obtain the recognition result of the target gesture.

本申请实施例对调制后原始音频进行播放，通过用户做出手势，获取包含手势特征的目标音频信号，通过原始音频信号和目标音频信号进行信道估计，得到精确的信道估计的特征数据，将获得的所述信道估计的特征数据进行识别，输出准确的目标手势识别结果。In this embodiment of the present application, the modulated original audio is played, the user makes a gesture to obtain a target audio signal containing gesture features, and channel estimation is performed by using the original audio signal and the target audio signal to obtain accurate channel estimation feature data. The feature data of the channel estimation is identified, and an accurate target gesture recognition result is output.

进一步地，所述原始音频信号为周期信号，所述基于所述原始音频信号和所述目标音频信号进行信道估计，得到信道估计的特征数据包括：Further, the original audio signal is a periodic signal, and the channel estimation is performed based on the original audio signal and the target audio signal to obtain characteristic data of the channel estimation including:

对所述目标音频信号进行解调处理，得到目标基频信号；Perform demodulation processing on the target audio signal to obtain a target fundamental frequency signal;

对所述目标基频信号进行分段，得到多个目标信号片段，每个所述目标信号片段的长度均和所述原始音频信号的周期相同；The target fundamental frequency signal is segmented to obtain a plurality of target signal segments, and the length of each target signal segment is the same as the period of the original audio signal;

对于每个所述目标信号片段，均分别与所述原始音频信号中一个周期的信号片段进行信道估计，得到各自的信道特征数据；For each of the target signal segments, channel estimation is performed with a period of signal segments in the original audio signal, respectively, to obtain respective channel characteristic data;

将各个所述目标信号片段的信道特征数据整合，得到所述信道估计的特征数据。The channel characteristic data of each of the target signal segments are integrated to obtain the characteristic data of the channel estimation.

通过将目标音信信号做解调、按周期分段获得目标信号片段，然后利用原始音频信号中的一个周期的信号片段分别与每个目标信号片段进行信道估计得到各自的信道特征数据，最后将各个信道特征数据合并，即可获得更为准确的信道估计的特征数据，从而提高手势识别的准确率。The target signal segment is obtained by demodulating the target audio signal and segmenting it according to the period, and then using a period signal segment in the original audio signal to perform channel estimation with each target signal segment to obtain the respective channel characteristic data. By combining the channel feature data, more accurate channel estimation feature data can be obtained, thereby improving the accuracy of gesture recognition.

进一步的，所述对所述目标音频信号进行解调处理，得到目标基频信号包括：Further, performing demodulation processing on the target audio signal to obtain the target fundamental frequency signal includes:

对所述目标音频信号进行降载波和IQ分解，得到降载波信号的实部信号和虚部信号；Carrying out carrier reduction and IQ decomposition to the target audio signal to obtain a real part signal and an imaginary part signal of the reduced carrier signal;

使用低通滤波器对所述降载波信号的实部信号和虚部信号去噪，得到所述目标基频信号。Using a low-pass filter to denoise the real part signal and the imaginary part signal of the down-carrier signal to obtain the target fundamental frequency signal.

由于麦克风接收的音频信号属于原始音频信号调制后的音频信号，而信道估计实用的是调制前的信号进行。因此需要对目标音频信号进行解调，即进行降载波和IQ分解，得到降载波信号的实部信号和虚部信号；再将降载波信号的实部信号和虚部信号通过低通滤波器，得到过滤掉噪音干扰的目标音频信号，提升目标音频信号的分辨率。Since the audio signal received by the microphone belongs to the audio signal modulated by the original audio signal, channel estimation is practically performed on the signal before modulation. Therefore, it is necessary to demodulate the target audio signal, that is, carry out carrier reduction and IQ decomposition to obtain the real part signal and imaginary part signal of the reduced carrier signal; Obtain the target audio signal with noise interference filtered out, and improve the resolution of the target audio signal.

进一步的，每个所述目标信号片段均带有时间戳，所述将各个所述目标信号片段的信道特征数据整合，得到所述信道估计的特征数据包括：Further, each of the target signal segments has a time stamp, and the channel feature data obtained by integrating the channel feature data of each of the target signal segments to obtain the channel estimation feature data includes:

将各个所述目标信号片段的信道特征数据分别表示为向量的形式，得到各个所述目标信号片段的特征向量；Representing the channel feature data of each of the target signal segments in the form of vectors, respectively, to obtain a feature vector of each of the target signal segments;

将各个所述目标信号片段的特征向量按照各自对应的时间戳大小进行排列，构成特征矩阵；Arranging the feature vectors of each of the target signal segments according to their respective timestamp sizes to form a feature matrix;

对于所述特征矩阵中的每一列，均将各自包含的各个元素值分别减去前一列包含的各个对应元素值，得到静态消除后的所述特征矩阵；For each column in the feature matrix, the respective element values contained in each are subtracted from the corresponding element values contained in the previous column to obtain the feature matrix after static elimination;

将静态消除后的所述特征矩阵确定为所述信道估计的特征数据。The eigenmatrix after static elimination is determined as eigendata of the channel estimation.

通过将信道特征数据转化为特征向量，并依据时间顺序将特征向量进行排列合并生成特征矩阵，基于特征矩阵的前后列的各个元素值分别进行相减获得静态消除后的特征矩阵，即信道估计的特征数据。静态消除后的信道估计特征数据可以消减静态反射信号对手势信号的影响，实现对细粒度手势的准确识别。By converting the channel feature data into feature vectors, and arranging and combining the feature vectors according to the time sequence, a feature matrix is generated, and the feature matrix after static elimination is obtained by subtracting the element values of the front and rear columns of the feature matrix. characteristic data. The channel estimation feature data after static elimination can reduce the influence of the static reflection signal on the gesture signal, and realize accurate recognition of fine-grained gestures.

进一步的，所述对所述信道估计的特征数据进行识别，得到所述目标手势的识别结果包括：Further, the identifying the feature data of the channel estimation to obtain the recognition result of the target gesture includes:

采用预先构建的域自适应神经网络模型对所述信道估计的特征数据进行识别，得到所述目标手势的识别结果；Using a pre-built domain adaptive neural network model to identify the feature data of the channel estimation to obtain a recognition result of the target gesture;

其中，所述域自适应神经网络模型通过以下步骤构建：Wherein, the domain adaptive neural network model is constructed by the following steps:

获取第一训练数据集和第二训练数据集，所述第一训练数据集包含多组已分配域标签和手势标签的信道估计样本数据，所述第二训练数据集包含多组已分配域标签但未分配手势标签的信道估计样本数据；Obtain a first training data set and a second training data set, the first training data set includes multiple sets of channel estimation sample data with assigned domain labels and gesture labels, and the second training data set includes multiple sets of assigned domain labels Channel estimation sample data without assigning gesture labels;

以所述第一训练数据集和所述第二训练数据集作为训练集，训练得到所述域自适应神经网络模型。Using the first training data set and the second training data set as training sets, the domain adaptive neural network model is obtained by training.

预先构建好的域自适应神经网络模型，通过大量分配域标签和手势标签的信道估计样本和仅分配了域标签没有分配手势标签的信道估计样本对所述神经网络模型进行训练，可以实现对域无关的目标特征手势提取和识别，优化识别的准确率。The pre-built domain adaptive neural network model trains the neural network model through a large number of channel estimation samples assigned domain labels and gesture labels and channel estimation samples that are only assigned domain labels but not assigned gesture labels. Gesture extraction and recognition of irrelevant target features to optimize the accuracy of recognition.

进一步的，所述以所述第一训练数据集和所述第二训练数据集作为训练集，训练得到所述域自适应神经网络模型包括：Further, using the first training data set and the second training data set as training sets, and obtaining the domain adaptive neural network model through training includes:

将所述第一训练数据集输入初始域自适应神经网络模型，得到所述第一训练数据集的手势预测结果和域预测结果；Inputting the first training data set into an initial domain adaptive neural network model to obtain the gesture prediction result and the domain prediction result of the first training data set;

根据所述第一训练数据集已分配的域标签和手势标签，以及所述第一训练数据集的手势预测结果和域预测结果，计算得到第一交叉熵E_label；According to the assigned domain label and gesture label of the first training data set, and the gesture prediction result and the domain prediction result of the first training data set, calculate the first cross entropy E _label ;

将所述第二训练数据集输入所述初始域自适应神经网络模型，得到所述第二训练数据集的手势预测结果和域预测结果；Inputting the second training data set into the initial domain adaptive neural network model to obtain the gesture prediction result and the domain prediction result of the second training data set;

根据所述第二训练数据集已分配的域标签，以及所述第二训练数据集的手势预测结果和域预测结果，计算得到第二交叉熵E_unlab；Calculate the second cross entropy E _unlab according to the assigned domain label of the second training data set, and the gesture prediction result and the domain prediction result of the second training data set;

根据所述第一训练数据集已分配的域标签、所述第二训练数据集已分配的域标签、所述所述第一训练数据集的域预测结果以及所述第二训练数据集的域预测结果，计算得到域交叉熵E_s；Based on the assigned domain labels of the first training dataset, the assigned domain labels of the second training dataset, the domain prediction results of the first training dataset and the domain of the second training dataset The prediction result is calculated to obtain the domain cross entropy E _s ;

根据所述第一交叉熵、所述第二交叉熵和所述域交叉熵计算所述初始域自适应神经网络模型的交叉熵E＝E_label+αE_unlab-βE_s,其中α与β为域自适应神经网络模型的模型参数；Calculate the cross-entropy of the initial domain adaptive neural network model according to the first cross-entropy, the second cross-entropy and the domain cross-entropy E=E _label +αE _unlab -βE _s , where α and β are domains Model parameters of the adaptive neural network model;

以所述交叉熵E取最小值为目标，对所述初始域自适应神经网络模型的模型参数进行优化，得到所述域自适应神经网络模型。Taking the minimum value of the cross entropy E as the goal, the model parameters of the initial domain adaptive neural network model are optimized to obtain the domain adaptive neural network model.

通过分别计算两个训练数据集手势识别的第一交叉熵、第二交叉熵和域交叉熵，构建出域自适应神经网络模型的交叉熵，把所述交叉熵的最小值当做目标，调整域自适应网络模型的模型参数进行优化，得到优化后的域自适应神经网络模型。优化过程中降低域独有的和手势无关的信号特征权重，提高对新域数据集识别的准确率，提高系统的泛化性。By calculating the first cross-entropy, the second cross-entropy and the domain cross-entropy of gesture recognition in the two training datasets respectively, the cross-entropy of the domain-adaptive neural network model is constructed, and the minimum value of the cross-entropy is taken as the target, and the domain is adjusted The model parameters of the adaptive network model are optimized to obtain the optimized domain adaptive neural network model. In the optimization process, the weight of signal features that are unique to the domain and have nothing to do with gestures is reduced, the accuracy of recognition of new domain datasets is improved, and the generalization of the system is improved.

进一步的，所述获取目标音频信号包括：Further, the obtaining the target audio signal includes:

获取由N个不同的麦克风设备分别采集到的N份所述目标音频信号，N为大于1的整数；Obtaining N copies of the target audio signal collected by N different microphone devices, where N is an integer greater than 1;

所述基于所述原始音频信号和所述目标音频信号进行信道估计，得到信道估计的特征数据包括：The channel estimation is performed based on the original audio signal and the target audio signal, and the characteristic data obtained for the channel estimation includes:

对于每份所述目标音频信号，均分别与所述原始音频信号进行信道估计，得到N份信道特征数据；For each piece of the target audio signal, perform channel estimation with the original audio signal respectively to obtain N pieces of channel feature data;

对所述N份信道特征数据进行整合，得到所述信道估计的特征数据。The N pieces of channel characteristic data are integrated to obtain the characteristic data of the channel estimation.

采用多个音频采集设备进行目标音频信号的多维度采集处理，可以获取更多信道估计的特征数据，提高手势识别结果。比如手势顺时针旋转和逆时针旋转时，两个手势呈镜像关系，如果采用单麦克风和单扬声器时，只能计算得到1维信道估计信息，无法识别具体手势时顺时针旋转还是逆时针旋转。但如果采用多个麦克风设备，就可以获取到更多的手势信息，从而判断出具体是顺时针旋转还是逆时针旋转。Using multiple audio acquisition devices to perform multi-dimensional acquisition and processing of target audio signals can obtain more feature data for channel estimation and improve gesture recognition results. For example, when the gesture is rotated clockwise and counterclockwise, the two gestures are in a mirror image relationship. If a single microphone and a single speaker are used, only 1-dimensional channel estimation information can be calculated, and it is impossible to identify whether the specific gesture is rotated clockwise or counterclockwise. However, if multiple microphone devices are used, more gesture information can be obtained, so as to determine whether to rotate clockwise or counterclockwise.

第二方面，本申请实施例提供了一种基于音频的手势识别装置，包括：In a second aspect, an embodiment of the present application provides an audio-based gesture recognition device, including:

信号获取模块，用于获取目标音频信号，所述目标音频信号为预设的原始音频信号在调制后，传播经过用户做出的目标手势后接收到的音频信号；a signal acquisition module, configured to acquire a target audio signal, where the target audio signal is an audio signal received after a preset original audio signal is modulated and propagated through a target gesture made by a user;

信道估计模块，用于基于所述原始音频信号和所述目标音频信号进行信道估计，得到信道估计的特征数据；a channel estimation module, configured to perform channel estimation based on the original audio signal and the target audio signal to obtain characteristic data of the channel estimation;

手势识别模块，用于对所述信道估计的特征数据进行识别，得到所述目标手势的识别结果。The gesture recognition module is used for recognizing the feature data of the channel estimation to obtain the recognition result of the target gesture.

第三方面，本申请实施例提供了一种终端设备，包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序，所述处理器执行所述计算机程序时实现上述第一方面所述的手势识别方法的步骤。In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, when the processor executes the computer program The steps of implementing the gesture recognition method described in the first aspect above.

第四方面，本申请实施例提供了一种计算机可读存储介质，所述计算机可读存储介质存储有计算机程序，所述计算机程序被处理器执行时实现如上述第一方面所述的手势识别方法的步骤。In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the gesture recognition according to the first aspect above is implemented steps of the method.

本申请实施例与现有技术相比存在的有益效果是：通过信道估计对原始音频信号和目标音频信号进行目标手势的特征数据提取，获得精确的信道估计的特征数据，以提高手势识别结果的准确率。Compared with the prior art, the embodiment of the present application has the beneficial effect of extracting the feature data of the target gesture on the original audio signal and the target audio signal through channel estimation, so as to obtain accurate channel estimation feature data, so as to improve the accuracy of the gesture recognition result. Accuracy.

附图说明Description of drawings

为了更清楚地说明本申请实施例中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本申请的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动性的前提下，还可以根据这些附图获得其他的附图。In order to illustrate the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings that need to be used in the description of the embodiments or the prior art. Obviously, the drawings in the following description are only for the present application. In some embodiments, for those of ordinary skill in the art, other drawings can also be obtained according to these drawings without any creative effort.

图1是本申请实施例提供的一种基于音频的手势识别方法的流程图；1 is a flowchart of an audio-based gesture recognition method provided by an embodiment of the present application;

图2是本申请实施例提供的一种信道估计方法的流程图；FIG. 2 is a flowchart of a channel estimation method provided by an embodiment of the present application;

图3是本申请实施例提供的信道估计的原理示意图；FIG. 3 is a schematic diagram of the principle of channel estimation provided by an embodiment of the present application;

图4是本申请实施例提供的经过极大似然估计算法后获得的信道估计的特征数据对应的热度图；4 is a heat map corresponding to characteristic data of channel estimation obtained after a maximum likelihood estimation algorithm provided by an embodiment of the present application;

图5是本申请实施例提供的静态消除后信道估计的特征数据对应的热度图和放大的单个手势的信道估计特征数据对应的热度图；5 is a heat map corresponding to feature data of channel estimation after static elimination and a heat map corresponding to channel estimation feature data of an enlarged single gesture provided by an embodiment of the present application;

图6是本申请实施例提供的一种域自适应神经网络模型结构示意图；6 is a schematic structural diagram of a domain adaptive neural network model provided by an embodiment of the present application;

图7是本申请实施例提供的调用神经网络模型阈值判定参考图；7 is a reference diagram for threshold determination of a calling neural network model provided by an embodiment of the present application;

图8是本申请实施例提供的一种基于音频的手势识别装置的结构示意图；8 is a schematic structural diagram of an audio-based gesture recognition device provided by an embodiment of the present application;

图9是本申请实施例提供的一种终端设备的示意图。FIG. 9 is a schematic diagram of a terminal device provided by an embodiment of the present application.

具体实施方式Detailed ways

以下描述中，为了说明而不是为了限定，提出了诸如特定装置结构、技术之类的具体细节，以便透彻理解本申请实施例。然而，本领域的技术人员应当清楚，在没有这些具体细节的其它实施例中也可以实现本申请。在其它情况中，省略对众所周知的装置、电路以及方法的详细说明，以免不必要的细节妨碍本申请的描述。In the following description, for the purpose of illustration rather than limitation, specific details such as specific device structures and technologies are set forth in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to those skilled in the art that the present application may be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

以下实施例中所使用的术语只是为了描述特定实施例的目的，而并非旨在作为对本申请的限制。如在本申请的说明书和所附权利要求书中所使用的那样，单数表达形式“一个”、“一种”、“所述”、“上述”、“该”和“这一”旨在也包括例如“一个或多个”这种表达形式，除非其上下文中明确地有相反指示。还应当理解，在本申请实施例中，“一个或多个”是指一个、两个或两个以上；“和/或”，描述关联对象的关联关系，表示可以存在三种关系；例如，A和/或B，可以表示：单独存在A，同时存在A和B，单独存在B的情况，其中A、B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。The terms used in the following embodiments are for the purpose of describing particular embodiments only, and are not intended to be limitations of the present application. As used in the specification of this application and the appended claims, the singular expressions "a," "an," "the," "above," "the," and "the" are intended to also Expressions such as "one or more" are included unless the context clearly dictates otherwise. It should also be understood that, in this embodiment of the present application, "one or more" refers to one, two or more; "and/or", which describes the association relationship of associated objects, indicates that there may be three kinds of relationships; for example, A and/or B can mean that A exists alone, A and B exist simultaneously, and B exists independently, wherein A and B can be singular or plural. The character "/" generally indicates that the associated objects are an "or" relationship.

本申请实施例提供的基于音频的手势识别方法可以基于手机、平板电脑、医疗设备、可穿戴设备、车载设备、增强现实(augmented reality，AR)/虚拟现实(virtualreality，VR)设备、笔记本电脑、超级移动个人计算机(ultra-mobile personal computer，UMPC)、上网本、个人数字助理(personal digital assistant，PDA)等具有扬声器和麦克风的终端设备或者服务器实现，本申请实施例对终端设备和服务器的具体类型不作任何限制。The audio-based gesture recognition method provided by the embodiments of the present application may be based on mobile phones, tablet computers, medical devices, wearable devices, in-vehicle devices, augmented reality (AR)/virtual reality (VR) devices, laptops, The implementation of terminal devices or servers with speakers and microphones such as ultra-mobile personal computers (UMPCs), netbooks, personal digital assistants (PDAs), etc. No restrictions apply.

巴克码是50年代初，R.H.巴克提出的一种具有特殊规律的二进制码组，具有较好的自相关性，因此应用到信道估计中对信道的特征数据提取效果也更好。将巴克码经过拉伸、平滑、补领、扩增后，生成单个周期长度大于300比特的基频信号作为原始音频信号即可获得很好的分辨率，运用到信道估计中，最终得到的信道估计的特征数据也更准确。虽然周期长度越长分别率越高，但是周期过长会造成系统延迟。经过实验反复验证，较优的选择是单个周期长度为480比特的基频信号，既能兼顾信号的分辨率，也可以避免系统延迟的现象。为了方便理解，后续的例子会直接用单个周期长度为480比特的原始音频信号作为说明。Barker code is a kind of binary code group with special regularity proposed by R.H. Barker in the early 1950s. It has good autocorrelation, so it is also better for channel feature data extraction when applied to channel estimation. After the Barker code is stretched, smoothed, supplemented, and amplified, a fundamental frequency signal with a single cycle length greater than 300 bits is generated as the original audio signal to obtain a good resolution. The estimated feature data is also more accurate. Although the longer the cycle length, the higher the resolution, but too long the cycle will cause system delay. After repeated experiments and verification, the better choice is the fundamental frequency signal with a single cycle length of 480 bits, which can not only take into account the resolution of the signal, but also avoid the phenomenon of system delay. For the convenience of understanding, the following examples will directly use the original audio signal with a single cycle length of 480 bits as an illustration.

图1示出了本申请提供的一种基于音频的手势识别方法的流程图，在一个实施例中，所述手势识别方法包括：FIG. 1 shows a flowchart of an audio-based gesture recognition method provided by the present application. In one embodiment, the gesture recognition method includes:

101、获取目标音频信号，所述目标音频信号为预设的原始音频信号在调制后，传播经过用户做出的目标手势后接收到的音频信号；101. Obtain a target audio signal, which is an audio signal received after a preset original audio signal is modulated and propagated through a target gesture made by a user;

首先获取目标音频信号，所述目标音频信号的获取是采用预先设置的原始音频信号，调制后通过扬声器进行播放，调制的原始音频信号在传播过程中，经过用户做出的目标手势后，会改变传播的音频信号的信道特征数据，此时，麦克风设备就可以接受到含有目标手势特征数据的目标音频信号。原始音频信号调制后才播放是因为原始音频信号即基频信号，直接播放非常刺耳；且信号与系统理论指出：信号在基频计算出的信道与信号在载频经过的信道一致，因此可以使用解调得到的目标基频信号和原始音频信号进行信道估计，得到的信道估计结果与上载频后通过的信道估计结果一致，准确率高。所以在播放的时候会对原始音频信号进行调制，具体的在本实施例中是通过对原始音频信号进行载波，达到让人耳无法识别的波段再进行播放，后续再将目标音频信号解调成基频信号进行信道估计。First, the target audio signal is obtained. The target audio signal is obtained by using a preset original audio signal, which is modulated and played through a speaker. During the propagation process, the modulated original audio signal will change after the target gesture made by the user. Channel characteristic data of the propagated audio signal, at this time, the microphone device can receive the target audio signal containing the target gesture characteristic data. The original audio signal is played after modulation because the original audio signal is the fundamental frequency signal, which is very harsh to play directly; and the signal and system theory points out that the channel calculated by the signal at the fundamental frequency is consistent with the channel that the signal passes through at the carrier frequency, so it can be used Channel estimation is performed on the demodulated target fundamental frequency signal and the original audio signal, and the obtained channel estimation result is consistent with the channel estimation result passed after uploading the frequency, and the accuracy rate is high. Therefore, the original audio signal is modulated during playback. Specifically, in this embodiment, the original audio signal is subjected to carrier wave to reach a frequency band that cannot be recognized by human ears, and then the playback is performed, and then the target audio signal is demodulated into The baseband signal is used for channel estimation.

102、基于所述原始音频信号和所述目标音频信号进行信道估计，得到信道估计的特征数据；102. Perform channel estimation based on the original audio signal and the target audio signal to obtain characteristic data of the channel estimation;

基于音频的手势识别关键点在于需要精确的提取出可进行识别的特征数据。在本实施例中采用分辨率高的原始音频信号进行信道估计，可以提升信道估计的特征数据的提取准确率。但是目标音频信号是原始音频信号经过调制后的音频信号，因此在进行信道估计之前，需要对目标音频信号进行解调，获得其中的基频信号，进一步参与到信道估计中获得信道估计的特征数据。The key point of audio-based gesture recognition lies in the need to accurately extract feature data that can be recognized. In this embodiment, the original audio signal with high resolution is used for channel estimation, which can improve the extraction accuracy of the feature data of the channel estimation. However, the target audio signal is the modulated audio signal of the original audio signal. Therefore, before performing channel estimation, the target audio signal needs to be demodulated to obtain the fundamental frequency signal, and further participate in the channel estimation to obtain the characteristic data of the channel estimation. .

图2示出了本申请提供的一种信道估计方法的流程图，图3示出了本申请提供的一种信道估计原理示意图。FIG. 2 shows a flowchart of a channel estimation method provided by the present application, and FIG. 3 is a schematic diagram of a channel estimation principle provided by the present application.

参考图2和图3，在一个实施例中，基于所述原始音频信号和所述目标音频信号进行信道估计的步骤包括：2 and 3, in one embodiment, the step of performing channel estimation based on the original audio signal and the target audio signal includes:

201、对所述目标音频信号进行解调处理，得到目标基频信号；201. Perform demodulation processing on the target audio signal to obtain a target fundamental frequency signal;

信道估计是基于基频信号进行的，因此本步骤需要将获得的目标音频信号进行解调处理，得到目标基频信号。具体的解调步骤包括：The channel estimation is performed based on the fundamental frequency signal, so in this step, the obtained target audio signal needs to be demodulated to obtain the target fundamental frequency signal. The specific demodulation steps include:

首先对所述目标音频信号进行降载波和IQ分解，得到降载波信号的实部信号和虚部信号；具体的，将接收的目标音频信号降载波获得降载波信号，然后进行IQ分解，即将目标音频信号分别与Acos(2πf_ct)和-Asin(2πf_ct)相乘，其中A为幅度。分解后即可获得降载波信号的实部信号和虚部信号，需要说明的是，降载波和IQ分解没有特定的先后顺序，可以先进性降载波再进行IQ分解，也可以先进行IQ分解再进行降载波，最终结果不会因为处理顺序发生变化。由于获取到的目标音频信号可能会有噪音干扰，因此将获得降载波信号的实部信号和虚部信号后，通过低通滤波器，实现去噪，提高目标基频信号的分辨率。再将去噪后的实部信号和虚部信号合并成复数形式，所述复数形式的基频信号即为目标基频信号。到此步骤，已经实现了将获得的目标音频信号解调并去噪，生成目标基频信号，具备了参与信道估计的初步条件。但此时的目标基频信号是一段很长的信号，直接用于信道估计所获得的信道估计的特征数据准确率不高，因此还要进行分段处理。First, the target audio signal is subjected to down-carrier and IQ decomposition to obtain the real part signal and imaginary part of the down-carrier signal; specifically, the received target audio signal is sub-carrier to obtain a down-carrier signal, and then IQ decomposition is performed, that is, the target audio signal is decomposed by IQ. The audio signal is multiplied by Acos(2πf ct) and _-Asin (2πf ct ₎ , respectively, where A is the amplitude. After the decomposition, the real part signal and imaginary part of the down-carrier signal can be obtained. It should be noted that there is no specific sequence for down-carrier and IQ decomposition. The carrier can be reduced first and then IQ decomposition is performed, or IQ decomposition can be performed first and then IQ decomposition. The carrier is dropped, and the final result will not change due to the processing order. Since the obtained target audio signal may have noise interference, after obtaining the real part signal and imaginary part signal of the reduced carrier signal, the low-pass filter is used to achieve denoising and improve the resolution of the target fundamental frequency signal. Then, the denoised real part signal and the imaginary part signal are combined into a complex number form, and the fundamental frequency signal in the complex number form is the target fundamental frequency signal. At this step, the obtained target audio signal has been demodulated and de-noised to generate the target fundamental frequency signal, and the preliminary conditions for participating in channel estimation have been met. However, the target fundamental frequency signal at this time is a very long signal, and the characteristic data of channel estimation obtained directly for channel estimation is not accurate, so segmentation processing is required.

202、对所述目标基频信号进行分段，得到多个目标信号片段，每个所述目标信号片段的长度均和所述原始音频信号的周期相同；202. Segment the target fundamental frequency signal to obtain a plurality of target signal segments, and the length of each target signal segment is the same as the period of the original audio signal;

原始音频是周期音频信号，因此获得的目标基频信号也有周期的特性。在分段时，可以利用周期的特性，将所述目标基频信号以一个周期为一个片段进行划分切割获得多个目标信号片段，这样获得的目标信号片段的周期长度和原始音频的周期长度是相同的；当然也可以提取出一个周期的原始音频信号，利用单个周期的原始音频信号和所述目标基频信号进行对齐后切割分段，也能够得到和原始音频的周期长度相同的多个目标信号片段。在获得多个目标信号片段之后，可以采用这些信号片段进行信道估计输出信道估计的特征数据。The original audio is a periodic audio signal, so the obtained target fundamental frequency signal also has periodic characteristics. During segmentation, the characteristics of the cycle can be used to divide and cut the target fundamental frequency signal with one cycle as a segment to obtain a plurality of target signal segments. The cycle length of the obtained target signal segment and the cycle length of the original audio are The same; of course, one cycle of the original audio signal can also be extracted, and the single cycle of the original audio signal and the target fundamental frequency signal can be used to align and then cut into segments, and multiple targets with the same cycle length as the original audio can also be obtained. signal fragment. After obtaining a plurality of target signal segments, these signal segments can be used to perform channel estimation and output channel estimation feature data.

203、对于每个所述目标信号片段，均分别与所述原始音频信号中一个周期的信号片段进行信道估计，得到各自的信道特征数据；203. For each of the target signal segments, perform channel estimation with a period of signal segments in the original audio signal, respectively, to obtain respective channel characteristic data;

在步骤202中进行基频信号的分段就是为了在信道估计过程中和原始音频信号对齐，从而获得更准确的信道估计的特征数据。因此在本步骤中，将单个周期的原始音频信号分别与每个目标信号片段进行信道估计，周期长度相等，无需其他处理直接进行信道估计，得到精确的信道估计结果。但此时，获得的是每个信号片段对应的信道特征数据，对应识别出来可能只是目标手势中某个手势的一部分，因此还需要将多个信道特征数据整合到一起，获得连贯的目标手势对应的信道估计的特征数据后，再进行识别，才可以得到准确的目标手势的识别结果。The purpose of segmenting the baseband signal in step 202 is to align with the original audio signal in the channel estimation process, so as to obtain more accurate channel estimation feature data. Therefore, in this step, channel estimation is performed on the original audio signal of a single cycle and each target signal segment respectively, and the cycle lengths are equal. Channel estimation is performed directly without other processing, and an accurate channel estimation result is obtained. However, at this time, the channel feature data corresponding to each signal segment is obtained, and the corresponding recognition may be only a part of a gesture in the target gesture. Therefore, it is necessary to integrate multiple channel feature data together to obtain a coherent target gesture correspondence. After the characteristic data of the channel estimation is obtained, and then the recognition is performed, the accurate recognition result of the target gesture can be obtained.

204、将各个所述目标信号片段的信道特征数据整合，得到所述信道估计的特征数据。204. Integrate the channel characteristic data of each of the target signal segments to obtain the characteristic data of the channel estimation.

整合后的信道特征数据即所述信道估计的特征数据，作为输入进行识别，已经可以输出连贯的目标手势的识别结果。The integrated channel feature data, that is, the feature data of the channel estimation, is used as input for recognition, and a consistent target gesture recognition result can already be output.

上述内容已经清楚的讲解了为了信道估计需要做的数据前处理，接下来将对如何对目标基频信号进行信道估计展开说明。The above content has clearly explained the data preprocessing that needs to be done for channel estimation. Next, it will be explained how to perform channel estimation on the target baseband signal.

信道估计的基本公式为：The basic formula of channel estimation is:

R(n)＝S[n]*h[n]R(n)=S[n]*h[n]

其中*表示卷积，R(n)和S[n]分别表示接收到并解调后的目标基频信号，发射的原始基频信号，h[n]为信道响应，算法的目的是通过R(n)和S[n]求得h[n]。Where * represents convolution, R(n) and S[n] represent the received and demodulated target fundamental frequency signal respectively, the original fundamental frequency signal transmitted, h[n] is the channel response, the purpose of the algorithm is to pass R (n) and S[n] to obtain h[n].

直接进行信道估计算法计算复杂度较高，因此在计算的过程中需要计算的复杂度。因此，本发明的信道估计的求解过程中，采用极大似然估计(Least Square)算法简化计算复杂度。在LS算法中，给定发射的原始音频信号S与接收到的目标音频信号R，计算信道h。根据目标音频信号的处理知，单个周期的信号长度设为d，由此得到向量形式发射信号S和向量形式的接收信号R：The calculation complexity of the channel estimation algorithm is relatively high, so the calculation complexity is required in the calculation process. Therefore, in the solution process of the channel estimation of the present invention, the maximum likelihood estimation (Least Square) algorithm is used to simplify the computational complexity. In the LS algorithm, given the transmitted original audio signal S and the received target audio signal R, the channel h is calculated. According to the processing of the target audio signal, the signal length of a single cycle is set to d, so that the vector form of the transmitted signal S and the vector form of the received signal R are obtained:

S＝{s₁，s₂，…，s_d-1，s_d}S={s ₁ , s ₂ , ..., s _d-1 , s _d }

R＝{r₁，r₂，…，r_d-1，r_d}R={r ₁ , r ₂ ,..., r _d-1 , r _d }

进一步构建极大似然估计算法矩阵：Further construct the maximum likelihood estimation algorithm matrix:

其中小标l与p满足公式l+p＝d，直接计算矩阵h的算法复杂度为O(l³+lp²)，采用极大似然估计算法，直接使用

MR求解出信道向量，其中(M^TM)^-1M的数值可以预先通过原始音频信号S求解得出，大大简化了算法的复杂度。Among them, the small scale l and p satisfy the formula l+p=d, and the algorithm complexity of directly calculating the matrix h is O(l ³ +lp ² ). The maximum likelihood estimation algorithm is used to directly use

MR solves the channel vector, wherein the value of (M ^T M) ^-1 M can be obtained by solving the original audio signal S in advance, which greatly simplifies the complexity of the algorithm.

举例说明，对于输入的每个周期480比特的原始音频信号，使用极大似然估计算法进行计算，可以求解出长度为l＝140比特的信道估计数据。通常单个手势运动的时间在0.3秒至1.4秒之间，故系统选择1.4秒为窗口长度，此时可计算出一个窗口内包含140组信道估计数据。这构成用于信道估计的特征数据，大小为140*140。显然，这样计算比直接求解信道估计公式简单很多。For example, for the input original audio signal of 480 bits per cycle, the maximum likelihood estimation algorithm is used for calculation, and the channel estimation data with a length of 1=140 bits can be obtained. Usually the time of a single gesture movement is between 0.3 seconds and 1.4 seconds, so the system selects 1.4 seconds as the window length, and at this time, it can be calculated that one window contains 140 sets of channel estimation data. This constitutes the feature data for channel estimation, with a size of 140*140. Obviously, this calculation is much simpler than directly solving the channel estimation formula.

在实际应用过程中，不同的手势幅度有大有小，当手掌在做细粒度手势时，信道变化较小。若希望清楚地追踪到信道变化，可以采用下述方式消除反射信号中的静态反射部分。In the actual application process, different gesture amplitudes are large and small. When the palm is making fine-grained gestures, the channel change is small. If it is desired to clearly track channel changes, the static reflections in the reflected signal can be eliminated in the following manner.

在一个实施例中，静态消除的具体步骤包括：每个所述目标信号片段均带有时间戳，将各个所述目标信号片段的信道特征数据分别表示为向量的形式，得到各个所述目标信号片段的特征向量；将各个所述目标信号片段的特征向量按照各自对应的时间戳大小进行排列，构成特征矩阵；对于所述特征矩阵中的每一列，均将各自包含的各个元素值分别减去前一列包含的各个对应元素值，得到静态消除后的所述特征矩阵；将静态消除后的所述特征矩阵确定为所述信道估计的特征数据。对静态消除后信道估计的特征数据进行识别，即使是细粒度手势也能够给出精确的结果。In one embodiment, the specific step of static elimination includes: each of the target signal segments has a time stamp, and the channel feature data of each of the target signal segments is represented in the form of a vector, to obtain each of the target signal segments. The feature vector of the segment; the feature vectors of each of the target signal segments are arranged according to their respective timestamps to form a feature matrix; for each column in the feature matrix, the respective element values contained in the respective elements are subtracted. Each corresponding element value contained in the previous column is used to obtain the feature matrix after static elimination; the feature matrix after static elimination is determined as the feature data of the channel estimation. Identifying the feature data of channel estimation after static elimination can give accurate results even for fine-grained gestures.

以一组手势信道估计的特征数据为例，静态消除的效果参见图4和图5，其中图4示出了经过极大似然估计的信道估计算法后，得到的信道估计的特征数据对应的热度图，从图中可以看到，有很多干扰信号掺杂在其中，这些信号绝大部分都是由于静态反射造成；再观察图5，图5示出了静态消除后信道估计的特征数据的热度图，显然，此时干扰信号已经很少了，其中箭头指出来的放大图是一个手势产生的信道估计的特征数据的热度图。从两张图对比可以看出，即使是细粒度的手势，通过静态消除的处理，也可以精确获取。Taking a set of feature data of gesture channel estimation as an example, the effect of static elimination is shown in Figure 4 and Figure 5, where Figure 4 shows the channel estimation feature data corresponding to the obtained channel estimation algorithm after maximum likelihood estimation. The heat map, it can be seen from the figure that there are many interfering signals doped in it, most of these signals are caused by static reflection; then look at Figure 5, which shows the characteristic data of channel estimation after static elimination. The heat map, obviously, there are very few interference signals at this time, and the enlarged image pointed out by the arrow is a heat map of the feature data of channel estimation generated by a gesture. It can be seen from the comparison of the two figures that even fine-grained gestures can be accurately obtained through static elimination.

至此，已经完成了基于音频的手势识别的目标手势的特征数据提取部分：即对目标音频信号进行解调、降噪等处理获得目标基频信号；利用原始音频的周期特性，将目标基频信号分割成和单个周期原始音频信号周期长度相等的多个目标信号片段；利用单个周期的原始音频信号分别与每个目标信号片段做信号估计，得到各自的信道特征数据；将这些信道特征数据按照时序进行排列形成矩阵后，矩阵中每一列减去前一列包含的各个对应的元素值，获得静态消除后的特征矩阵，也即信道估计的特征数据。采用基频信号作为信道估计的基础，所获得的信道估计的特征数据精度已经显著提高；再经过静态消除的处理，即使是细粒度手势也能够准确提取出对应的信道估计的特征数据作为后续识别的输入，手势识别结果也更准确。So far, the feature data extraction part of the target gesture of audio-based gesture recognition has been completed: that is, the target audio signal is processed by demodulation, noise reduction, etc. to obtain the target fundamental frequency signal; using the periodic characteristics of the original audio, the target fundamental frequency signal Divide into multiple target signal segments with the same length as a single cycle of the original audio signal; use the single cycle of the original audio signal to do signal estimation with each target signal segment respectively to obtain their respective channel feature data; these channel feature data according to the time series After arranging to form a matrix, each column in the matrix is subtracted from each corresponding element value contained in the previous column to obtain a statically eliminated feature matrix, that is, the feature data of channel estimation. Using the fundamental frequency signal as the basis for channel estimation, the accuracy of the obtained channel estimation feature data has been significantly improved; after static elimination, even fine-grained gestures can accurately extract the corresponding channel estimation feature data for subsequent identification. The input, gesture recognition results are also more accurate.

103、对所述信道估计的特征数据进行识别，得到所述目标手势的识别结果。103. Identify the feature data of the channel estimation to obtain a recognition result of the target gesture.

对所述信道估计的特征数据进行识别，得到所述目标手势的识别结果。现有技术中，基础卷积神经网络模型可以实现对于预设手势的识别，但在实验观察中发现，不同测试人员对预设的不同手势动作习惯并不一致，如对上下挥手动作的手势，不同被测人员的手势动作速度快慢、幅度大小可能不同。因此利用基础神经网络模型进行手势识别，尽管当给出足够多的训练数据集的情况下，神经网络可以正确识别手势，但这种现象将影响系统泛化时的识别准确率。Identify the feature data of the channel estimation to obtain the recognition result of the target gesture. In the prior art, the basic convolutional neural network model can realize the recognition of preset gestures, but in the experimental observation, it is found that different testers have different habits of different preset gestures, such as gestures of waving up and down. The speed and magnitude of the gestures of the tested persons may be different. Therefore, the basic neural network model is used for gesture recognition. Although the neural network can correctly recognize gestures when given enough training data sets, this phenomenon will affect the recognition accuracy when the system is generalized.

为了解决这一实际问题，本发明将域自适应(domain adaptation)网络结构应用到手势识别算法中，域(domain)可以定义成在识别算法中和产生数据集的个体相关，和识别目的无关的集合。比如在手势识别系统中，不同测试者的手势习惯带来的特征可以定义为域，测试者所处的不同环境引入的特征数据也可以定义为域。域自适应网络识别的主要功能在于：当用大量标注了域也标注了识别结果的数据和大量标注了域但未标注识别结果的数据集时，可以实现对域无关的特征值的提取，优化识别准确率。In order to solve this practical problem, the present invention applies the domain adaptation network structure to the gesture recognition algorithm. The domain can be defined as the individual related to the data set generated in the recognition algorithm and irrelevant to the recognition purpose. gather. For example, in a gesture recognition system, the features brought by the gesture habits of different testers can be defined as domains, and the feature data introduced by different environments where the testers are located can also be defined as domains. The main function of domain adaptive network recognition is: when using a large number of data marked with the domain and the recognition result and a large number of data sets marked with the domain but not marked with the recognition result, it can realize the extraction of domain-independent eigenvalues and optimize them. recognition accuracy.

在一个实施例中，所述对所述信道估计的特征数据进行识别，得到所述目标手势的识别结果包括：In one embodiment, the identifying the feature data of the channel estimation to obtain the identification result of the target gesture includes:

采用预先构建的域自适应神经网络模型对所述信道估计的特征数据进行识别，得到所述目标手势的识别结果；具体的，首先将所述信道估计的特征数据输入到预先训练的域自适应网络模型中，通过神经卷积网络提取出所述目标手势的特征；通过全连接网络层与激活函数识别所述目标手势的特征，输出所述目标手势的识别结果；或获取所述目标手势的识别结果，计算所述目标手势的识别结果的准确率：当所述准确率落在预设准确率阈值范围内，输出所述目标手势的识别结果。当所述准确率没落在预设准确率阈值范围内，重复提取识别所述目标手势的特征进行识别，直至所述准确率落在预设准确率阈值范围内输出符合要求的目标手势的识别结果。A pre-built domain adaptive neural network model is used to identify the feature data of the channel estimation, and the recognition result of the target gesture is obtained; specifically, the feature data of the channel estimation is firstly input into the pre-trained domain adaptive In the network model, the characteristics of the target gesture are extracted through a neural convolutional network; the characteristics of the target gesture are identified through a fully connected network layer and an activation function, and the recognition result of the target gesture is output; or the target gesture is obtained. For the recognition result, calculate the accuracy rate of the recognition result of the target gesture: when the accuracy rate falls within the range of a preset accuracy rate threshold, output the recognition result of the target gesture. When the accuracy rate falls within the preset accuracy rate threshold range, repeatedly extracting and recognizing the features of the target gesture for recognition, until the accuracy rate falls within the preset accuracy rate threshold value range, output the recognition result of the target gesture that meets the requirements .

除了预设准确率阈值范围作为目标提高识别的准确率，最根本的方法还是要构建出收敛的域自适应神经网络模型。图6示出了本申请提供的域自适应神经网络模型的结构。In addition to the preset accuracy threshold range as the goal to improve the accuracy of recognition, the most fundamental method is to build a convergent domain adaptive neural network model. FIG. 6 shows the structure of the domain adaptive neural network model provided by this application.

参考图6，在一个实施例中，所述域自适应神经网络模型通过以下步骤构建：6, in one embodiment, the domain adaptive neural network model is constructed by the following steps:

在一个实施例中，所述以所述第一训练数据集和所述第二训练数据集作为训练集，训练得到所述域自适应神经网络模型包括：In one embodiment, using the first training data set and the second training data set as training sets, the training to obtain the domain adaptive neural network model includes:

将所述第一训练数据集输入初始域自适应神经网络模型，得到所述第一训练数据集的手势预测结果和域预测结果；根据所述第一训练数据集已分配的域标签和手势标签，以及所述第一训练数据集的手势预测结果和域预测结果，计算得到第一交叉熵E_label；将所述第二训练数据集输入所述初始域自适应神经网络模型，得到所述第二训练数据集的手势预测结果和域预测结果；根据所述第二训练数据集已分配的域标签，以及所述第二训练数据集的手势预测结果和域预测结果，计算得到第二交叉熵E_unlab。Input the first training data set into the initial domain adaptive neural network model to obtain the gesture prediction result and domain prediction result of the first training data set; according to the assigned domain label and gesture label of the first training data set , and the gesture prediction result and domain prediction result of the first training data set, the first cross entropy E _label is obtained by calculation; the second training data set is input into the initial domain adaptive neural network model to obtain the first cross entropy E label ; 2. The gesture prediction result and the domain prediction result of the second training data set; according to the assigned domain label of the second training data set, and the gesture prediction result and the domain prediction result of the second training data set, the second cross entropy is obtained by calculating _Eunlab .

根据所述第一训练数据集已分配的域标签、所述第二训练数据集已分配的域标签、所述所述第一训练数据集的域预测结果以及所述第二训练数据集的域预测结果，计算得到域交叉熵E_s；根据所述第一交叉熵、所述第二交叉熵和所述域交叉熵计算所述初始域自适应神经网络模型的交叉熵E＝E_label+αE_unlab-βE_s,其中α与β为域自适应神经网络模型的模型参数；以所述交叉熵E取最小值为目标，对所述初始域自适应神经网络模型的模型参数进行优化，得到所述域自适应神经网络模型。Based on the assigned domain labels of the first training dataset, the assigned domain labels of the second training dataset, the domain prediction results of the first training dataset and the domain of the second training dataset The prediction result is calculated to obtain the domain cross entropy E _s ; according to the first cross entropy, the second cross entropy and the domain cross entropy, the cross entropy of the initial domain adaptive neural network model is calculated E=E _label +αE _unlab- βE _s , wherein α and β are the model parameters of the domain adaptive neural network model; taking the minimum value of the cross entropy E as the target, the model parameters of the initial domain adaptive neural network model are optimized to obtain the obtained The domain adaptive neural network model.

本发明的目的是手势识别，对训练数据集均进行域标签的分配，其中一个训练数据集进行手势标签分配，另一个训练数据集不分配手势标签，是为了通过两个训练数据集降低和手势无关的特征对识别结果的影响。在训练时，由于训练样本偏少，对于新用户和新的环境都可以视为新的域，因此每一位新用户的数据，我们可以当成分配了域标签(新域)，但是未标注手势的第二数据集。The purpose of the present invention is gesture recognition, and all training data sets are assigned domain labels, one of which is assigned gesture labels, and the other is not assigned gesture labels. The influence of irrelevant features on the recognition results. During training, due to the small number of training samples, both new users and new environments can be regarded as new domains, so the data of each new user can be regarded as assigned domain labels (new domains), but no gestures are marked the second dataset.

然后通过预先构建的域自适应神经网络模型进行特征提取，进行手势识别、域识别，分别构建第一交叉熵、第二交叉熵和域交叉熵。进一步构建出域自适应网络模型的交叉熵。通过学习模型，减小域独有的特征数据的权重。此模型有两个优点，首先提高和域无关的特征数据集权重，增加准确度。其次，对于新用户的数据，可以直接输入网络进行训练，得到的新的模型，当用于另外的新用户时，会比当前模型更好。这样，即使当用户首次使用系统，系统无法预先知道用户的个人动作习惯以及所处环境，但随着不断有新用户未标注的手势数据加入，网络将学习出所有用户都适配的手势识别特征，同时减小环境以及个人手势习惯等域产生的特征数据对识别准确率影响。Then, feature extraction is performed through the pre-built domain adaptive neural network model, gesture recognition and domain recognition are performed, and the first cross entropy, the second cross entropy and the domain cross entropy are respectively constructed. The cross entropy of the domain adaptive network model is further constructed. By learning the model, the weight of domain-specific feature data is reduced. This model has two advantages. First, it increases the weight of the domain-independent feature dataset and increases the accuracy. Secondly, for the data of new users, it can be directly input into the network for training, and the new model obtained will be better than the current model when used for other new users. In this way, even when the user uses the system for the first time, the system cannot know the user's personal behavior and environment in advance, but with the continuous addition of unlabeled gesture data from new users, the network will learn gesture recognition features that are suitable for all users. , while reducing the influence of the feature data generated by the environment and personal gesture habits on the recognition accuracy.

具体的，要实现第一交叉熵和第二交叉熵的计算，步骤可以为：首先使用5层卷积神经网络进行特征提取，即将“输入→[卷积→ReLu→池化]*N→全连接层→输出”的结构中N设为5，每一层卷积层的结构类似，具有一层卷积层和一层池化层。第一、二、三层的卷积核数分别设置为32、64、128、256、512，卷积的大小分别为[10*10]，[5*5]，[5*5]，[5*5]、[2*2]，最终从输入的两个训练数据集中将特征提取出来，定义δ为相关的参数，X为输入训练数据集，可定义提取的特征F，具体关系式为：Specifically, to realize the calculation of the first cross-entropy and the second cross-entropy, the steps can be as follows: First, use a 5-layer convolutional neural network to perform feature extraction, that is, "input→[convolution→ReLu→pooling]*N→full In the structure of "connection layer→output", N is set to 5, and the structure of each convolutional layer is similar, with one convolutional layer and one pooling layer. The number of convolution kernels of the first, second, and third layers are set to 32, 64, 128, 256, and 512, respectively, and the convolution sizes are [10*10], [5*5], [5*5], [ 5*5], [2*2], and finally extract the features from the two input training data sets, define δ as the relevant parameter, X is the input training data set, and the extracted feature F can be defined, and the specific relationship is :

F＝CNN(X，δ)F=CNN(X,δ)

将提取出的特征F，通过全连接层得到参数Z，定义W_a1与b_a1为全连接网络层的参数，通过神经网络的学习过程求解式为：The extracted feature F is obtained through the fully connected layer to obtain the parameter Z, and W _a1 and b _a1 are defined as the parameters of the fully connected network layer, and the formula is solved through the learning process of the neural network:

Z＝softplus(W_a1F+b_a1)Z=softplus(W _a1 F+b _a1 )

全连接的结果Z通过softmax层获得预测结果

定义W_a2与b_a2为预测层中的参数，通过神经网络学习过程求解，得到下述公式：The fully connected result Z obtains the prediction result through the softmax layer

Define W _a2 and b _a2 as the parameters in the prediction layer, and solve them through the neural network learning process, and obtain the following formula:

其中，

表示标注了手势的第一训练数据集和未标注手势的第二训练数据集提取的预测结果集合，依据训练数据集拆分

得到

和

分别表示第一训练数据集提取得到的手势预测结果，和第二训练数据集提取得到的手势预测结果，同理，划分输入数据为X^label和X^unlab。in,

Represents the set of prediction results extracted from the first training dataset with gestures and the second training dataset without gestures, split according to the training dataset

get

and

respectively represent the gesture prediction result extracted from the first training data set and the gesture prediction result extracted from the second training data set. Similarly, the input data is divided into X ^label and X ^unlab .

已知真实的标签y^label和预测的标签

可以计算得到交叉熵(cross entropy)E_label，Knowing the true label y ^label and the predicted label

The cross entropy (cross entropy) E _label can be calculated,

其中|X^label|表示标记手势的第一训练数据集的个数，N表示需要识别的手势动作个数。Where |X ^label | represents the number of the first training data set for marking gestures, and N represents the number of gesture actions to be recognized.

对于未标记真实手势的第二训练数据集，只使用

预测的手势结果计算交叉熵E_unlab，For the second training dataset, which is not labeled with real gestures, only use

The predicted gesture result calculates the cross entropy E _unlab ,

|X^unlab|表示未标记手势的第二训练数据集的个数。|X ^unlab | represents the number of unlabeled gestures in the second training dataset.

完成了跟手势有关的交叉熵的计算，接下来对域相关的交叉熵进行计算。具体步骤包括，域辨别网络部分的输入S定义为特征提取部分和手势识别结果的串联，After completing the calculation of the gesture-related cross-entropy, the domain-related cross-entropy is calculated next. The specific steps include that the input S of the domain discrimination network part is defined as the concatenation of the feature extraction part and the gesture recognition result,

其中F是特征提出部分的结果，其包含标记手势的第一训练数据集和未标记手势的第二训练数据集；

是识别部分的结果，⊙表示串联。参数F表示训练数据集中标注了手势的数据和未标注手势的数据的集合，域辨别网络由两层全连接层构成，定义第一层全连接层输出为T:where F is the result of the feature proposal part, which contains a first training dataset of labeled gestures and a second training dataset of unlabeled gestures;

is the result of the recognition part, and ⊙ means concatenation. The parameter F represents the set of data marked with gestures and data not marked with gestures in the training data set. The domain discrimination network consists of two fully connected layers, and the output of the first fully connected layer is defined as T:

T＝Softplus(W_s1S+B_s1)T=Softplus(W _s1 S+B _s1 )

第二层全连接网络层的输出

其中W_s1，W_s2和B_s1，B_s2为全连接网络需要学习的参数，W是权值矩阵，B是偏置，下标代表层数。在域识别网络结构中，定义交叉熵E_s作为损失函数，其中S_ij是one-hot形式的域标签向量，|X|表示所有属于S的样本个数。The output of the second fully connected network layer

Among them, W _s1 , W _s2 and B _s1 , B _s2 are the parameters to be learned by the fully connected network, W is the weight matrix, B is the bias, and the subscript represents the number of layers. In the domain recognition network structure, the cross-entropy E _s is defined as the loss function, where S _ij is the domain label vector in one-hot form, and |X| represents the number of all samples belonging to S.

为了方便理解，举例说明。训练时，假设有10000组140*140分配域标签和手势标签的训练数据构成第一训练数据集，1000组140*140的分配了域标签未分配手势标签的训练数据构成第二训练数据集。首先将两个训练数据集都通过特征提取的卷积神经网络，在卷积层后面加全连接层和激活函数，将得到预测的结果。For ease of understanding, examples are given. During training, it is assumed that there are 10,000 groups of 140*140 training data assigned domain labels and gesture labels to form the first training data set, and 1000 groups of 140*140 training data that are assigned domain labels but not assigned gesture labels constitute the second training data set. First, the two training data sets are passed through the convolutional neural network for feature extraction, and the fully connected layer and the activation function are added after the convolutional layer, and the predicted result will be obtained.

需要注意的是，对于每一组输入数据，都可以通过模型得到标注预测的结果。那么对于第一训练数据集，可以获得10000个手势预测结果，所述10000个手势预测结果和原有的10000个实际手势可以计算第一交叉熵E_label。对于第二训练数据集，仅可以获得1000个手势预测结果，由于没有真实手势，所以假设1000个手势预测结果为原有的1000个实际手势，1000个手势预测结果和1000个手势预测结果可以计算出第二交叉熵函数E_label。It should be noted that for each set of input data, the result of annotation prediction can be obtained through the model. Then, for the first training data set, 10,000 gesture prediction results can be obtained, and the first cross entropy E _label can be calculated from the 10,000 gesture prediction results and the original 10,000 actual gestures. For the second training data set, only 1000 gesture prediction results can be obtained. Since there are no real gestures, assuming 1000 gesture prediction results are the original 1000 actual gestures, 1000 gesture prediction results and 1000 gesture prediction results can be calculated The second cross entropy function E _label is obtained.

其中经过卷积神经网络进行提取得到的特征是flatten以后的数据，即一维数据，这里的flatten可以理解为降维，假设池化以后还剩2*2*500的结构，flatten以后就是2000长度的一维数据。因此不论是对第一训练数据集还是第二训练数据集进行特征提取后，得到的都是一维数据。The features extracted by the convolutional neural network are the data after flatten, that is, one-dimensional data. The flatten here can be understood as dimensionality reduction. Assuming that there is a 2*2*500 structure left after pooling, the length after flatten is 2000 one-dimensional data. Therefore, after feature extraction is performed on the first training data set or the second training data set, one-dimensional data is obtained.

假设一组一维数据为512比特，第一训练数据集对应的是8组手势，手势识别部分的数据结果就是一个onehot(独热编码)形式的8bit一维数组(1，0，0，0，0，0，0，0)，一维数组中第几个是1就代表是第几个手势。这样可以很简单的将组长度为512比特的数据和这8个数据串联起来得到520长度的数据，此时只需要简单的全连接和激活函数，就可以进行域的识别。对于所有训练数据集，不管是标注还是未标注，都是已经给定了域，因此可以直接计算域识别交叉熵Es。结合上面计算出来的E_label和E_label，组合成域自适应神经网络模型的交叉熵。Assuming that a set of one-dimensional data is 512 bits, the first training data set corresponds to 8 sets of gestures, and the data result of the gesture recognition part is an 8-bit one-dimensional array (1, 0, 0, 0) in the form of onehot (one-hot encoding). , 0, 0, 0, 0), the first 1 in the one-dimensional array represents the first gesture. In this way, data with a group length of 512 bits can be easily concatenated with these 8 pieces of data to obtain data with a length of 520. At this time, only a simple full connection and activation function are needed to identify the domain. For all training datasets, whether labeled or unlabeled, the domain has been given, so the domain recognition cross-entropy Es can be directly calculated. Combined with the E _label and E _label calculated above, the cross entropy of the domain adaptive neural network model is combined.

E＝E_label+αE_unlab-βE_s E=E _label +αE _unlab -βE _s

已经知道整个网络的损失函数E，就可以直接使用梯度下降算法对网络参数进行训练。利用系统已经得到第一训练数据集的交叉熵E_label和第二训练数据集的交叉熵E_unlab，定义

代表为计算这两组交叉熵时需要优化的参数集合，同时定义

为神经网络模型域识别部分计算域交叉熵E_s时需要优化的参数集合，以及神经网络识别结构最终实现域自适应识别需要优化的参数集合

系统构建新的交叉熵E实现对参数集合

的学习，其中α和β域自适应神经网络的模型参数，Knowing the loss function E of the entire network, the network parameters can be trained directly using the gradient descent algorithm. Using the system to obtain the cross entropy E _label of the first training data set and the cross entropy E _unlab of the second training data set, define

Represents the set of parameters that need to be optimized when calculating the two sets of cross entropy, and defines

The parameter set that needs to be optimized when calculating the domain cross entropy E _s for the domain identification part of the neural network model, and the parameter set that needs to be optimized for the final realization of domain adaptive identification by the neural network identification structure

The system constructs a new cross-entropy E to realize the set of parameters

learning, where the α and β domains adapt the model parameters of the neural network,

E＝E_label+αE_unlab-βE_s E=E _label +αE _unlab -βE _s

神经网络的训练部分的基本思想是寻找最终的

和

这三组参数的集合，即当给定合适的α和β两个模型参数时，令E的值最小。The basic idea of the training part of the neural network is to find the final

and

The set of these three sets of parameters, that is, when the appropriate two model parameters, α and β, are given, the value of E is minimized.

具体的，训练的时候，采用的梯度下降方法计算的理论公式如下：Specifically, during training, the theoretical formula calculated by the gradient descent method used is as follows:

其中μ表示学习速率。where μ is the learning rate.

由于实时的系统设计能耗非常高，不可能一直调用神经网络模型进行识别。因此实时系统中还可以设计动作检测。Due to the very high energy consumption of real-time system design, it is impossible to always call the neural network model for recognition. Therefore, motion detection can also be designed in real-time systems.

由前文可知实时进行信道估计时，每输入480比特数据就估算出140比特大小的信道数据。经实验论证，当存在手掌运动时，得到140比特的信道特征数据h的方差很大，其中h和前述含义相同，在这里指的是周期长度为480比特的目标信号频段和480比特的原始音频信号进行信道估计后得出的信道特征数据。参考图7，图中示出了调用神经网络模型阈值判定参考图，因此使用h的方差作为阈值判定条件。当某一时刻的h方差>1/10平均方差数据，系统判断疑似有手势，给出前1.4s时刻的信道估计的特征数，去调用神经网络模型进行识别，就可以避免一直调用神经网络模型，设备的能耗过高的问题。当然，所述模型可以是普通的神经网络模型，也可以是域自适应神经网络模型或其他可以运用的网络模型，在此不限制网络模型的类型。It can be seen from the foregoing that when channel estimation is performed in real time, channel data with a size of 140 bits is estimated for every 480 bits of data input. It has been proved by experiments that when there is palm movement, the variance of the 140-bit channel characteristic data h is very large, where h has the same meaning as above, and here refers to the target signal frequency band with a period length of 480 bits and the original audio frequency of 480 bits. The channel characteristic data obtained after channel estimation is performed on the signal. Referring to FIG. 7 , the figure shows the reference graph for calling the neural network model threshold value determination, so the variance of h is used as the threshold value determination condition. When the h-variance at a certain moment is greater than 1/10 of the average variance data, the system judges that there is a suspected gesture, gives the number of features of the channel estimation in the first 1.4s, and calls the neural network model for identification, which can avoid calling the neural network model all the time. The problem of high energy consumption of equipment. Of course, the model may be a common neural network model, a domain adaptive neural network model or other applicable network models, and the type of the network model is not limited here.

本发明公开的手势识别方法，基于基频信号的信道估计实现目标手势的信道估计的特征数据提取，并通过预先构建的域自适应神经网络模型对所述信道估计的特征数据进行识别，不仅在特征数据提取部分能够获得准确的数据提取结果，而且采用域自适应神经网络模型进行识别可以降低和识别结果无关的域特征权重，从而提高识别的准确率以及识别的泛化性，使得本发明所提供的方案具有更广阔的应用前景。The gesture recognition method disclosed in the present invention realizes the feature data extraction of the channel estimation of the target gesture based on the channel estimation of the fundamental frequency signal, and recognizes the feature data of the channel estimation through the pre-built domain adaptive neural network model. The feature data extraction part can obtain accurate data extraction results, and the use of the domain adaptive neural network model for identification can reduce the domain feature weight irrelevant to the identification results, thereby improving the accuracy of identification and the generalization of identification. The provided scheme has broader application prospects.

然而尽管前述方案已经能够显著提高手势识别的准确率和系统识别的泛化性，但是仍存在一个问题，即当系统只使用单麦克风和单扬声器时，获得的目标音频信号只能计算出1维信道估计的特征数据。如果两种手势动作呈镜像，系统将无法进行识别，如手势顺时针旋转和手势逆时针旋转。However, although the aforementioned solutions have been able to significantly improve the accuracy of gesture recognition and the generalization of system recognition, there is still a problem, that is, when the system only uses a single microphone and a single speaker, the obtained target audio signal can only be calculated in one dimension. Characteristic data for channel estimation. If the two gesture actions are mirrored, the system will not recognize it, such as gesture clockwise rotation and gesture counterclockwise rotation.

为解决单个麦克风设备进行目标音频信号采集无法识别镜像手势的问题，在一个实施例中，获取目标音频信号时，选择由N个不同的麦克风设备分别采集到的N份所述目标音频信号，N为大于1的整数；所述基于所述原始音频信号和所述目标音频信号进行信道估计，得到信道估计的特征数据包括：对于每份所述目标音频信号，均分别与所述原始音频信号进行信道估计，得到N份信道特征数据；对所述N份信道特征数据进行整合，得到所述信道估计的特征数据。采用多个音频采集设备进行目标音频信号的多维度采集处理，可以获取到更多信道估计的特征数据，能够识别镜像手势，并提高手势识别结果。In order to solve the problem that a single microphone device cannot recognize the mirror gesture when collecting the target audio signal, in one embodiment, when acquiring the target audio signal, N parts of the target audio signal collected by N different microphone devices are selected, and N is an integer greater than 1; the performing channel estimation based on the original audio signal and the target audio signal, and obtaining the characteristic data of the channel estimation includes: for each copy of the target audio signal, performing the channel estimation with the original audio signal respectively Channel estimation is performed to obtain N pieces of channel characteristic data; the N pieces of channel characteristic data are integrated to obtain the characteristic data of the channel estimation. Using multiple audio acquisition devices to perform multi-dimensional acquisition and processing of target audio signals can obtain more feature data for channel estimation, recognize mirror gestures, and improve gesture recognition results.

显然使用多组麦克风接收到的目标音频信号进行信道估计时，能获得更多的特征信息。举例说明，假设麦克风个数为M，用户做出单个手势时，系统将获得M个独立的信道估计的特征数据，此时单个手势对应的信道估计的特征数据大小为140*140*M；进一步可以从多维度进行手势识别，将镜像手势区分开，获得更准确的手势识别结果。Obviously, when channel estimation is performed using the target audio signals received by multiple sets of microphones, more feature information can be obtained. For example, assuming that the number of microphones is M, when the user makes a single gesture, the system will obtain M independent channel estimation feature data. At this time, the size of the channel estimation feature data corresponding to a single gesture is 140*140*M; further Gesture recognition can be performed from multiple dimensions, distinguish mirror gestures, and obtain more accurate gesture recognition results.

图8示出了本申请实施例提供的基于音频的手势识别装置的结构框图，为了便于说明，仅示出了与本申请实施例相关的部分。FIG. 8 shows a structural block diagram of an audio-based gesture recognition apparatus provided by an embodiment of the present application. For convenience of description, only parts related to the embodiment of the present application are shown.

参照图8，在一个实施例中，基于音频的手势识别装置包括：8, in one embodiment, the audio-based gesture recognition device includes:

信号获取模块301，用于获取目标音频信号，所述目标音频信号为预设的原始音频信号在调制后，传播经过用户做出的目标手势后接收到的音频信号；A signal acquisition module 301, configured to acquire a target audio signal, the target audio signal is an audio signal received after a preset original audio signal is modulated and propagated through a target gesture made by a user;

信道估计模块302，用于基于所述原始音频信号和所述目标音频信号进行信道估计，得到信道估计的特征数据；a channel estimation module 302, configured to perform channel estimation based on the original audio signal and the target audio signal to obtain characteristic data of the channel estimation;

手势识别模块303，用于对所述信道估计的特征数据进行识别，得到所述目标手势的识别结果。The gesture recognition module 303 is configured to recognize the feature data of the channel estimation to obtain the recognition result of the target gesture.

在一个实施例中，所述信道估计模块302可以包括：In one embodiment, the channel estimation module 302 may include:

信号调解单元，用于对所述目标音频信号进行解调处理，得到目标基频信号；a signal modulation unit for demodulating the target audio signal to obtain a target fundamental frequency signal;

信号分段单元，用于对所述目标基频信号进行分段，得到多个目标信号片段，每个所述目标信号片段的长度均和所述原始音频信号的周期相同；a signal segmenting unit, configured to segment the target fundamental frequency signal to obtain a plurality of target signal segments, and the length of each target signal segment is the same as the period of the original audio signal;

信道估计单元，用于对于每个所述目标信号片段，均分别与所述原始音频信号中一个周期的信号片段进行信道估计，得到各自的信道特征数据；a channel estimation unit, configured to perform channel estimation on each of the target signal segments with a period of signal segments in the original audio signal to obtain respective channel characteristic data;

数据整合单元，用于将各个所述目标信号片段的信道特征数据整合，得到所述信道估计的特征数据。The data integration unit is configured to integrate the channel characteristic data of each of the target signal segments to obtain the characteristic data of the channel estimation.

在一个实施例中，所述信号调解单元可以包括：In one embodiment, the signal conditioning unit may include:

信号分解子单元，用于对所述目标音频信号进行降载波和IQ分解a signal decomposition sub-unit, used to perform down-carrier and IQ decomposition on the target audio signal

，得到降载波信号的实部信号和虚部信号；, obtain the real part signal and imaginary part signal of the down-carrier signal;

信号去噪子单元，用于使用低通滤波器对所述降载波信号的实部信号和虚部信号去噪，得到所述目标基频信号。A signal denoising subunit, configured to use a low-pass filter to denoise the real part signal and the imaginary part signal of the down-carrier signal to obtain the target fundamental frequency signal.

在一个实施例中，每个所述目标信号片段均带有时间戳，所述数据整合单元可以包括：In one embodiment, each of the target signal segments is time stamped, and the data integration unit may include:

数据转换子单元，用于将各个所述目标信号片段的信道特征数据分别表示为向量的形式，得到各个所述目标信号片段的特征向量；a data conversion subunit, used to represent the channel feature data of each of the target signal segments as a vector, respectively, to obtain a feature vector of each of the target signal segments;

数据排列子单元，用于将各个所述目标信号片段的特征向量按照各自对应的时间戳大小进行排列，构成特征矩阵；a data arrangement subunit, used for arranging the feature vectors of each of the target signal segments according to their respective time stamp sizes to form a feature matrix;

静态消除子单元，用于对于所述特征矩阵中的每一列，均将各自包含的各个元素值分别减去前一列包含的各个对应元素值，得到静态消除后的所述特征矩阵；将静态消除后的所述特征矩阵确定为所述信道估计的特征数据。The static elimination subunit is used for each column in the feature matrix to deduct the respective element values contained in the respective element values from the corresponding element values contained in the previous column to obtain the statically eliminated feature matrix; the static elimination The latter feature matrix is determined as feature data of the channel estimation.

在一个实施例中，手势识别模块303可以包括：In one embodiment, gesture recognition module 303 may include:

手势识别单元，用于采用预先构建的域自适应神经网络模型对所述信道估计的特征数据进行识别，得到所述目标手势的识别结果；a gesture recognition unit, configured to use a pre-built domain adaptive neural network model to recognize the feature data of the channel estimation, and obtain the recognition result of the target gesture;

其中，所述手势识别单元中的域自适应神经网络模型通过以下单元构建：Wherein, the domain adaptive neural network model in the gesture recognition unit is constructed by the following units:

模型构建单元，用于获取第一训练数据集和第二训练数据集，所述第一训练数据集包含多组已分配域标签和手势标签的信道估计样本数据，所述第二训练数据集包含多组已分配域标签但未分配手势标签的信道估计样本数据；A model building unit, configured to obtain a first training data set and a second training data set, the first training data set includes multiple groups of channel estimation sample data that have been assigned domain labels and gesture labels, and the second training data set includes Multiple sets of channel estimation sample data with domain labels assigned but no gesture labels assigned;

模型训练单元，用于以所述第一训练数据集和所述第二训练数据集作为训练集，训练得到所述域自适应神经网络模型。A model training unit, configured to use the first training data set and the second training data set as training sets to train to obtain the domain adaptive neural network model.

在一个实施例中，所述模型训练单元可以包括：In one embodiment, the model training unit may include:

第一训练子单元，用于将所述第一训练数据集输入初始域自适应神经网络模型，得到所述第一训练数据集的手势预测结果和域预测结果；a first training subunit, configured to input the first training data set into an initial domain adaptive neural network model to obtain the gesture prediction result and the domain prediction result of the first training data set;

第一计算子单元，用于根据所述第一训练数据集已分配的域标签和手势标签，以及所述第一训练数据集的手势预测结果和域预测结果，计算得到第一交叉熵E_label；The first calculation subunit is used to calculate the first cross entropy E _label according to the assigned domain label and gesture label of the first training data set, as well as the gesture prediction result and the domain prediction result of the first training data set ;

第二训练子单元，用于将所述第二训练数据集输入所述初始域自适应神经网络模型，得到所述第二训练数据集的手势预测结果和域预测结果；a second training subunit, configured to input the second training data set into the initial domain adaptive neural network model to obtain the gesture prediction result and the domain prediction result of the second training data set;

第二计算子单元，用于根据所述第二训练数据集已分配的域标签，以及所述第二训练数据集的手势预测结果和域预测结果，计算得到第二交叉熵E_unlab；The second calculation subunit is used to calculate the second cross entropy E _unlab according to the assigned domain label of the second training data set, and the gesture prediction result and the domain prediction result of the second training data set;

第三计算子单元，用于根据所述第一训练数据集已分配的域标签、所述第二训练数据集已分配的域标签、所述所述第一训练数据集的域预测结果以及所述第二训练数据集的域预测结果，计算得到域交叉熵E_s；The third calculation subunit is configured to be configured according to the assigned domain label of the first training data set, the assigned domain label of the second training data set, the domain prediction result of the first training data set, and the The domain prediction result of the second training data set is described, and the domain cross entropy E _s is obtained by calculation;

第四计算子单元，用于根据所述第一交叉熵、所述第二交叉熵和所述域交叉熵计算所述初始域自适应神经网络模型的交叉熵E＝E_label+αE_unlab-βE_s,其中α与β为域自适应神经网络模型的模型参数；The fourth calculation subunit is used to calculate the cross entropy of the initial domain adaptive neural network model according to the first cross entropy, the second cross entropy and the domain cross entropy E=E _label +αE _unlab -βE _s , where α and β are the model parameters of the domain adaptive neural network model;

模型优化子单元，用于以所述交叉熵E取最小值为目标，对所述初始域自适应神经网络模型的模型参数进行优化，得到所述域自适应神经网络模型。The model optimization subunit is configured to optimize the model parameters of the initial domain adaptive neural network model with the goal of taking the minimum value of the cross entropy E to obtain the domain adaptive neural network model.

在一个实施例中，信号获取模块301可以包括：In one embodiment, the signal acquisition module 301 may include:

多维度信号获取单元，用于获取由N个不同的麦克风设备分别采集到的N份所述目标音频信号，N为大于1的整数；A multi-dimensional signal acquisition unit, configured to acquire N parts of the target audio signal collected by N different microphone devices, where N is an integer greater than 1;

所述信道估计模块可以包括：The channel estimation module may include:

多维度信道估计单元，用于对于每份所述目标音频信号，均分别与所述原始音频信号进行信道估计，得到N份信道特征数据；A multi-dimensional channel estimation unit, configured to perform channel estimation with the original audio signal for each of the target audio signals to obtain N pieces of channel feature data;

多维度数据整合单元，用于对所述N份信道特征数据进行整合，得到所述信道估计的特征数据。A multi-dimensional data integration unit, configured to integrate the N pieces of channel characteristic data to obtain the characteristic data of the channel estimation.

本申请实施例还提供了一种终端设备，包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序，所述处理器执行所述计算机程序时实现如本申请提出的各个基于音频的手势识别方法的步骤。Embodiments of the present application further provide a terminal device, including a memory, a processor, and a computer program stored in the memory and running on the processor, where the processor implements the computer program as described herein when executing the computer program. The steps of each audio-based gesture recognition method proposed by the application.

本申请实施例还提供了一种计算机可读存储介质，所述计算机可读存储介质存储有计算机程序，所述计算机程序被处理器执行时实现如本申请提出的各个基于音频的手势识别方法的步骤。Embodiments of the present application also provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, implements the audio-based gesture recognition methods proposed in the present application. step.

本申请实施例还提供了一种计算机程序产品，当计算机程序产品在终端设备上运行时，使得终端设备执行本申请提出的各个基于音频的手势识别方法的步骤。Embodiments of the present application also provide a computer program product, which, when the computer program product runs on a terminal device, enables the terminal device to execute the steps of each audio-based gesture recognition method proposed in the present application.

图9为本申请一实施例提供的终端设备的结构示意图。如图9所示，该实施例的终端设备4包括：至少一个处理器40(图中仅示出一个)处理器、存储器41以及存储在所述存储器41中并可在所述至少一个处理器40上运行的计算机程序42，所述处理器40执行所述计算机程序42时实现上述任意基于音频的手势识别方法实施例中的步骤。FIG. 9 is a schematic structural diagram of a terminal device provided by an embodiment of the present application. As shown in FIG. 9 , the terminal device 4 in this embodiment includes: at least one processor 40 (only one is shown in the figure), a processor, a memory 41, and a processor stored in the memory 41 and available in the at least one processor A computer program 42 running on the processor 40, when the processor 40 executes the computer program 42, implements the steps in any of the above-mentioned embodiments of the audio-based gesture recognition method.

所述终端设备4可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备，以及智能手表、智能手环等可穿戴设备。该终端设备可包括，但不仅限于，处理器40、存储器41。本领域技术人员可以理解，图9仅仅是终端设备4的举例，并不构成对终端设备4的限定，可以包括比图示更多或更少的部件，或者组合某些部件，或者不同的部件，例如还可以包括输入输出设备、网络接入设备等。The terminal device 4 may be a computing device such as a desktop computer, a notebook, a palmtop computer and a cloud server, and a wearable device such as a smart watch and a smart bracelet. The terminal device may include, but is not limited to, the processor 40 and the memory 41 . Those skilled in the art can understand that FIG. 9 is only an example of the terminal device 4, and does not constitute a limitation on the terminal device 4. It may include more or less components than the one shown, or combine some components, or different components , for example, may also include input and output devices, network access devices, and the like.

所称处理器40可以是中央处理单元(Central Processing Unit，CPU)，该处理器40还可以是其他通用处理器、数字信号处理器(Digital Signal Processor，DSP)、专用集成电路(Application Specific Integrated Circuit，ASIC)、现成可编程门阵列(Field-Programmable Gate Array，FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。The so-called processor 40 may be a central processing unit (Central Processing Unit, CPU), and the processor 40 may also be other general-purpose processors, digital signal processors (Digital Signal Processors, DSP), application specific integrated circuits (Application Specific Integrated Circuits) , ASIC), off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

所述存储器41在一些实施例中可以是所述终端设备4的内部存储单元，例如终端设备4的硬盘或内存。所述存储器41在另一些实施例中也可以是所述终端设备4的外部存储设备，例如所述终端设备4上配备的插接式硬盘，智能存储卡(Smart Media Card，SMC)，安全数字(Secure Digital，SD)卡，闪存卡(Flash Card)等。进一步地，所述存储器41还可以既包括所述终端设备4的内部存储单元也包括外部存储设备。所述存储器41用于存储操作装置、应用程序、引导装载程序(BootLoader)、数据以及其他程序等，例如所述计算机程序的程序代码等。所述存储器41还可以用于暂时地存储已经输出或者将要输出的数据。The memory 41 may be an internal storage unit of the terminal device 4 in some embodiments, such as a hard disk or a memory of the terminal device 4 . In other embodiments, the memory 41 may also be an external storage device of the terminal device 4, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card) and so on. Further, the memory 41 may also include both an internal storage unit of the terminal device 4 and an external storage device. The memory 41 is used to store an operating device, an application program, a boot loader (Boot Loader), data, and other programs, such as program codes of the computer program, and the like. The memory 41 can also be used to temporarily store data that has been output or will be output.

所属领域的技术人员可以清楚地了解到，为了描述的方便和简洁，仅以上述各功能单元、模块的划分进行举例说明，实际应用中，可以根据需要而将上述功能分配由不同的功能单元、模块完成，即将所述装置的内部结构划分成不同的功能单元或模块，以完成以上描述的全部或者部分功能。实施例中的各功能单元、模块可以集成在一个处理单元中，也可以是各个单元单独物理存在，也可以两个或两个以上单元集成在一个单元中，上述集成的单元既可以采用硬件的形式实现，也可以采用软件功能单元的形式实现。另外，各功能单元、模块的具体名称也只是为了便于相互区分，并不用于限制本申请的保护范围。上述装置中单元、模块的具体工作过程，可以参考前述方法实施例中的对应过程，在此不再赘述。Those skilled in the art can clearly understand that, for the convenience and simplicity of description, only the division of the above-mentioned functional units and modules is used as an example. Module completion, that is, dividing the internal structure of the device into different functional units or modules to complete all or part of the functions described above. Each functional unit and module in the embodiment may be integrated in one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit, and the above-mentioned integrated units may adopt hardware. It can also be realized in the form of software functional units. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing from each other, and are not used to limit the protection scope of the present application. For the specific working process of the units and modules in the above apparatus, reference may be made to the corresponding process in the foregoing method embodiments, and details are not described herein again.

在上述实施例中，对各个实施例的描述都各有侧重，某个实施例中没有详述或记载的部分，可以参见其它实施例的相关描述。In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts that are not described or described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.

本领域普通技术人员可以意识到，结合本文中所公开的实施例描述的各示例的单元及算法步骤，能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行，取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能，但是这种实现不应认为超出本申请的范围。Those of ordinary skill in the art can realize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.

在本申请所提供的实施例中，应该理解到，所揭露的装置和方法，可以通过其它的方式实现。例如，以上所描述的装置实施例仅仅是示意性的，例如，所述模块或单元的划分，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式，例如多个单元或组件可以结合或者可以集成到另一个装置，或一些特征可以忽略，或不执行。另一点，所显示或讨论的相互之间的耦合或直接耦合或通讯连接可以是通过一些接口，装置或单元的间接耦合或通讯连接，可以是电性，机械或其它的形式。In the embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the device embodiments described above are only illustrative. For example, the division of the modules or units is only a logical function division. In actual implementation, there may be other division methods. For example, multiple units or components may be Incorporation may either be integrated into another device, or some features may be omitted, or not implemented. On the other hand, the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.

所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

另外，在本申请各个实施例中的各功能单元可以集成在一个处理单元中，也可以是各个单元单独物理存在，也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现，也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.

所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本申请实现上述实施例方法中的全部或部分流程，可以通过计算机程序来指令相关的硬件来完成，所述的计算机程序可存储于一计算机可读存储介质中，该计算机程序在被处理器执行时，可实现上述各个方法实施例的步骤。其中，所述计算机程序包括计算机程序代码，所述计算机程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。所述计算机可读介质至少可以包括：能够将计算机程序代码携带到终端设备的任何实体或装置、记录介质、计算机存储器、只读存储器(ROM，Read-Only Memory)、随机存取存储器(RAM，Random Access Memory)、电载波信号、电信信号以及软件分发介质。例如U盘、移动硬盘、磁碟或者光盘等。在某些司法管辖区，根据立法和专利实践，计算机可读介质不可以是电载波信号和电信信号。The integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the present application realizes all or part of the processes in the methods of the above embodiments, which can be completed by instructing the relevant hardware through a computer program, and the computer program can be stored in a computer-readable storage medium. When executed by a processor, the steps of each of the above method embodiments can be implemented. Wherein, the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file or some intermediate form, and the like. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the terminal device, recording medium, computer memory, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), electrical carrier signals, telecommunication signals, and software distribution media. For example, U disk, mobile hard disk, disk or CD, etc. In some jurisdictions, under legislation and patent practice, computer readable media may not be electrical carrier signals and telecommunications signals.

以上所述实施例仅用以说明本申请的技术方案，而非对其限制；尽管参照前述实施例对本申请进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围，均应包含在本申请的保护范围之内。The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, but not to limit them; although the present application has been described in detail with reference to the above-mentioned embodiments, those of ordinary skill in the art should understand that: it can still be used for the above-mentioned implementations. The technical solutions described in the examples are modified, or some technical features thereof are equivalently replaced; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions in the embodiments of the application, and should be included in the within the scope of protection of this application.

Claims

1. An audio-based gesture recognition method, comprising:

acquiring a target audio signal, wherein the target audio signal is an audio signal received after a preset original audio signal is modulated and is transmitted through a target gesture made by a user;

performing channel estimation based on the original audio signal and the target audio signal to obtain characteristic data of channel estimation;

and identifying the characteristic data of the channel estimation to obtain an identification result of the target gesture.

2. The gesture recognition method according to claim 1, wherein the original audio signal is a periodic signal, and the performing channel estimation based on the original audio signal and the target audio signal to obtain the characteristic data of the channel estimation comprises:

Demodulating the target audio signal to obtain a target base frequency signal;

segmenting the target base frequency signal to obtain a plurality of target signal segments, wherein the length of each target signal segment is the same as the period of the original audio signal;

for each target signal segment, respectively performing channel estimation with a signal segment of one period in the original audio signal to obtain respective channel characteristic data;

and integrating the channel characteristic data of each target signal segment to obtain the characteristic data of the channel estimation.

3. The gesture recognition method of claim 2, wherein the demodulating the target audio signal to obtain a target baseband signal comprises:

carrying out carrier reduction and IQ decomposition on the target audio signal to obtain a real part signal and an imaginary part signal of the carrier reduction signal;

and denoising the real part signal and the imaginary part signal of the carrier-reduced signal by using a low-pass filter to obtain the target fundamental frequency signal.

4. The gesture recognition method according to claim 2, wherein each of the target signal segments has a time stamp, and the integrating the channel characteristic data of the respective target signal segments to obtain the characteristic data of the channel estimation comprises:

Respectively representing the channel characteristic data of each target signal segment into a vector form to obtain a characteristic vector of each target signal segment;

arranging the eigenvectors of the target signal segments according to the sizes of the corresponding timestamps to form an eigenvector matrix;

for each column in the feature matrix, subtracting each corresponding element value contained in the previous column from each element value contained in the feature matrix to obtain the feature matrix after static elimination;

and determining the feature matrix after static elimination as feature data of the channel estimation.

5. The gesture recognition method according to claim 1, wherein the recognizing the feature data of the channel estimation to obtain the recognition result of the target gesture comprises:

recognizing the characteristic data of the channel estimation by adopting a pre-constructed domain adaptive neural network model to obtain a recognition result of the target gesture;

wherein the domain adaptive neural network model is constructed by the following steps:

acquiring a first training data set and a second training data set, wherein the first training data set comprises a plurality of groups of channel estimation sample data of distributed domain labels and gesture labels, and the second training data set comprises a plurality of groups of channel estimation sample data of distributed domain labels but not distributed gesture labels;

And training to obtain the domain adaptive neural network model by taking the first training data set and the second training data set as training sets.

6. The gesture recognition method of claim 5, wherein training the domain-adaptive neural network model using the first training data set and the second training data set as training sets comprises:

inputting the first training data set into an initial domain adaptive neural network model to obtain a gesture prediction result and a domain prediction result of the first training data set;

calculating to obtain a first cross entropy E according to the assigned domain label and the assigned gesture label of the first training data set and the gesture prediction result and the domain prediction result of the first training data set_label；

Inputting the second training data set into the initial domain adaptive neural network model to obtain a gesture prediction result and a domain prediction result of the second training data set;

calculating to obtain a second cross entropy E according to the assigned domain label of the second training data set and the gesture prediction result and the domain prediction result of the second training data set_unlab；

Calculating to obtain a domain cross entropy E according to the assigned domain label of the first training data set, the assigned domain label of the second training data set, the domain prediction result of the first training data set and the domain prediction result of the second training data set _s；

Calculating the cross entropy E-E of the initial domain adaptive neural network model according to the first cross entropy, the second cross entropy and the domain cross entropy_label+αE_unlab-βE_sWherein alpha and beta are model parameters of the domain adaptive neural network model;

and optimizing the model parameters of the initial domain self-adaptive neural network model by taking the minimum value of the cross entropy E as a target to obtain the domain self-adaptive neural network model.

7. The gesture recognition method according to any one of claims 1 to 6, wherein the acquiring a target audio signal includes:

acquiring N parts of target audio signals respectively acquired by N different microphone devices, wherein N is an integer greater than 1;

the performing channel estimation based on the original audio signal and the target audio signal to obtain feature data of channel estimation includes:

for each target audio signal, respectively carrying out channel estimation on the target audio signal and the original audio signal to obtain N parts of channel characteristic data;

and integrating the N parts of channel characteristic data to obtain the characteristic data of the channel estimation.

8. An audio-based gesture recognition apparatus, comprising:

the signal acquisition module is used for acquiring a target audio signal, wherein the target audio signal is an audio signal which is received after a preset original audio signal is modulated and is transmitted through a target gesture made by a user;

The channel estimation module is used for carrying out channel estimation based on the original audio signal and the target audio signal to obtain characteristic data of channel estimation;

and the gesture recognition module is used for recognizing the characteristic data of the channel estimation to obtain a recognition result of the target gesture.

9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the gesture recognition method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the gesture recognition method according to any one of claims 1 to 7.