CN111652165B - Mouth shape evaluation method, device, and computer storage medium
Mouth evaluation method, equipment and computer storage medium Download PDFInfo
- Publication number
- CN111652165B · CN202010514957.3A · CN202010514957A
- Authority
- CN
- China
- Prior art keywords
- evaluation
- image frame
- mouth shape
- data
- frame sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- General Health & Medical Sciences (AREA)
- Multimedia (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Acoustics & Sound (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
Embodiments of the present application provide a mouth shape evaluation method, device, and computer storage medium. The mouth shape evaluation method includes: acquiring data to be evaluated of a target object, where the data to be evaluated includes an image frame sequence and the image frame sequence includes at least one consecutive image frame representing a target pronunciation mouth shape; determining a feature matrix of the image frame sequence from the image frame sequence; obtaining first evaluation data from the feature matrix of the image frame sequence and a preset model, where the first evaluation data indicates the result of evaluating the feature matrix against the preset model; and generating mouth shape evaluation information for the target object from the evaluation data, where the evaluation data includes the first evaluation data and the mouth shape evaluation information represents the evaluation result of the target object's target pronunciation mouth shape. The method can evaluate whether the pronunciation mouth shape is accurate, and how accurate it is.
Description
Technical Field
The embodiments of the present application relate to the technical field of artificial intelligence, and in particular to a mouth shape evaluation method, device, and computer storage medium.
Background
Artificial intelligence technology is widely used in many areas of everyday life. In some application scenarios, for example in language teaching, artificial intelligence can be used to evaluate students' pronunciation; in vocal music teaching, it can be used to evaluate students' singing pronunciation. These are, of course, only examples. However, when the evaluation is based on audio alone, it can only confirm that successive pronunciations are roughly similar; it cannot accurately assess the accuracy of the mouth shape used for pronunciation.
Summary of the Invention
In view of this, one of the technical problems addressed by the embodiments of the present invention is to provide a mouth shape evaluation method, device, and computer storage medium, so as to overcome the inability of the prior art to accurately evaluate the accuracy of pronunciation mouth shapes.
An embodiment of the present application provides a mouth shape evaluation method, which includes:
acquiring data to be evaluated of a target object, where the data to be evaluated includes an image frame sequence and the image frame sequence includes at least one consecutive image frame representing a target pronunciation mouth shape;
determining a feature matrix of the image frame sequence from the image frame sequence;
obtaining first evaluation data from the feature matrix of the image frame sequence and a preset model, where the first evaluation data indicates the result of evaluating the feature matrix against the preset model; and
generating mouth shape evaluation information for the target object from the evaluation data, where the evaluation data includes the first evaluation data and the mouth shape evaluation information represents the evaluation result of the target object's target pronunciation mouth shape.
Optionally, in an embodiment of the present application, determining the feature matrix of the image frame sequence from the image frame sequence includes:
determining, in the image frames included in the image frame sequence, the coordinates of at least one mouth key point; and
generating the feature matrix of the image frame sequence in chronological order according to the number of image frames in the sequence, the number of mouth key points in each image frame, and the dimensionality of the coordinates, where the feature matrix of the image frame sequence includes the coordinates of the at least one mouth key point.
Optionally, in an embodiment of the present application, the preset model includes a standard pronunciation curve model, and obtaining the first evaluation data from the feature matrix of the image frame sequence and the preset model includes:
inputting the coordinates of the at least one mouth key point into the standard pronunciation curve model to obtain a score for the at least one mouth key point; obtaining a mouth shape score for each image frame from the score of the at least one mouth key point; and normalizing the mouth shape scores of the at least one image frame to generate the first evaluation data.
Optionally, in an embodiment of the present application, the preset model includes a preset transpose matrix, and obtaining the first evaluation data from the feature matrix of the image frame sequence and the preset model includes:
computing the dot product of the feature matrix of the image frame sequence and the preset transpose matrix to obtain a pronunciation matrix; and
comparing the pronunciation matrix with a preset standard pronunciation matrix to obtain the first evaluation data.
Optionally, in an embodiment of the present application, the method further includes:
performing mouth shape evaluation on the image frame sequence to obtain second evaluation data, where the second evaluation data represents the similarity between the target pronunciation mouth shape in the image frames and a preset standard pronunciation mouth shape;
and generating the mouth shape evaluation information for the target object from the evaluation data includes:
performing a weighted operation on the first evaluation data and the second evaluation data to generate the mouth shape evaluation information, where the evaluation data includes the first evaluation data and the second evaluation data.
Optionally, in an embodiment of the present application, the method further includes:
performing mouth shape recognition on at least one image frame in the image frame sequence; and
determining the similarity between the target pronunciation mouth shape and the preset standard pronunciation mouth shape according to the recognition result, and generating the second evaluation data.
Optionally, in an embodiment of the present application, before determining the similarity between the target pronunciation mouth shape and the preset standard pronunciation mouth shape according to the recognition result, the method further includes:
aligning the image frame sequence with the frame count of a reference frame sequence, where the reference frame sequence is a preset frame sequence representing the standard pronunciation mouth shape.
Optionally, in an embodiment of the present application, aligning the image frame sequence with the frame count of the reference frame sequence includes:
down-sampling or linearly interpolating the image frame sequence so that the image frame sequence is aligned with the frame count of the reference frame sequence.
Optionally, in an embodiment of the present application, the data to be evaluated further includes audio data corresponding to the image frame sequence, and the method further includes:
evaluating the audio data against a preset reference audio to generate third evaluation data, where the third evaluation data represents the pronunciation similarity between the audio data and the reference audio;
and generating the mouth shape evaluation information for the target object from the evaluation data includes:
performing a weighted operation on the first evaluation data and the third evaluation data to generate the mouth shape evaluation information, where the evaluation data includes the first evaluation data and the third evaluation data.
Optionally, in an embodiment of the present application, evaluating the audio data against the preset reference audio to generate the third evaluation data includes:
extending the start time of the audio data forward by a first preset duration, and extending the end time of the audio data backward by a second preset duration, to obtain extended audio data; and
evaluating the extended audio data against the preset reference audio to generate the third evaluation data.
An embodiment of the present application provides a mouth shape evaluation device, including an acquisition module, a matrix module, an evaluation module, and a synthesis module;
wherein the acquisition module is configured to acquire data to be evaluated of a target object, the data to be evaluated includes an image frame sequence, and the image frame sequence includes at least one consecutive image frame representing a target pronunciation mouth shape;
the matrix module is configured to determine a feature matrix of the image frame sequence from the image frame sequence;
the evaluation module is configured to obtain first evaluation data from the feature matrix of the image frame sequence and a preset model, where the first evaluation data indicates the result of evaluating the feature matrix against the preset model; and
the synthesis module is configured to generate mouth shape evaluation information for the target object from the evaluation data, where the evaluation data includes the first evaluation data and the mouth shape evaluation information represents the evaluation result of the target object's target pronunciation mouth shape.
An embodiment of the present application provides an electronic device, including a processor and a memory connected to the processor, where the memory stores a computer program and the processor is configured to execute the computer program to implement the mouth shape evaluation method described in any embodiment of the present application.
An embodiment of the present application provides a computer storage medium storing a computer program which, when executed by a processor, implements the mouth shape evaluation method described in any embodiment of the present application.
In the embodiments of the present application, the feature matrix of the image frame sequence is determined from at least one consecutive image frame in the sequence, and the first evaluation data is obtained from the feature matrix and the preset model, so that the accuracy of the pronunciation mouth shape can be evaluated. Because the preset model represents the standard pronunciation mouth shape, and the points on the preset model represent how the standard pronunciation mouth shape changes over time, evaluating the feature matrix of the image frame sequence with the preset model makes it possible to determine, over the course of the mouth shape's change, the gap between the pronunciation mouth shape represented by the image frame sequence and the standard pronunciation mouth shape. The method can therefore evaluate whether the pronunciation mouth shape is accurate, and how accurate it is.
Brief Description of the Drawings
Hereinafter, some specific embodiments of the present application will be described in detail, by way of example and not limitation, with reference to the accompanying drawings. The same reference numerals in the drawings denote the same or similar components or parts. Those skilled in the art will understand that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flowchart of a mouth shape evaluation method provided in Embodiment 1 of the present application;
FIG. 2 is a flowchart of a mouth shape evaluation method provided in Embodiment 2 of the present application;
FIG. 2a is a schematic waveform diagram of audio data provided in Embodiment 2 of the present application;
FIG. 2b is a schematic diagram of face key points provided in Embodiment 2 of the present application;
FIG. 2c is a schematic diagram of a phoneme decomposition result provided in Embodiment 2 of the present application;
FIG. 3 is a structural diagram of a mouth shape evaluation device provided in Embodiment 3 of the present application;
FIG. 4 is a structural diagram of an electronic device provided in Embodiment 4 of the present application.
Detailed Description
The specific implementation of the embodiments of the present invention is further described below with reference to the accompanying drawings.
Embodiment 1
Embodiment 1 of the present application provides a mouth shape evaluation method, as shown in FIG. 1, which is a flowchart of a mouth shape evaluation method provided by an embodiment of the present application. The mouth shape evaluation method includes the following steps:
Step 101: acquire data to be evaluated of a target object.
It should be noted that the mouth shape evaluation method provided in the present application is used to evaluate the accuracy of the mouth shape of a target object when it produces a target sound. The target object may be any person or any animal, which is not limited by the present application. The target sound may be a word, a sentence, or a passage in a certain language, or it may be a section of a piece of music, or a line or verse of a song. Of course, this is only an illustrative description and does not mean that the present application is limited thereto.
The data to be evaluated includes an image frame sequence, the image frame sequence includes at least one consecutive image frame representing the target pronunciation mouth shape, and the data to be evaluated may further include audio data corresponding to the image frame sequence. It should be noted that the target pronunciation mouth shape is the mouth shape of the target object when it produces the target sound. In the present application, "target" is used only to denote the singular and does not impose any limitation; it may be any person or any animal producing any sound, and this is only an illustrative description. The image frame sequence includes at least one consecutive image frame captured while the target object produces the target sound. It should be noted that "consecutive" here only indicates that the image frames are arranged in chronological order and does not imply that the time difference between two image frames is very small. For example, if the image frame sequence includes 10 image frames, then in chronological order the (n+1)-th image frame is ordered after the n-th image frame, where n is an integer greater than 0 and less than 10. As for the time interval between two image frames, the interval between every two adjacent image frames may be the same or different, and may be 3 ms, 5 ms, or 1 s, which is not limited by the present application.
Step 102: determine a feature matrix of the image frame sequence from the image frame sequence.
It should be noted that the feature matrix of the image frame sequence may represent the shape features of the mouth in each image frame of the sequence. The feature matrix may be composed of feature vectors of the image frames; for example, each image frame corresponds to one feature vector, and the feature vectors of at least one image frame together constitute the feature matrix of the image frame sequence. Of course, this is only an illustrative description and does not mean that the present application is limited thereto.
Optionally, in an embodiment of the present application, determining the feature matrix of the image frame sequence from the image frame sequence includes:
determining, in the image frames included in the image frame sequence, the coordinates of at least one mouth key point; and generating the feature matrix of the image frame sequence in chronological order according to the number of image frames in the sequence, the number of mouth key points in each image frame, and the dimensionality of the coordinates, where the feature matrix includes the coordinates of the at least one mouth key point. Specifically, the feature matrix of the image frame sequence includes the feature vector of each image frame in the sequence, and the feature vector of one image frame may include the coordinates of a set of mouth key points; for example, if there are n mouth key points, the feature vector of one image frame may include the coordinates of the n mouth key points.
In the present application, because the evaluation concerns the mouth shape, mouth key points may be selected, that is, key points may be selected in the lip region of the image frame. Because the coordinates of a mouth key point consist of a horizontal and a vertical coordinate, the dimensionality of the coordinates may be 2, and the feature matrix may be expressed as S(N, k, 2), where S denotes the feature matrix, N denotes the number of image frames, and k denotes the number of key points. Of course, this is only an illustrative description and does not mean that the present application is limited thereto.
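By way of illustration only, the feature matrix S(N, k, 2) described above might be assembled along the following lines; the NumPy layout and the helper name are assumptions for this sketch, not prescribed by the method itself:

```python
import numpy as np

def build_feature_matrix(frames_keypoints):
    """Stack per-frame mouth key point coordinates into S(N, k, 2).

    frames_keypoints: list of length N; each element is a list of k (x, y)
    mouth key point coordinates for one image frame, in chronological order.
    """
    S = np.asarray(frames_keypoints, dtype=np.float32)  # shape (N, k, 2)
    if S.ndim != 3 or S.shape[-1] != 2:
        raise ValueError("expected N frames of k (x, y) coordinates")
    return S

# Example: 10 frames, 20 mouth key points per frame
demo = np.random.rand(10, 20, 2).tolist()
S = build_feature_matrix(demo)
print(S.shape)  # (10, 20, 2)
```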
Optionally, in an embodiment of the present application, the method further includes: aligning the image frame sequence with the frame count of a reference frame sequence, where the reference frame sequence is a preset frame sequence representing the standard pronunciation mouth shape. Further optionally, the image frame sequence may be down-sampled or linearly interpolated so that it is aligned with the frame count of the reference frame sequence.
The reference frame sequence represents the standard pronunciation mouth shape. Naturally, the standard pronunciation mouth shape and the target pronunciation mouth shape represented by the image frame sequence correspond to the same sound; for example, both are producing the word "congratulation". The target pronunciation mouth shape is the mouth shape of the target object when saying "congratulation", whereas the standard pronunciation mouth shape is that of a teacher or a professional language expert saying the same word. By evaluating the image frame sequence, the purpose of mouth shape evaluation is achieved. Before the evaluation, the image frame sequence is aligned with the reference frame sequence so that the frame counts are the same, which makes the evaluation more accurate. Specifically, when the frame count of the image frame sequence is greater than that of the reference frame sequence, the image frame sequence is down-sampled to reduce its frame count; when the frame count of the image frame sequence is smaller than that of the reference frame sequence, the image frame sequence is linearly interpolated to increase its frame count; if the frame count of the image frame sequence equals that of the reference frame sequence, the frame counts are already aligned and the evaluation can begin.
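As a rough illustration of this alignment step, the sketch below resamples a key point sequence to a target frame count by uniform linear interpolation along the time axis; this is only one possible realization (the function name and uniform resampling are assumptions, and the later embodiment also discusses non-uniform variants):

```python
import numpy as np

def align_frame_count(S, target_frames):
    """Resample a key point sequence S(N, k, 2) to a given number of frames.

    Linear interpolation along the time axis both down-samples
    (N > target_frames) and up-samples (N < target_frames) the sequence.
    """
    N, k, d = S.shape
    if N == target_frames:
        return S
    src_t = np.linspace(0.0, 1.0, N)
    dst_t = np.linspace(0.0, 1.0, target_frames)
    aligned = np.empty((target_frames, k, d), dtype=S.dtype)
    for i in range(k):
        for j in range(d):
            aligned[:, i, j] = np.interp(dst_t, src_t, S[:, i, j])
    return aligned

# Example: align a 12-frame sequence to an 8-frame reference
S = np.random.rand(12, 20, 2)
print(align_frame_count(S, 8).shape)  # (8, 20, 2)
```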
Step 103: obtain first evaluation data from the feature matrix of the image frame sequence and a preset model.
The first evaluation data indicates the result of evaluating the feature matrix against the preset model. In line with the description of step 102, the preset model may be trained on the feature matrices of reference frame sequences. For example, suppose the database contains J kinds of standard pronunciation mouth shapes, each used to produce one sound, which may be the standard pronunciation of a sentence, a word, and so on. A feature matrix is built for the reference frame sequence of each standard pronunciation mouth shape, and a one-dimensional model is constructed. Specifically, the coordinates of the k lip key points of each reference frame in a reference frame sequence may be used as input and a manually annotated score as output, and at least one neuron may be used to fit a curve model for that standard pronunciation mouth shape. A curve model can be fitted for the reference frame sequence of each standard pronunciation mouth shape, so the database may contain J curve models.
Here, two specific examples are given to illustrate how the first evaluation data may be obtained.
Optionally, in the first example of the present application, the preset model includes a standard pronunciation curve model, and obtaining the first evaluation data from the feature matrix of the image frame sequence and the preset model includes:
inputting the coordinates of the at least one mouth key point into the standard pronunciation curve model to obtain a score for the at least one mouth key point; obtaining a mouth shape score for each image frame from the score of the at least one mouth key point; and normalizing the mouth shape scores of the at least one image frame to generate the first evaluation data.
The feature vector represents the coordinates of the mouth key points of an image frame, and the standard pronunciation curve model is used to score the feature vector, that is, to score the mouth key points of each image frame and thereby obtain a mouth shape score for each image frame; this score expresses the accuracy of the pronunciation mouth shape. The standard pronunciation curve model may be trained using the feature vectors of the pronunciation mouth shapes of multiple samples together with manually annotated scores.
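A minimal sketch of the scoring and normalization described in this first example is given below; here the fitted standard pronunciation curve model is assumed to be available as a callable, and the min-max normalization is one possible choice of normalization:

```python
import numpy as np

def first_evaluation_data(S, curve_model):
    """Score each frame's mouth key points with a fitted curve model and
    normalize the per-frame scores into first evaluation data.

    S: feature matrix of shape (N, k, 2).
    curve_model: a callable taking a flattened (k*2,) key point vector and
    returning a scalar score (assumed to have been fitted beforehand).
    """
    frame_scores = np.array([curve_model(frame.reshape(-1)) for frame in S])
    # Min-max normalization of the per-frame mouth shape scores
    lo, hi = frame_scores.min(), frame_scores.max()
    if hi > lo:
        return (frame_scores - lo) / (hi - lo)
    return np.ones_like(frame_scores)

# Example with a stand-in "model": mean vertical coordinate as the score
S = np.random.rand(10, 20, 2)
print(first_evaluation_data(S, lambda v: float(v[1::2].mean())))
```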
Optionally, in the second example of the present application, the preset model includes a preset transpose matrix, and obtaining the first evaluation data from the feature matrix of the image frame sequence and the preset model includes:
computing the dot product of the feature matrix of the image frame sequence and the preset transpose matrix to obtain a pronunciation matrix, and comparing the pronunciation matrix with a preset standard pronunciation matrix to obtain the first evaluation data.
Step 104: generate mouth shape evaluation information for the target object from the evaluation data.
The evaluation data includes the first evaluation data, and the mouth shape evaluation information represents the evaluation result of the target object's target pronunciation mouth shape.
The mouth shape evaluation information may also be generated by combining additional data in the evaluation. Three specific examples are described here:
Optionally, in the first example, the method further includes: performing mouth shape evaluation on the image frame sequence to obtain second evaluation data, where the second evaluation data represents the similarity between the target pronunciation mouth shape in the image frames and a preset standard pronunciation mouth shape;
and generating the mouth shape evaluation information for the target object from the evaluation data includes: performing a weighted operation on the first evaluation data and the second evaluation data to generate the mouth shape evaluation information, where the evaluation data includes the first evaluation data and the second evaluation data. The first evaluation data may represent the evaluation result over the course of the target object's continuous pronunciation, and the second evaluation data may represent the evaluation result of the target object's pronunciation mouth shape in each individual image frame; combining the first and second evaluation data makes the mouth shape evaluation of the target object more accurate and comprehensive.
Further optionally, to explain how the second evaluation data may be obtained: in an embodiment of the present application, the method further includes performing mouth shape recognition on at least one image frame in the image frame sequence, determining the similarity between the target pronunciation mouth shape and the preset standard pronunciation mouth shape according to the recognition result, and generating the second evaluation data.
Optionally, in line with the description of step 102, before determining the similarity between the target pronunciation mouth shape and the preset standard pronunciation mouth shape according to the recognition result, the method may further include: aligning the image frame sequence with the frame count of a reference frame sequence, where the reference frame sequence is a preset frame sequence representing the standard pronunciation mouth shape. Further optionally, aligning the image frame sequence with the frame count of the reference frame sequence includes down-sampling or linearly interpolating the image frame sequence so that it is aligned with the frame count of the reference frame sequence. Aligning the image frame sequence with the reference frame sequence so that the frame counts are the same makes the evaluation more accurate.
Optionally, in the second example, the data to be evaluated further includes audio data corresponding to the image frame sequence, and the method further includes: evaluating the audio data against a preset reference audio to generate third evaluation data, where the third evaluation data represents the pronunciation similarity between the audio data and the reference audio;
and generating the mouth shape evaluation information for the target object from the evaluation data includes: performing a weighted operation on the first evaluation data and the third evaluation data to generate the mouth shape evaluation information, where the evaluation data includes the first evaluation data and the third evaluation data. The third evaluation data represents the result of evaluating the target object's audio data; combining the first and third evaluation data makes the mouth shape evaluation of the target object more accurate and comprehensive.
Optionally, evaluating the audio data against the preset reference audio to generate the third evaluation data includes:
extending the start time of the audio data forward by a first preset duration and extending the end time of the audio data backward by a second preset duration to obtain extended audio data, and evaluating the extended audio data against the preset reference audio to generate the third evaluation data. Extending the audio data forward and backward in time provides a degree of noise reduction and regularization, which can improve the evaluation accuracy.
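For illustration, the extension of the audio segment might look like the sketch below; padding with zeros is an assumption made here for simplicity (in practice the surrounding samples of the original recording could be kept instead):

```python
import numpy as np

def extend_audio(audio, sample_rate, pre_ms=100, post_ms=100):
    """Pad an audio segment before its start and after its end.

    audio: 1-D numpy array containing the cut-out pronunciation segment.
    pre_ms / post_ms: the first and second preset durations, in milliseconds.
    """
    pre = int(sample_rate * pre_ms / 1000)
    post = int(sample_rate * post_ms / 1000)
    return np.concatenate([np.zeros(pre, audio.dtype), audio, np.zeros(post, audio.dtype)])

# Example: extend a 1-second 16 kHz segment by 100 ms on each side
segment = np.random.randn(16000).astype(np.float32)
print(extend_audio(segment, 16000).shape)  # (19200,)
```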
Optionally, in the third example, generating the mouth shape evaluation information for the target object from the evaluation data includes:
performing a weighted operation on the first evaluation data, the second evaluation data, and the third evaluation data to generate the mouth shape evaluation information, where the evaluation data includes the first, second, and third evaluation data, the second evaluation data represents the similarity between the target pronunciation mouth shape in the image frames and the preset standard pronunciation mouth shape, and the third evaluation data represents the pronunciation similarity between the audio data and the reference audio.
In the embodiments of the present application, the feature matrix of the image frame sequence is determined from at least one consecutive image frame in the sequence, and the first evaluation data is obtained from the feature matrix and the preset model, so that the accuracy of the pronunciation mouth shape can be evaluated. Because the preset model represents the standard pronunciation mouth shape, and the points on the preset model represent how the standard pronunciation mouth shape changes over time, evaluating the feature matrix of the image frame sequence with the preset model makes it possible to determine, over the course of the mouth shape's change, the gap between the pronunciation mouth shape represented by the image frame sequence and the standard pronunciation mouth shape. The method can therefore evaluate whether the pronunciation mouth shape is accurate, and how accurate it is.
Embodiment 2
Based on the mouth shape evaluation method described in Embodiment 1, Embodiment 2 of the present application further provides a mouth shape evaluation method, which is a specific implementation of the method in Embodiment 1. Referring to FIG. 2, which is a flowchart of a mouth shape evaluation method provided in Embodiment 2 of the present application, the method includes the following steps:
Step 201: acquire data to be evaluated.
In this embodiment, the data to be evaluated includes a pronunciation video, from which an image frame sequence and audio data can be extracted. In one application scenario, for example in a classroom, students may keep their camera and microphone on throughout the lesson; when the mouth shape evaluation stage begins, the system starts capturing video of the student pronouncing letters, words, or sentences, and stops capturing when the stage ends. The captured video contains the image frame sequence and the audio data.
Step 202: separate the audio data from the data to be evaluated.
Step 203: perform noise reduction on the audio data.
For example, Wiener filtering can be used to denoise the audio data and improve the signal-to-noise ratio, thereby improving the accuracy of audio data recognition and of audio data scoring.
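As a minimal sketch of this denoising step, the adaptive Wiener filter available in SciPy can be applied to a 1-D signal; the window size chosen here is an assumption, not a value taken from the original description:

```python
import numpy as np
from scipy.signal import wiener

def denoise_audio(audio, window=29):
    """Apply a Wiener filter to a 1-D audio signal to suppress noise."""
    return wiener(audio.astype(np.float64), mysize=window)

# Example: a noisy 440 Hz tone sampled at 16 kHz
t = np.linspace(0, 1, 16000, endpoint=False)
noisy = np.sin(2 * np.pi * 440 * t) + 0.3 * np.random.randn(t.size)
clean = denoise_audio(noisy)
print(clean.shape)
```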
Step 204: cut out the audio data segment of the target sound from the audio data.
For example, the first clearly prominent peak can be located in the waveform of the audio data; this first prominent peak can be taken as the moment the student starts to speak. If no further peaks appear after a run of consecutive peaks, the pronunciation is considered finished, and the last peak is taken as the end time of the pronunciation. This is illustrated in FIG. 2a, which is a schematic waveform diagram of audio data provided in Embodiment 2 of the present application. Because the reference audio (i.e., the audio data of the standard pronunciation) is recorded in advance, the audio data segment needs to be aligned with the reference audio in order to improve the evaluation accuracy. For example, the reference audio extends the start time of the pronunciation forward by 100 milliseconds and extends the end time backward by 100 milliseconds, making the reference audio easier to discern; the audio data segment can therefore be processed in the same way so that it is aligned with the standard pronunciation, improving the evaluation accuracy.
It should be noted that the start time and end time of the pronunciation can be detected by Voice Activity Detection (VAD), which can detect all the moments at which speech is present. For example, VAD produces a speech/non-speech decision for a series of moments; if speech is marked as 1 and non-speech as 0, the moments corresponding to the first 1 and the last 1 are taken as the start time and end time of the pronunciation, respectively. Of course, this is only an illustrative description and does not mean that the present application is limited thereto.
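By way of illustration, a very simple energy-based voice activity detector in the spirit of the description above could look like the following; the frame length, threshold, and function name are assumptions for this sketch, and a production VAD would typically be more elaborate:

```python
import numpy as np

def detect_speech_span(audio, sample_rate, frame_ms=20, threshold=0.02):
    """Energy-based VAD: mark each frame as speech (1) or non-speech (0) by
    its RMS energy, then return the start and end times (in seconds) of the
    first and last speech frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    flags = []
    for i in range(n_frames):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        flags.append(1 if np.sqrt(np.mean(frame ** 2)) > threshold else 0)
    speech = [i for i, f in enumerate(flags) if f == 1]
    if not speech:
        return None
    start = speech[0] * frame_ms / 1000
    end = (speech[-1] + 1) * frame_ms / 1000
    return start, end

# Example: silence, then a tone, then silence
sr = 16000
sig = np.concatenate([np.zeros(sr // 2),
                      0.5 * np.sin(2 * np.pi * 300 * np.arange(sr) / sr),
                      np.zeros(sr // 2)])
print(detect_speech_span(sig, sr))  # roughly (0.5, 1.5)
```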
Step 205: compare the audio data segment with the reference audio to obtain the third evaluation data.
The third evaluation data represents the pronunciation similarity between the audio data and the reference audio; of course, it may also represent the gap between the audio data and the reference audio. Either way, it can be used to evaluate the pronunciation accuracy of the audio data.
Step 206: cut out the image frame sequence from the pronunciation video according to the start time and end time of the audio data segment.
The image frame sequence is cut out according to the start time and end time of the audio data segment. Likewise, the start time of the image frame sequence can be extended forward by 100 ms and its end time extended backward by 100 ms, which not only aligns it with the reference audio but also provides a degree of noise reduction and regularization.
Step 207: perform face detection on the image frames in the image frame sequence.
Let f denote an image frame; the image frame sequence can be expressed as f0, f1, f2, ..., fn. Face detection is performed frame by frame to determine the face region. Before face detection, the image frames need to be preprocessed, including converting the original RGB channel order to BGR and scaling the image to 512×512 pixels. The preprocessed image frame is input into a face detection model to determine the face region; for example, the face region can be represented as P(x1, y1, x2, y2), where (x1, y1) and (x2, y2) represent the coordinates of two diagonal corners of the face region.
Step 208: perform key point detection on the face region in the image frame to obtain key point information.
In the present application, the key points may include mouth key points; of course, they may also include key points of other regions, as long as they can be combined to evaluate the pronunciation mouth shape. The key point information may include the coordinates of at least one mouth key point. Before key point detection, the face region may be expanded outward. The expansion increases robustness: without it, some points would be predicted at the edge of the image, causing larger errors; with it, the predicted points return to the correct face more reliably. Specifically, x1 = x1 - (x2 - x1) × 0.2, x2 = x2 + (x2 - x1) × 0.2, y1 = y1 - (y2 - y1) × 0.2, y2 = y2 + (y2 - y1) × 0.2. After this expansion, the face is cropped out and the cropped face is scaled to the size preset in the model, i.e., to fit the model's input size; for example, the face may be scaled to 112×112 and the scaled image then divided by 256. After the above processing, the processed image is input into a face key point detection model, which finally yields the 106 key points outlining the facial features. Of course, this is only an illustrative description; the number of key points could also be 100 or 50, which is not limited by the present application. This is illustrated in FIG. 2b, which is a schematic diagram of face key points provided in Embodiment 2 of the present application.
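For illustration, the 0.2 box expansion, crop, resize to 112×112, and division by 256 described above might be realized as follows; the clamping to the image border and the helper name are assumptions added for this sketch:

```python
import cv2
import numpy as np

def crop_face_for_landmarks(frame_bgr, box, expand=0.2, size=112):
    """Expand a detected face box, crop it, and normalize it for a
    landmark model, following the 0.2 expansion and /256 scaling above."""
    h, w = frame_bgr.shape[:2]
    x1, y1, x2, y2 = box
    dx, dy = (x2 - x1) * expand, (y2 - y1) * expand
    x1, y1 = max(0, int(x1 - dx)), max(0, int(y1 - dy))
    x2, y2 = min(w, int(x2 + dx)), min(h, int(y2 + dy))
    face = frame_bgr[y1:y2, x1:x2]
    face = cv2.resize(face, (size, size)).astype(np.float32) / 256.0
    return face

# Example: a dummy 512x512 frame and a detected box
frame = np.zeros((512, 512, 3), dtype=np.uint8)
print(crop_face_for_landmarks(frame, (100, 120, 300, 360)).shape)  # (112, 112, 3)
```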
Step 209: extract the mouth image from the face region of the image frame according to the key point information, and perform mouth shape recognition.
Taking English pronunciation as the target sound as an example: according to the definition of the international English pronunciation standard, mouth shapes are divided into 48 categories. A mouth shape recognition model can be trained using mouth shape image samples together with the category labels annotated for those samples. An additional category for mouth shapes that do not belong to any of the 48 can also be added, giving 49 categories in total.
The mouth image is scaled to 64×64, divided by 256, and then fed into the mouth shape recognition model to obtain the mouth shape of the mouth in each image frame of the image frame sequence.
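A minimal sketch of this preprocessing and classification step is given below; the trained mouth shape recognition model is assumed to be available as a callable returning 49 class scores, and the helper name is an assumption:

```python
import cv2
import numpy as np

def classify_mouth_shapes(mouth_images, mouth_model, size=64):
    """Preprocess cropped mouth images (resize to 64x64, divide by 256) and
    classify each into one of the mouth shape categories.

    mouth_model: a callable mapping a (size, size, 3) float array to a
    vector of class scores (assumed to be a trained classifier).
    """
    labels = []
    for img in mouth_images:
        x = cv2.resize(img, (size, size)).astype(np.float32) / 256.0
        labels.append(int(np.argmax(mouth_model(x))))
    return labels

# Example with a stand-in "model" that returns 49 uniform scores
dummy = [np.zeros((80, 100, 3), dtype=np.uint8)] * 3
print(classify_mouth_shapes(dummy, lambda x: np.ones(49)))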
Step 210: align the image frame sequence with the reference frame sequence.
It should be noted that after step 209, the image frames in the image frame sequence may retain only the mouth image or may retain the complete image, which is not limited by the present application. Each image frame corresponds to one mouth shape category, and the image frame sequence may also be replaced by a sequence of mouth shape categories; of course, its function is the same, which is not limited by the present application. It should also be noted that the alignment of the image frame sequence with the reference frame sequence may be performed after step 206, i.e., immediately after the image frame sequence is cut out of the pronunciation video, or after the mouth images are extracted from the image frames in step 209; this is not limited by the present application, as long as the image frame sequence is aligned with the reference frame sequence before the similarity between the target pronunciation mouth shape and the preset standard pronunciation mouth shape is determined.
Because different students take different amounts of time to read the same word or sentence, the frame count differs from the reference frame sequence in the database (i.e., the frame count of the standard pronunciation), making a frame-by-frame comparison of mouth shapes impossible. Therefore, the image frame sequence is down-sampled or linearly interpolated to ensure it has the same frame count as the reference frame sequence.
Since, for a word or sentence, the first half is often read quickly and the second half slowly, uniform down-sampling or linear interpolation would leave too few frames for the quickly read part, which in turn degrades the recognition result. Non-uniform down-sampling or interpolation can therefore be used.
For example, both the standard pronunciation in the database and the collected pronunciation of the target object can be decomposed into phonemes. For example, the word "congratulation", after phoneme decomposition, is split by phoneme into con'gra'tu'la'tion. FIG. 2c is a schematic diagram of a phoneme decomposition result provided in Embodiment 2 of the present application; the phonemes of the target object's pronunciation are aligned with the phonemes of the standard pronunciation.
For example, if the target object's pronunciation of the phoneme "con" spans 10 frames while the standard pronunciation spans 8 frames, the target object's image frame sequence is down-sampled; if the target object's pronunciation spans 8 frames while the standard pronunciation spans 10 frames, the target object's image frame sequence is linearly interpolated, so that the frame counts are kept exactly the same.
Step 211: determine the similarity between the target pronunciation mouth shape and the preset standard pronunciation mouth shape, and generate the second evaluation data.
For example, the Euclidean distance can be used to compute the similarity between the image frame sequence and the reference frame sequence. Each image frame is compared with the corresponding reference frame in the reference frame sequence and a similarity is computed, so every image frame in the sequence is evaluated; the second evaluation data can then be generated from the similarity between each image frame and its reference frame.
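As one possible realization of this per-frame comparison, the sketch below computes a Euclidean distance per frame and maps it to a similarity; the particular distance-to-similarity mapping (1/(1+d)) is an assumption made for illustration, not prescribed by the original text:

```python
import numpy as np

def second_evaluation_data(S, T):
    """Per-frame Euclidean similarity between an aligned key point sequence
    S(N, k, 2) and the reference sequence T(N, k, 2)."""
    assert S.shape == T.shape, "sequences must be frame-aligned first"
    distances = np.linalg.norm((S - T).reshape(S.shape[0], -1), axis=1)
    return 1.0 / (1.0 + distances)

# Example
S = np.random.rand(8, 20, 2)
T = np.random.rand(8, 20, 2)
print(second_evaluation_data(S, T).shape)  # (8,)
```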
Step 212: determine the feature matrix of the image frame sequence from the image frame sequence.
The feature matrix of the image frame sequence may be composed of the feature vectors of at least one image frame. For example:
after the mouth key points are extracted, the mouth key points in each image frame can be represented as (20, 2), where 20 is the number of key points and 2 means that each key point is described by its x and y coordinates, i.e., the equivalent of 20 mouth coordinates; the coordinates of the mouth key points in one image frame form the feature vector of that frame. Of course, 20 key points are used here only as an example and do not mean that the present application is limited thereto.
The feature vectors of all image frames are combined in order into one feature matrix, which is the feature matrix of the image frame sequence. If the number of image frames in the sequence is N, the feature matrix can be expressed as S(N, 20, 2), where N is the number of image frames; if the reference frame sequence has M reference frames, the feature matrix of the reference frame sequence can be expressed as T(M, 20, 2), where M is the number of reference frames.
Further, for example, the database contains the standard pronunciations of J sentences and words; a feature matrix is constructed for the reference frame sequence of each standard pronunciation, and a one-dimensional model is constructed. Specifically, the coordinates of the 20 mouth key points of a standard pronunciation can be used as input and a manually annotated score as output, and 10 neurons can be used to fit a standard pronunciation curve model for that standard pronunciation; this curve model can then be used to check the accuracy of pronunciation mouth shapes of that standard pronunciation's type. The database may therefore contain J standard pronunciation curve models.
Step 213: input the feature vector of each image frame into the standard pronunciation curve model to obtain a mouth shape score for the frame, and normalize the scores to generate the first evaluation data.
The first evaluation data thus represents the evaluation result of the continuously changing pronunciation mouth shape.
Step 214: perform a weighted operation on the first evaluation data, the second evaluation data, and the third evaluation data, and generate the mouth shape evaluation information.
The third evaluation data obtained in step 205, the second evaluation data obtained in step 211, and the first evaluation data obtained in step 213 are weighted to obtain the final mouth shape evaluation result. The evaluation is thus based not only on the audio, but also on the mouth shape of each frame and on the continuous change of the mouth shape; combining them, the resulting mouth shape evaluation information can assess the accuracy of the pronunciation mouth shape more accurately. It should be noted that the first, second, and third evaluation data may be multiplied by the same weight or by different weights according to the user's needs, which is not limited by the present application; for example, the sum of the first, second, and third evaluation data may be computed to generate the mouth shape evaluation information, i.e., the weight of each piece of evaluation data is 1.
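A minimal sketch of the weighted combination, assuming each of the three evaluation data has first been summarized as a scalar score (e.g., a mean of per-frame values):

```python
import numpy as np

def combine_scores(first, second, third, weights=(1.0, 1.0, 1.0)):
    """Weighted combination of the three evaluation scores.

    weights: defaults to equal weights of 1, as in the example above;
    different weights can be chosen according to the user's needs.
    """
    w1, w2, w3 = weights
    return w1 * first + w2 * second + w3 * third

# Example
print(combine_scores(0.8, 0.7, 0.9))                   # 2.4 with unit weights
print(combine_scores(0.8, 0.7, 0.9, (0.5, 0.3, 0.2)))  # unequal weights
```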
In addition to computing the first evaluation data by constructing a one-dimensional model, it can also be computed in the following way:
The feature vectors of all image frames are combined in order into one feature matrix, which is the feature matrix of the image frame sequence. If the number of image frames in the sequence is N, the feature matrix can be expressed as S(N, 20, 2), where N is the number of image frames; if the reference frame sequence has M reference frames, the feature matrix of the reference frame sequence can be expressed as T(M, 20, 2), where M is the number of reference frames. Multiple transpose matrices are preset; a transpose matrix can bring the two matrices to be compared to the same dimensions. For example, a transpose matrix of size M×N, multiplied with S(N, 20, 2), yields a matrix with the same dimensions as T(M, 20, 2); the two matrices are then compared for differences. The feature matrix of the image frame sequence is dot-multiplied by the transpose matrix and then compared with the feature matrix of the reference frame sequence to obtain the evaluation result. It should be noted that, to compare the difference between the two matrices, the Euclidean distance between them can be computed and the similarity of the two matrices determined from that distance. Of course, this is only an illustrative description and does not mean that the present application is limited thereto.
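For illustration, one way to realize such an M×N matrix is as a linear resampling matrix, as sketched below; the construction of the matrix and the use of a single overall Euclidean distance are assumptions for this sketch, not the only possible choice of "preset transpose matrix":

```python
import numpy as np

def resampling_matrix(M, N):
    """Build an M x N matrix that linearly resamples an N-frame sequence to
    M frames, bringing S(N, 20, 2) to the dimensions of T(M, 20, 2)."""
    A = np.zeros((M, N))
    for i in range(M):
        pos = i * (N - 1) / (M - 1) if M > 1 else 0.0
        lo = int(np.floor(pos))
        hi = min(lo + 1, N - 1)
        frac = pos - lo
        A[i, lo] += 1 - frac
        A[i, hi] += frac
    return A

def compare_with_reference(S, T):
    """Map S(N, k, 2) onto M frames and compare with T(M, k, 2) by Euclidean distance."""
    M, N = T.shape[0], S.shape[0]
    A = resampling_matrix(M, N)
    pronounced = np.tensordot(A, S, axes=([1], [0]))  # shape (M, k, 2)
    return np.linalg.norm(pronounced - T)

# Example
S = np.random.rand(12, 20, 2)
T = np.random.rand(8, 20, 2)
print(compare_with_reference(S, T))
```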
Embodiment 3
An embodiment of the present application provides a mouth shape evaluation device, including an acquisition module 301, a matrix module 302, an evaluation module 303, and a synthesis module 304;
wherein the acquisition module 301 is configured to acquire data to be evaluated of a target object, the data to be evaluated includes an image frame sequence, and the image frame sequence includes at least one consecutive image frame representing a target pronunciation mouth shape;
the matrix module 302 is configured to determine a feature matrix of the image frame sequence from the image frame sequence;
the evaluation module 303 is configured to obtain first evaluation data from the feature matrix of the image frame sequence and a preset model, where the first evaluation data indicates the result of evaluating the feature matrix against the preset model; and
the synthesis module 304 is configured to generate mouth shape evaluation information for the target object from the evaluation data, where the evaluation data includes the first evaluation data and the mouth shape evaluation information represents the evaluation result of the target object's target pronunciation mouth shape.
Optionally, in an embodiment of the present application, the matrix module 302 is further configured to determine, in the image frames included in the image frame sequence, the coordinates of at least one mouth key point, and to generate the feature matrix of the image frame sequence in chronological order according to the number of image frames in the sequence, the number of mouth key points in each image frame, and the dimensionality of the coordinates, where the feature matrix includes the coordinates of the at least one mouth key point.
Optionally, in an embodiment of the present application, the preset model includes a standard pronunciation curve model; the evaluation module 303 is further configured to input the coordinates of the at least one mouth key point into the standard pronunciation curve model to obtain a score for the at least one mouth key point, obtain a mouth shape score for each image frame from the score of the at least one mouth key point, and normalize the mouth shape scores of the at least one image frame to generate the first evaluation data.
Optionally, in an embodiment of the present application, the preset model includes a preset transpose matrix; the evaluation module 303 is further configured to compute the dot product of the feature matrix of the image frame sequence and the preset transpose matrix to obtain a pronunciation matrix, and to compare the pronunciation matrix with a preset standard pronunciation matrix to obtain the first evaluation data.
Optionally, in an embodiment of the present application, the evaluation module 303 is further configured to perform mouth shape evaluation on the image frame sequence and obtain second evaluation data, where the second evaluation data represents the similarity between the target pronunciation mouth shape in the image frames and a preset standard pronunciation mouth shape;
and the synthesis module 304 is further configured to perform a weighted operation on the first evaluation data and the second evaluation data and generate the mouth shape evaluation information, where the evaluation data includes the first evaluation data and the second evaluation data.
Optionally, in an embodiment of the present application, the evaluation module 303 is further configured to perform mouth shape recognition on at least one image frame in the image frame sequence, determine the similarity between the target pronunciation mouth shape and the preset standard pronunciation mouth shape according to the recognition result, and generate the second evaluation data.
Optionally, in an embodiment of the present application, the evaluation module 303 is further configured to align the image frame sequence with the frame count of a reference frame sequence, where the reference frame sequence is a preset frame sequence representing the standard pronunciation mouth shape.
Optionally, in an embodiment of the present application, the evaluation module 303 is further configured to down-sample or linearly interpolate the image frame sequence so that the image frame sequence is aligned with the frame count of the reference frame sequence, where the reference frame sequence is a preset frame sequence representing the standard pronunciation mouth shape.
Optionally, in an embodiment of the present application, the evaluation module 303 is further configured to evaluate the audio data against a preset reference audio to generate third evaluation data, where the third evaluation data represents the pronunciation similarity between the audio data and the reference audio;
and the synthesis module 304 is further configured to perform a weighted operation on the first evaluation data and the third evaluation data and generate the mouth shape evaluation information, where the evaluation data includes the first evaluation data and the third evaluation data.
Optionally, in an embodiment of the present application, the evaluation module 303 is further configured to extend the start time of the audio data forward by a first preset duration and extend the end time of the audio data backward by a second preset duration to obtain extended audio data, and to evaluate the extended audio data against the preset reference audio to generate the third evaluation data.
Embodiment 4
An embodiment of the present application provides an electronic device. As shown in FIG. 4, the electronic device includes:
at least one processor 402 and a memory 404,
wherein:
the memory 404 is used to store a program 410. The memory 404 may include high-speed RAM memory and may also include non-volatile memory, such as at least one disk memory. Specifically, the program 410 may include program code, and the program code includes computer operation instructions.
The processor 402 is used to execute the program 410, and specifically may execute the relevant steps of the methods described in Embodiments 1 and 2 above.
Optionally, in an embodiment of the present application, the electronic device may further include a bus 406 and a communications interface 408.
The processor 402, the communications interface 408, and the memory 404 may communicate with each other through the communication bus 406.
The communications interface 408 is used for communicating with other devices.
The processor 402 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. The one or more processors included in the electronic device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
Embodiment Five
本申请实施例提供一种计算机存储介质,包括:计算机存储介质存储有计算机程序,在处理器执行计算机程序时,实现如本申请任一实施例所描述的口型评测方法。An embodiment of the present application provides a computer storage medium, including: the computer storage medium stores a computer program, and when the processor executes the computer program, the mouth shape evaluation method described in any embodiment of the present application is implemented.
In the embodiments of the present application, the feature matrix of the image frame sequence is determined according to at least one continuous image frame in the image frame sequence, and the first evaluation data is obtained according to the feature matrix and the preset model, so as to evaluate the accuracy of the pronunciation mouth shape. Because the preset model represents the standard pronunciation mouth shape, and the points on the preset model represent the change process of the standard pronunciation mouth shape, evaluating the feature matrix of the image frame sequence with the preset model makes it possible to determine, over the course of the mouth shape change, the gap between the pronunciation mouth shape represented by the image frame sequence and the standard pronunciation mouth shape, and thus to evaluate both whether the pronunciation mouth shape is accurate and how accurate it specifically is.
本申请实施例的口型评测设备以多种形式存在,包括但不限于:The mouth shape evaluation devices of the embodiments of the present application exist in various forms, including but not limited to:
(1)移动通信设备:这类设备的特点是具备移动通信功能,并且以提供话音、数据通信为主要目标。这类终端包括:智能手机(例如iPhone)、多媒体手机、功能性手机,以及低端手机等。(1) Mobile communication equipment: This type of equipment is characterized by having mobile communication functions, and its main goal is to provide voice and data communication. Such terminals include: smart phones (eg iPhone), multimedia phones, functional phones, and low-end phones.
(2)超移动个人计算机设备:这类设备属于个人计算机的范畴,有计算和处理功能,一般也具备移动上网特性。这类终端包括:PDA、MID和UMPC设备等,例如iPad。(2) Ultra-mobile personal computer equipment: This type of equipment belongs to the category of personal computers, has computing and processing functions, and generally has the characteristics of mobile Internet access. Such terminals include: PDAs, MIDs, and UMPC devices, such as iPads.
(3)便携式娱乐设备:这类设备可以显示和播放多媒体内容。该类设备包括:音频、视频播放器(例如iPod),掌上游戏机,电子书,以及智能玩具和便携式车载导航设备。(3) Portable entertainment equipment: This type of equipment can display and play multimedia content. Such devices include: audio and video players (eg iPod), handheld game consoles, e-books, as well as smart toys and portable car navigation devices.
(4)其他具有数据交互功能的电子设备。(4) Other electronic devices with data interaction function.
至此,已经对本主题的特定实施例进行了描述。其它实施例在所附权利要求书的范围内。在一些情况下,在权利要求书中记载的动作可以按照不同的顺序来执行并且仍然可以实现期望的结果。另外,在附图中描绘的过程不一定要求示出的特定顺序或者连续顺序,以实现期望的结果。在某些实施方式中,多任务处理和并行处理可以是有利的。So far, specific embodiments of the present subject matter have been described. Other embodiments are within the scope of the appended claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Additionally, the processes depicted in the figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain embodiments, multitasking and parallel processing may be advantageous.
在20世纪90年代,对于一个技术的改进可以很明显地区分是硬件上的改进(例如,对二极管、晶体管、开关等电路结构的改进)还是软件上的改进(对于方法流程的改进)。然而,随着技术的发展,当今的很多方法流程的改进已经可以视为硬件电路结构的直接改进。设计人员几乎都通过将改进的方法流程编程到硬件电路中来得到相应的硬件电路结构。因此,不能说一个方法流程的改进就不能用硬件实体模块来实现。例如,可编程逻辑器件(Programmable Logic Device,PLD)(例如现场可编程门阵列(Field Programmable GateArray,FPGA))就是这样一种集成电路,其逻辑功能由用户对器件编程来确定。由设计人员自行编程来把一个数字系统“集成”在一片PLD上,而不需要请芯片制造厂商来设计和制作专用的集成电路芯片。而且,如今,取代手工地制作集成电路芯片,这种编程也多半改用“逻辑编译器(logic compiler)”软件来实现,它与程序开发撰写时所用的软件编译器相类似,而要编译之前的原始代码也得用特定的编程语言来撰写,此称之为硬件描述语言(Hardware Description Language,HDL),而HDL也并非仅有一种,而是有许多种,如ABEL(Advanced Boolean Expression Language)、AHDL(Altera Hardware DescriptionLanguage)、Confluence、CUPL(Cornell University Programming Language)、HDCal、JHDL(Java Hardware Description Language)、Lava、Lola、MyHDL、PALASM、RHDL(RubyHardware Description Language)等,目前最普遍使用的是VHDL(Very-High-SpeedIntegrated Circuit Hardware Description Language)与Verilog。本领域技术人员也应该清楚,只需要将方法流程用上述几种硬件描述语言稍作逻辑编程并编程到集成电路中,就可以很容易得到实现该逻辑方法流程的硬件电路。In the 1990s, improvements in a technology could be clearly differentiated between improvements in hardware (eg, improvements to circuit structures such as diodes, transistors, switches, etc.) or improvements in software (improvements in method flow). However, with the development of technology, the improvement of many methods and processes today can be regarded as a direct improvement of the hardware circuit structure. Designers almost get the corresponding hardware circuit structure by programming the improved method flow into the hardware circuit. Therefore, it cannot be said that the improvement of a method flow cannot be realized by hardware entity modules. For example, a Programmable Logic Device (PLD) (eg, Field Programmable Gate Array (FPGA)) is an integrated circuit whose logic function is determined by user programming of the device. It is programmed by the designer to "integrate" a digital system on a PLD without having to ask the chip manufacturer to design and manufacture a dedicated integrated circuit chip. And, instead of making integrated circuit chips by hand, these days, much of this programming is done using software called a "logic compiler", which is similar to the software compiler used in program development and writing, but before compiling The original code also has to be written in a specific programming language, which is called Hardware Description Language (HDL), and there is not only one HDL, but many kinds, such as ABEL (Advanced Boolean Expression Language) , AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, RHDL (RubyHardware Description Language), etc. The most commonly used are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. It should also be clear to those skilled in the art that a hardware circuit for implementing the logic method process can be easily obtained by simply programming the method process in the above-mentioned several hardware description languages and programming it into the integrated circuit.
控制器可以按任何适当的方式实现,例如,控制器可以采取例如微处理器或处理器以及存储可由该(微)处理器执行的计算机可读程序代码(例如软件或固件)的计算机可读介质、逻辑门、开关、专用集成电路(Application Specific Integrated Circuit,ASIC)、可编程逻辑控制器和嵌入微控制器的形式,控制器的例子包括但不限于以下微控制器:ARC 625D、Atmel AT91SAM、Microchip PIC18F26K20以及Silicone Labs C8051F320,存储器控制器还可以被实现为存储器的控制逻辑的一部分。本领域技术人员也知道,除了以纯计算机可读程序代码方式实现控制器以外,完全可以通过将方法步骤进行逻辑编程来使得控制器以逻辑门、开关、专用集成电路、可编程逻辑控制器和嵌入微控制器等的形式来实现相同功能。因此这种控制器可以被认为是一种硬件部件,而对其内包括的用于实现各种功能的装置也可以视为硬件部件内的结构。或者甚至,可以将用于实现各种功能的装置视为既可以是实现方法的软件模块又可以是硬件部件内的结构。The controller may be implemented in any suitable manner, for example, the controller may take the form of eg a microprocessor or processor and a computer readable medium storing computer readable program code (eg software or firmware) executable by the (micro)processor , logic gates, switches, application specific integrated circuits (ASICs), programmable logic controllers and embedded microcontrollers, examples of controllers include but are not limited to the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicon Labs C8051F320, the memory controller can also be implemented as part of the control logic of the memory. Those skilled in the art also know that, in addition to implementing the controller in the form of pure computer-readable program code, the controller can be implemented as logic gates, switches, application-specific integrated circuits, programmable logic controllers and embedded devices by logically programming the method steps. The same function can be realized in the form of a microcontroller, etc. Therefore, such a controller can be regarded as a hardware component, and the devices included therein for realizing various functions can also be regarded as a structure within the hardware component. Or even, the means for implementing various functions can be regarded as both a software module implementing a method and a structure within a hardware component.
上述实施例阐明的系统、装置、模块或单元,具体可以由计算机芯片或实体实现,或者由具有某种功能的产品来实现。一种典型的实现设备为计算机。具体的,计算机例如可以为个人计算机、膝上型计算机、蜂窝电话、相机电话、智能电话、个人数字助理、媒体播放器、导航设备、电子邮件设备、游戏控制台、平板计算机、可穿戴设备或者这些设备中的任何设备的组合。The systems, devices, modules or units described in the above embodiments may be specifically implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or A combination of any of these devices.
为了描述的方便,描述以上装置时以功能分为各种单元分别描述。当然,在实施本申请时可以把各单元的功能在同一个或多个软件和/或硬件中实现。For the convenience of description, when describing the above device, the functions are divided into various units and described respectively. Of course, when implementing the present application, the functions of each unit may be implemented in one or more software and/or hardware.
本领域内的技术人员应明白,本申请的实施例可提供为方法、系统、或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。As will be appreciated by those skilled in the art, the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
本申请是参照根据本申请实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the present application. It will be understood that each flow and/or block in the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing device to produce a machine such that the instructions executed by the processor of the computer or other programmable data processing device produce Means for implementing the functions specified in a flow or flow of a flowchart and/or a block or blocks of a block diagram.
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory result in an article of manufacture comprising instruction means, the instructions The apparatus implements the functions specified in the flow or flow of the flowcharts and/or the block or blocks of the block diagrams.
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions can also be loaded on a computer or other programmable data processing device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process such that The instructions provide steps for implementing the functions specified in the flow or blocks of the flowcharts and/or the block or blocks of the block diagrams.
在一个典型的配置中,计算设备包括一个或多个处理器(CPU)、输入/输出接口、网络接口和内存。In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
内存可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM)。内存是计算机可读介质的示例。Memory may include non-persistent memory in computer readable media, random access memory (RAM) and/or non-volatile memory in the form of, for example, read only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
计算机可读介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括,但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带,磁带磁磁盘存储或其他磁性存储设备或任何其他非传输介质,可用于存储可以被计算设备访问的信息。按照本文中的界定,计算机可读介质不包括暂存电脑可读媒体(transitory media),如调制的数据信号和载波。Computer-readable media includes both persistent and non-permanent, removable and non-removable media, and storage of information may be implemented by any method or technology. Information may be computer readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Flash Memory or other memory technology, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, Magnetic tape cassettes, magnetic tape magnetic disk storage or other magnetic storage devices or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.
还需要说明的是,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、商品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、商品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、商品或者设备中还存在另外的相同要素。It should also be noted that the terms "comprising", "comprising" or any other variation thereof are intended to encompass a non-exclusive inclusion such that a process, method, article or device comprising a series of elements includes not only those elements, but also Other elements not expressly listed, or which are inherent to such a process, method, article of manufacture, or apparatus are also included. Without further limitation, an element qualified by the phrase "comprising a..." does not preclude the presence of additional identical elements in the process, method, article of manufacture, or device that includes the element.
本领域技术人员应明白,本申请的实施例可提供为方法、系统或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。It will be appreciated by those skilled in the art that the embodiments of the present application may be provided as a method, a system or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
本申请可以在由计算机执行的计算机可执行指令的一般上下文中描述,例如程序模块。一般地,程序模块包括执行特定事务或实现特定抽象数据类型的例程、程序、对象、组件、数据结构等等。也可以在分布式计算环境中实践本申请,在这些分布式计算环境中,由通过通信网络而被连接的远程处理设备来执行事务。在分布式计算环境中,程序模块可以位于包括存储设备在内的本地和远程计算机存储介质中。The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular transactions or implement particular abstract data types. The application may also be practiced in distributed computing environments where transactions are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including storage devices.
本说明书中的各个实施例均采用递进的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于系统实施例而言,由于其基本相似于方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。Each embodiment in this specification is described in a progressive manner, and the same and similar parts between the various embodiments may be referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, as for the system embodiments, since they are basically similar to the method embodiments, the description is relatively simple, and for related parts, please refer to the partial descriptions of the method embodiments.
以上所述仅为本申请的实施例而已,并不用于限制本申请。对于本领域技术人员来说,本申请可以有各种更改和变化。凡在本申请的精神和原理之内所作的任何修改、等同替换、改进等,均应包含在本申请的权利要求范围之内。The above descriptions are merely examples of the present application, and are not intended to limit the present application. Various modifications and variations of this application are possible for those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this application shall be included within the scope of the claims of this application.
Claims (11)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010514957.3A CN111652165B (en) | 2020-06-08 | 2020-06-08 | Mouth evaluation method, equipment and computer storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111652165A CN111652165A (en) | 2020-09-11 |
| CN111652165B true CN111652165B (en) | 2022-05-17 |
Family
ID=72347189
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010514957.3A Active CN111652165B (en) | 2020-06-08 | 2020-06-08 | Mouth evaluation method, equipment and computer storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111652165B (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112614489A (en) * | 2020-12-22 | 2021-04-06 | 作业帮教育科技(北京)有限公司 | User pronunciation accuracy evaluation method and device and electronic equipment |
| CN114708961A (en) * | 2022-03-18 | 2022-07-05 | 北京理工大学珠海学院 | Personal physiological and psychological characteristic category evaluation device and method |
Family Cites Families (17)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9495591B2 (en) * | 2012-04-13 | 2016-11-15 | Qualcomm Incorporated | Object recognition using multi-modal matching scheme |
| CN105096935B (en) * | 2014-05-06 | 2019-08-09 | 阿里巴巴集团控股有限公司 | A kind of pronunciation inputting method, device and system |
| US10339973B2 (en) * | 2017-02-13 | 2019-07-02 | International Business Machines Corporation | System and method for audio dubbing and translation of a video |
| CN107239514A (en) * | 2017-05-19 | 2017-10-10 | 邓昌顺 | A kind of plants identification method and system based on convolutional neural networks |
| CN109919166B (en) * | 2017-12-12 | 2021-04-09 | 杭州海康威视数字技术股份有限公司 | Method and apparatus for obtaining classification information of attributes |
| US10529355B2 (en) * | 2017-12-19 | 2020-01-07 | International Business Machines Corporation | Production of speech based on whispered speech and silent speech |
| CN108537702A (en) * | 2018-04-09 | 2018-09-14 | 深圳市鹰硕技术有限公司 | Foreign language teaching evaluation information generation method and device |
| EP3766065B1 (en) * | 2018-05-18 | 2025-07-09 | DeepMind Technologies Limited | Visual speech recognition by phoneme prediction |
| CN109215632B (en) * | 2018-09-30 | 2021-10-08 | 科大讯飞股份有限公司 | Voice evaluation method, device and equipment and readable storage medium |
| CN110969066B (en) * | 2018-09-30 | 2023-10-10 | 北京金山云网络技术有限公司 | Live video identification method and device and electronic equipment |
| CN109508668A (en) * | 2018-11-09 | 2019-03-22 | 北京奇艺世纪科技有限公司 | A kind of lens type information identifying method and device |
| CN109697976B (en) * | 2018-12-14 | 2021-05-25 | 北京葡萄智学科技有限公司 | Pronunciation recognition method and device |
| CN110569702B (en) * | 2019-02-14 | 2021-05-14 | 创新先进技术有限公司 | Video stream processing method and device |
| CN110059705B (en) * | 2019-04-22 | 2021-11-09 | 厦门商集网络科技有限责任公司 | OCR recognition result judgment method and device based on modeling |
| CN110288682B (en) * | 2019-06-28 | 2023-09-26 | 北京百度网讯科技有限公司 | Method and device for controlling mouth shape changes of three-dimensional virtual portraits |
| CN110534134A (en) * | 2019-09-05 | 2019-12-03 | 平安科技(深圳)有限公司 | Speech detection method, system, computer equipment and computer storage medium |
| CN110619294A (en) * | 2019-09-06 | 2019-12-27 | 中南大学 | Personalized mouth shape identification method based on RFID system customization |
- 2020-06-08 CN CN202010514957.3A patent/CN111652165B/en active Active
Also Published As
| Publication number | Publication date |
|---|---|
| CN111652165A (en) | 2020-09-11 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Basak et al. | Challenges and limitations in speech recognition technology: A critical review of speech signal processing algorithms, tools and systems | |
| CN107610717B (en) | A Many-to-One Speech Conversion Method Based on Speech Posterior Probability | |
| CN104200804B (en) | Various-information coupling emotion recognition method for human-computer interaction | |
| US20120221339A1 (en) | Method, apparatus for synthesizing speech and acoustic model training method for speech synthesis | |
| CN108346427A (en) | Voice recognition method, device, equipment and storage medium | |
| CN109859772A (en) | Emotion identification method, apparatus and computer readable storage medium | |
| CN101154379A (en) | Method and device for locating keywords in speech and speech recognition system | |
| US12211496B2 (en) | Method and apparatus with utterance time estimation | |
| CN108389573A (en) | Language recognition method and device, training method and device, medium, terminal | |
| CN111402862A (en) | Voice recognition method, device, storage medium and equipment | |
| CN112735371A (en) | Method and device for generating speaker video based on text information | |
| CN112908308B (en) | Audio processing method, device, equipment and medium | |
| CN111079423A (en) | A kind of generation method, electronic device and storage medium of dictation report reading audio | |
| Bharti et al. | Automated speech to sign language conversion using Google API and NLP | |
| CN111652165B (en) | Mouth evaluation method, equipment and computer storage medium | |
| Wang et al. | A research on HMM based speech recognition in spoken English | |
| Choi et al. | Learning to maximize speech quality directly using mos prediction for neural text-to-speech | |
| CN114255737B (en) | Voice generation method and device and electronic equipment | |
| CN115312030A (en) | Display control method, device and electronic device for virtual character | |
| CN119479609A (en) | Speech generation method, device, equipment, storage medium and product | |
| Liu et al. | Enhancing Speech Emotion Recognition with Conditional Emotion Feature Diffusion and Progressive Interleaved Learning Strategy | |
| CN117725547A (en) | Emotion and cognition evolution mode identification method based on cross-modal feature fusion network | |
| CN115083222A (en) | Information interaction method and device, electronic equipment and storage medium | |
| CN115440198B (en) | Method, apparatus, computer device and storage medium for converting mixed audio signal | |
| CN119864018B (en) | Voice evaluation method, voice evaluation device, electronic device and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||