
CN117636843A - Speech recognition method, device, electronic equipment and storage medium - Google Patents

Speech recognition method, device, electronic equipment and storage medium

Info

Publication number
CN117636843A
CN117636843A CN202311620137.2A
Authority
CN
China
Prior art keywords
model
speech recognition
loss
initial
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311620137.2A
Other languages
Chinese (zh)
Inventor
汪浩
薛征山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202311620137.2A
Publication of CN117636843A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of this application provide a speech recognition method and apparatus, an electronic device, and a storage medium. The embodiments provide a multi-task speech recognition model comprising multiple speech recognition models and integrate multiple loss functions to jointly enhance the speech recognition accuracy of the multi-task speech recognition model.

Description

Speech recognition method, apparatus, electronic device, and storage medium

Technical Field

This application relates to the field of speech processing technology, and specifically to a speech recognition method, apparatus, electronic device, and storage medium.

Background

Speech translation is an automatic translation technology that converts spoken input in one language into translated text in another language. It helps people communicate in real time across language barriers and suits scenarios such as meetings, business negotiations, and travel. Speech translation is also used in multilingual service industries to provide a better customer experience, in language learning, and in assisting the hearing-impaired, and it is widely adopted by smart devices and voice assistants. Its continued development will further expand these application scenarios, promote international exchange and cooperation, and give people more convenient ways to communicate across languages.

Traditional speech translation usually adopts a cascaded approach, in which an automatic speech recognition (ASR) model and a machine translation (MT) model are connected in sequence. However, this cascaded approach can suffer from high latency, error propagation, and a large number of parameters, and its speech recognition accuracy is limited.

Summary

Embodiments of this application provide a speech recognition method, apparatus, electronic device, and storage medium that can improve the accuracy of speech recognition.

In a first aspect, the speech recognition method provided by this application includes: obtaining an initial multi-task speech recognition model, where the initial multi-task speech recognition model includes a first initial speech recognition model and a second initial speech recognition model;

recognizing a preset speech sample sequence with the initial multi-task speech recognition model to obtain a first speech recognition result recognized by the first initial speech recognition model and a second speech recognition result recognized by the second initial speech recognition model;

calculating a first model loss of the first speech recognition result using multiple loss functions in a first loss-function set, and calculating a second model loss of the second speech recognition result using multiple loss functions in a second loss-function set;

updating the initial multi-task speech recognition model using the first model loss and the second model loss to obtain an updated multi-task speech recognition model; and

performing speech recognition using the updated multi-task speech recognition model.
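The training step described in the first aspect can be sketched in pure Python as follows. This is a minimal illustration under assumed interfaces; `model_a`, `model_b`, `loss_set_a`, and `loss_set_b` are hypothetical names, and the patent does not specify concrete APIs:

```python
def train_step(model_a, model_b, batch, loss_set_a, loss_set_b):
    """One joint training step of the multi-task speech recognition model."""
    # Both sub-models recognize the same preset speech sample sequence.
    result_a = model_a(batch)  # first speech recognition result
    result_b = model_b(batch)  # second speech recognition result
    # Every loss function in each set scores its own model's result;
    # the per-loss values are summed into one model loss per sub-model.
    loss_a = sum(fn(result_a) for fn in loss_set_a)  # first model loss
    loss_b = sum(fn(result_b) for fn in loss_set_b)  # second model loss
    # Both model losses then jointly drive the parameter update.
    return loss_a + loss_b
```

In an actual implementation the returned losses would be differentiable tensors backpropagated through both sub-models; here plain numbers stand in for them.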

In an optional embodiment, the first initial speech recognition model includes a first speech recognition module and a second speech recognition module, and the second initial speech recognition model includes a third speech recognition module and a fourth speech recognition module;

recognizing the preset speech sample sequence with the initial multi-task speech recognition model to obtain the first speech recognition result corresponding to the first initial speech recognition model and the second speech recognition result corresponding to the second initial speech recognition model includes:

recognizing the preset speech sample sequence with the first speech recognition module to obtain a speech transcription text;

recognizing the preset speech sample sequence with the second speech recognition module to obtain a first translated text;

recognizing the preset speech sample sequence with the third speech recognition module to obtain a second translated text;

recognizing the speech transcription text with the fourth speech recognition module to obtain a third translated text;

where the first speech recognition result includes the speech transcription text and the first translated text, and the second speech recognition result includes the second translated text and the third translated text.
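The four-module flow above can be sketched as follows; `mod1` through `mod4` are hypothetical stand-ins for the four speech recognition modules, not names from the patent:

```python
def multitask_forward(mod1, mod2, mod3, mod4, speech_seq):
    """Forward pass of both sub-models over one preset speech sample sequence."""
    transcript = mod1(speech_seq)      # speech transcription text
    translation_1 = mod2(speech_seq)   # first translated text, direct from speech
    translation_2 = mod3(speech_seq)   # second translated text, direct from speech
    translation_3 = mod4(transcript)   # third translated text, from the transcript
    first_result = (transcript, translation_1)
    second_result = (translation_2, translation_3)
    return first_result, second_result
```

Note the asymmetry: the fourth module consumes the first module's transcript rather than the raw speech, which is what couples the two sub-models at the data level.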

In an optional embodiment, the first loss-function set includes a first loss function, a second loss function, a third loss function, and a fourth loss function;

calculating the first model loss of the first speech recognition result using the first loss-function set, and calculating the second model loss of the second speech recognition result using the second loss-function set, includes:

calculating a third model loss of the speech transcription text using the first loss function;

calculating a fourth model loss of the first translated text using the second loss function;

taking the first initial speech recognition model as a teacher model and the second initial speech recognition model as a student model, calculating a fifth model loss between the first initial speech recognition model and the second initial speech recognition model using the third loss function;

calculating a sixth model loss between the speech transcription text and the first translated text using the fourth loss function;

where the first model loss includes the third model loss, the fourth model loss, the fifth model loss, and the sixth model loss.
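The composition of the first model loss can be written as a weighted sum of its four parts. The equal default weights below are an assumption; the patent only states that the first model loss comprises these four losses:

```python
def first_model_loss(loss_transcribe, loss_translate, loss_distill, loss_consistency,
                     weights=(1.0, 1.0, 1.0, 1.0)):
    """Combine the third, fourth, fifth, and sixth model losses into the first model loss."""
    parts = (loss_transcribe, loss_translate, loss_distill, loss_consistency)
    return sum(w * p for w, p in zip(weights, parts))
```

The same form applies symmetrically to the second model loss built from the seventh through tenth model losses.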

In an optional embodiment, the second loss-function set includes a fifth loss function, a sixth loss function, a seventh loss function, and an eighth loss function;

calculating the first model loss of the first speech recognition result using the first loss-function set, and calculating the second model loss of the second speech recognition result using the second loss-function set, further includes:

calculating a seventh model loss of the second translated text using the fifth loss function;

calculating an eighth model loss of the third translated text using the sixth loss function;

taking the first initial speech recognition model as the student model and the second initial speech recognition model as the teacher model, calculating a ninth model loss between the first initial speech recognition model and the second initial speech recognition model using the seventh loss function;

calculating a tenth model loss between the second translated text and the third translated text using the eighth loss function;

where the second model loss includes the seventh model loss, the eighth model loss, the ninth model loss, and the tenth model loss.
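The fifth and ninth model losses swap the teacher and student roles between the two sub-models, giving a bidirectional (mutual) distillation. A common choice for such a distillation loss is the KL divergence between the two models' output distributions; that choice is an assumption here, since the patent does not name the concrete function:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete output distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mutual_distillation_losses(dist_a, dist_b):
    """Distillation in both directions: model A as teacher, then model B as teacher."""
    return kl_divergence(dist_a, dist_b), kl_divergence(dist_b, dist_a)
```

Because KL divergence is asymmetric, the two directions genuinely yield two different loss terms, matching the patent's distinct fifth and ninth model losses.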

In an optional embodiment, updating the initial multi-task speech recognition model using the first model loss and the second model loss to obtain an updated multi-task speech recognition model includes:

calculating a first gradient parameter of the first initial speech recognition model based on the first model loss;

calculating a second gradient parameter of the second initial speech recognition model based on the second model loss;

simultaneously updating the model parameters of the initial multi-task speech recognition model according to the first gradient parameter and the second gradient parameter to obtain an updated multi-task speech recognition model.
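The simultaneous update can be sketched as one gradient-descent step that applies both sub-models' gradients at once. Plain SGD and the learning rate are illustrative assumptions; the patent does not fix an optimizer:

```python
def joint_update(params, grads_first, grads_second, lr=0.1):
    """Update shared model parameters with both gradient sets in a single step."""
    return [p - lr * (g1 + g2) for p, g1, g2 in zip(params, grads_first, grads_second)]
```

For parameters not shared between the two sub-models, the corresponding entry in the other gradient set would simply be zero.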

In an optional embodiment, performing speech recognition using the updated multi-task speech recognition model includes:

determining whether the model loss of the updated multi-task speech recognition model meets a target loss standard;

if the model loss of the updated multi-task speech recognition model meets the target loss standard, performing speech recognition using the updated multi-task speech recognition model.

In an optional embodiment, the updated multi-task speech recognition model includes a first speech recognition model and a second speech recognition model;

if the model loss of the updated multi-task speech recognition model meets the target model loss standard, performing speech recognition using the updated multi-task speech recognition model includes:

obtaining a target speech sample to be recognized;

recognizing the target speech sample with the second speech recognition model to obtain a target speech representation sequence corresponding to the target speech sample;

determining, with the second speech recognition model, a predicted distribution of the target speech text corresponding to the target speech representation sequence;

mapping the predicted distribution of the target speech text to obtain a target translated text of the target speech sample.
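The inference path in this optional embodiment (encode the sample into a representation sequence, predict a distribution per step, then map each distribution to text) can be sketched as follows. Greedy argmax decoding is assumed, and `encode`, `predict`, and `vocab` are hypothetical stand-ins for the trained second speech recognition model:

```python
def recognize(encode, predict, vocab, speech_sample):
    """Sketch of inference: sample -> representation sequence -> distributions -> text."""
    representation = encode(speech_sample)   # target speech representation sequence
    distributions = predict(representation)  # predicted distribution per output step
    # Map each per-step distribution to its most likely token (greedy decoding).
    tokens = [vocab[max(range(len(d)), key=d.__getitem__)] for d in distributions]
    return "".join(tokens)                   # target translated text
```

A production decoder would more likely use beam search over the predicted distributions; greedy mapping is the simplest instance of the "mapping" step named above.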

In a second aspect, the speech recognition apparatus provided by this application includes:

a model acquisition module, configured to obtain an initial multi-task speech recognition model, where the initial multi-task speech recognition model includes a first initial speech recognition model and a second initial speech recognition model;

a first speech recognition module, configured to recognize a preset speech sample sequence with the initial multi-task speech recognition model to obtain the first speech recognition result recognized by the first initial speech recognition model and the second speech recognition result recognized by the second initial speech recognition model;

a model loss calculation module, configured to calculate the first model loss of the first speech recognition result using multiple loss functions in the first loss-function set, and to calculate the second model loss of the second speech recognition result using multiple loss functions in the second loss-function set;

a model update module, configured to update the initial multi-task speech recognition model using the first model loss and the second model loss to obtain an updated multi-task speech recognition model; and

a second speech recognition module, configured to perform speech recognition using the updated multi-task speech recognition model.

In a third aspect, the electronic device provided by this application includes a memory and a processor. The memory stores a computer program, and the processor runs the computer program in the memory to implement the steps of the speech recognition method provided by this application.

In a fourth aspect, the computer-readable storage medium provided by this application stores multiple instructions suitable for loading by a processor to implement the steps of the speech recognition method provided by this application.

Embodiments of this application provide a speech recognition method, apparatus, electronic device, and storage medium. The speech recognition method trains an initial multi-task speech recognition model that includes a first initial speech recognition model and a second initial speech recognition model, uses multiple different loss functions to calculate model losses during training, and uses these model losses to simultaneously update the model parameters of the initial multi-task speech recognition model, obtaining the trained multi-task speech recognition model. The embodiments thus provide a multi-task speech recognition model containing multiple speech recognition models and integrate multiple loss functions to jointly enhance its speech recognition accuracy.

Brief Description of the Drawings

To explain the technical solutions in the embodiments of this application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of this application; those skilled in the art can obtain other drawings from them without creative effort.

Figure 1 is a schematic diagram of the speech recognition system provided by an embodiment of this application;

Figure 2 is a schematic flowchart of an embodiment of the speech recognition method provided by an embodiment of this application;

Figure 3 is a schematic flowchart of an embodiment of obtaining the first speech recognition result and the second speech recognition result provided by an embodiment of this application;

Figure 4 is a schematic architecture diagram of an embodiment of the initial speech recognition model provided by an embodiment of this application;

Figure 5 is a schematic flowchart of an embodiment of calculating the first model loss and the second model loss provided by an embodiment of this application;

Figure 6 is a schematic flowchart of an embodiment of obtaining the updated multi-task speech recognition model provided by an embodiment of this application;

Figure 7 is a schematic flowchart of an embodiment of speech recognition provided by an embodiment of this application;

Figure 8 is a schematic structural diagram of the training apparatus for the speech recognition model provided by an embodiment of this application;

Figure 9 is a schematic structural diagram of the electronic device provided by an embodiment of this application.

Detailed Description

It should be noted that the principles of this application are illustrated as implemented in a suitable computing environment. The following description is based on the illustrated specific embodiments of this application and should not be regarded as limiting other specific embodiments not detailed here.

The following description refers to "some embodiments", which describe a subset of all possible embodiments. It should be understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and they may be combined with each other where no conflict arises.

In the following description, the terms "first", "second", and "third" merely distinguish similar objects and do not imply a particular ordering of those objects. Where permitted, the specific order or sequence may be interchanged, so that the embodiments described here can be practiced in an order other than that illustrated or described.

Unless otherwise defined, all technical and scientific terms used here have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used here serve only to describe the embodiments and are not intended to limit this application.

To improve the effect of speech recognition, embodiments of this application provide a speech recognition method, a speech recognition apparatus, an electronic device, a computer-readable storage medium, and a computer program product. The speech recognition method may be executed by a training apparatus for a speech recognition model, or by an electronic device integrating such a training apparatus.

The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those skilled in the art based on these embodiments without creative effort fall within the scope of protection of this application.

Referring to Figure 1, this application also provides a speech recognition system. As shown in Figure 1, the system includes an electronic device 100 that integrates the training apparatus for the speech recognition model and/or the speech recognition apparatus provided by this application.

The electronic device 100 may be any device equipped with a processor and thus capable of processing, such as a mobile electronic device with a processor (a smartphone, tablet computer, handheld computer, notebook computer, or smart speaker) or a fixed electronic device with a processor (a desktop computer, television, server, or industrial equipment).

In addition, as shown in Figure 1, the speech recognition system may also include a memory 200 for storing the speech samples to be recognized.

In this embodiment, the memory 200 may be cloud storage. Cloud storage is a concept extended and developed from cloud computing. A distributed cloud storage system (hereinafter, storage system) uses cluster applications, grid technology, distributed storage file systems, and similar functions to bring together a large number of storage devices of different types in a network (also called storage nodes), through application software or application interfaces, so that they work cooperatively and jointly provide external data storage and business access functions.

At present, such a storage system stores data as follows: logical volumes are created and, at creation time, each logical volume is allocated physical storage space, which may consist of the disks of one or several storage devices. A client stores data on a logical volume, that is, on the file system. The file system divides the data into many parts, each of which is an object; an object contains not only the data but also additional information such as a data identifier (ID). The file system writes each object separately into the physical storage space of the logical volume and records the storage location of each object, so that when a client requests access to data, the file system can let the client access the data according to each object's storage location information.

The storage system allocates physical storage space to a logical volume as follows: according to an estimate of the capacity of the objects to be stored in the logical volume (this estimate usually leaves a large margin over the actual capacity to be stored) and the grouping of a redundant array of independent disks (RAID), the physical storage space is divided into stripes in advance; a logical volume can be understood as one stripe, and the physical storage space is thereby allocated to the logical volume.

It should be noted that the scenario diagram of the speech recognition system shown in Figure 1 is only an example. The speech recognition system and scenarios described in the embodiments serve to explain the technical solutions more clearly and do not limit them. Those of ordinary skill in the art will appreciate that, as speech recognition systems evolve and new business scenarios emerge, the technical solutions provided by these embodiments remain applicable to similar technical problems.

Each aspect is explained in detail below. Note that the numbering of the following embodiments does not limit their preferred order.

请参照图2,图2是本申请实施例提供的语音识别方法的一个实施例的流程示意图,如图2所示,本申请提供的语音识别方法的流程如下:Please refer to Figure 2. Figure 2 is a schematic flow chart of an embodiment of the speech recognition method provided by the embodiment of the present application. As shown in Figure 2, the flow of the speech recognition method provided by the present application is as follows:

201、获取初始多任务语音识别模型。201. Obtain the initial multi-task speech recognition model.

多任务模型顾名思义即存在多个不同的任务模型同时进行模型训练和后续识别过程,多任务模型由于同时利用多个不同的模型,因此可以有效的提高模型训练和识别的效率,还可以在一定程度上缓解模型的过拟合,提高模型的泛化能力。本申请提供一种多任务语音识别模型就包括多个不同的语音识别模型,通过对该多任务语音识别模型进行训练,得到语音识别效果更好的语音识别模型。As the name suggests, the multi-task model means that there are multiple different task models that perform model training and subsequent recognition processes at the same time. The multi-task model uses multiple different models at the same time, so it can effectively improve the efficiency of model training and recognition, and can also improve the efficiency of model training and recognition to a certain extent. to alleviate the overfitting of the model and improve the generalization ability of the model. This application provides a multi-task speech recognition model that includes multiple different speech recognition models. By training the multi-task speech recognition model, a speech recognition model with better speech recognition effect can be obtained.

Specifically, an initial multi-task speech recognition model needs to be obtained, where the initial multi-task speech recognition model includes a first initial speech recognition model and a second initial speech recognition model. In this embodiment of the present application, the first initial speech recognition model and the second initial speech recognition model are trained simultaneously so as to adjust the model parameters of the initial multi-task speech recognition model and thereby train it. The specific training process is described in detail in subsequent embodiments and is not limited here.

202. Use the initial multi-task speech recognition model to recognize a preset speech sample sequence, obtaining a first speech recognition result produced by the first initial speech recognition model and a second speech recognition result produced by the second initial speech recognition model.

203. Calculate a first model loss of the first speech recognition result using multiple loss functions in a first loss function set, and calculate a second model loss of the second speech recognition result using multiple loss functions in a second loss function set.

In the process of training the initial multi-task speech recognition model, the model is first used to recognize speech samples, mainly a preset speech sample sequence, to obtain a speech recognition result. Specifically, the first initial speech recognition model recognizes the preset speech sample sequence to obtain the first speech recognition result, and the second initial speech recognition model recognizes the preset speech sample sequence to obtain the second speech recognition result. The losses of the speech recognition results are then calculated, that is, the first model loss corresponding to the first speech recognition result and the second model loss corresponding to the second speech recognition result.

In the embodiments of the present application, the first model loss and the second model loss are calculated separately, and multiple loss functions are used for each calculation. Specifically, multiple loss functions in the first loss function set are used to calculate the first model loss of the first speech recognition result, and multiple loss functions in the second loss function set are used to calculate the second model loss of the second speech recognition result. Unlike the prior art, which uses only a single loss function to calculate the model loss, the present application uses multiple loss functions to calculate the model losses of the different speech recognition models separately, which yields more accurate model losses and thus better optimizes the initial multi-task speech recognition model.

204. Update the initial multi-task speech recognition model using the first model loss and the second model loss to obtain an updated multi-task speech recognition model.

205. Use the updated multi-task speech recognition model to perform speech recognition.

After the first model loss and the second model loss are calculated, both losses can be used simultaneously to update the initial multi-task speech recognition model, yielding an updated multi-task speech recognition model. It is then judged again whether the updated multi-task speech recognition model meets the model requirements, mainly whether its model loss satisfies a target loss criterion, or whether the updated model has converged. If the target loss criterion is satisfied, the updated multi-task speech recognition model has met the requirements and training can be stopped; the updated model is then used for speech recognition. If the target loss criterion is not satisfied, the model parameters of the multi-task speech recognition model are readjusted according to the new model loss to obtain a new multi-task speech recognition model.

The embodiments of the present application provide a speech recognition method in which an initial multi-task speech recognition model, comprising a first initial speech recognition model and a second initial speech recognition model, is trained; during training, multiple different loss functions are used to calculate model losses, and the multiple model losses are used to simultaneously update the model parameters of the initial multi-task speech recognition model, yielding a trained multi-task speech recognition model. The embodiments thus provide a multi-task speech recognition model that includes multiple speech recognition models and integrates multiple loss functions to jointly enhance the speech recognition accuracy of the multi-task speech recognition model.

As shown in Figure 3, which is a schematic flow chart of an embodiment of obtaining the first speech recognition result and the second speech recognition result provided by the embodiments of the present application, the process mainly includes the following steps:

301. Use the first speech recognition module to recognize the preset speech sample sequence, obtaining a speech transcription text.

302. Use the second speech recognition module to recognize the preset speech sample sequence, obtaining a first translated text.

303. Use the third speech recognition module to recognize the preset speech sample sequence, obtaining a second translated text.

304. Use the fourth speech recognition module to recognize the speech transcription text, obtaining a third translated text.

The first speech recognition result includes the speech transcription text and the first translated text; the second speech recognition result includes the second translated text and the third translated text.

As shown in Figure 4, which is a schematic architectural diagram of an embodiment of the initial speech recognition model provided by the embodiments of the present application, the speech recognition process is described below with reference to Figure 3 and Figure 4. In the embodiment shown in Figure 4, the initial multi-task speech recognition model includes the first initial speech recognition model on the left side of the figure and the second initial speech recognition model on the right side; each of the two initial speech recognition models has multiple different functional modules. The first initial speech recognition model on the left mainly includes a speech encoder, a first shared codec, a first speech recognition module, and a second speech recognition module; the second initial speech recognition model mainly includes the speech encoder, a second shared codec, a third speech recognition module, and a fourth speech recognition module. The first initial speech recognition model and the second initial speech recognition model share the same speech encoder; that is, the model architecture shown in Figure 4 actually includes only one speech encoder. The first initial speech recognition model and the second initial speech recognition model each have their own shared codec; the first shared codec and the second shared codec have the same structure, but their parameters differ so that each is adapted to the first initial speech recognition model and the second initial speech recognition model respectively.

In the above embodiment, the speech encoder mainly includes a speech pre-training model for recognizing the speech sequence; the speech encoder can extract features from the speech sequence while compressing the length of the speech sequence. Generally, the length of a speech sequence is much greater than that of a text sequence, which consumes a large amount of computing resources; therefore, by inputting the obtained speech sequence into the speech encoder, which extracts speech features and compresses the sequence length, the consumption of computing resources is reduced. In the embodiments of the present application, the speech encoder not only includes a speech pre-training model, but may also cascade multiple convolutional neural network (CNN) layers after the pre-training model, for example two CNN layers. The cascaded CNN layers can strengthen the non-linear representation capability of the speech features extracted by the pre-training model, facilitating subsequent recognition of the speech features.
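The length compression described above can be illustrated with a minimal sketch (a hypothetical NumPy implementation, not the actual encoder of this application): two strided 1-D convolution layers, each roughly halving the frame count of a feature sequence produced by the pre-training model.

```python
import numpy as np

def conv1d(x, weight, stride=2):
    """Strided 1-D convolution over a (frames, dims) feature sequence."""
    k, d_in, d_out = weight.shape          # kernel size, input dims, output dims
    n_out = (len(x) - k) // stride + 1     # compressed number of frames
    out = np.empty((n_out, d_out))
    for i in range(n_out):
        window = x[i * stride:i * stride + k]        # (k, d_in) slice
        out[i] = np.tanh(np.tensordot(window, weight, axes=([0, 1], [0, 1])))
    return out

rng = np.random.default_rng(0)
features = rng.standard_normal((100, 8))   # 100 frames from the pre-trained model
w1 = rng.standard_normal((3, 8, 16)) * 0.1
w2 = rng.standard_normal((3, 16, 16)) * 0.1

h = conv1d(conv1d(features, w1), w2)       # two cascaded CNN layers
print(features.shape, '->', h.shape)       # sequence length roughly quartered
```

With stride 2 in each layer, 100 input frames are reduced to 24, so the downstream codec processes a far shorter sequence.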

In the foregoing embodiment, the first initial speech recognition model and the second initial speech recognition model share the same speech encoder. On this basis, in other embodiments the two models may also share the same shared codec, so that the multi-task speech recognition model includes only one speech encoder and one shared codec. This further strengthens the connection between the first speech recognition model and the second speech recognition model, improves the accuracy of the multi-task speech recognition model, and reduces the consumption of computing resources.

The input of the speech encoder provided by the embodiments of the present application is the original speech sequence to be recognized, and the output is a speech representation sequence carrying acoustic features. Since the first initial speech recognition model and the second initial speech recognition model share the same speech encoder, the speech representation sequence output by the encoder is fed into both the first and the second initial speech recognition models, which then carry out their subsequent speech recognition processes.

For the first initial speech recognition model, the speech representation sequence output by the speech encoder is fed into the first shared codec and, combined with the first speech recognition module and the second speech recognition module, yields the speech transcription text and the first translated text respectively. Taking a representation sequence in language A as an example, the first speech recognition module transcribes it, obtaining a speech transcription text in language A. The second speech recognition module not only transcribes the speech into text but ultimately produces a first translated text in language B, different from language A; that is, the second speech recognition module performs both transcription and translation. Although both modules ultimately output text, the language of the speech transcription text is the same as that of the input (language A), while the language of the first translated text differs from the input (language B). In this embodiment, the inputs of the first speech recognition module and the second speech recognition module are the same, but their outputs differ.

The second initial speech recognition model includes not only the speech encoder and the second shared codec, but also a text embedding module. The speech transcription text recognized by the first initial speech recognition model serves as an input to the second initial speech recognition model. Specifically, in the second initial speech recognition model, the speech encoder recognizes the preset speech sample sequence as usual to obtain the speech representation sequence, while the speech transcription text is fed into the text embedding module, which embeds it into a corresponding mathematical representation, that is, converts it into data that the second initial speech recognition model can process. The specific conversion process may refer to the prior art and is not limited here.
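As an illustration of the embedding step (a minimal sketch with a hypothetical vocabulary and dimension; the actual embedding of this application is not specified), the text embedding module can be viewed as a lookup table mapping each token of the transcription text to a dense vector:

```python
import numpy as np

vocab = {"<pad>": 0, "hello": 1, "world": 2}   # hypothetical token vocabulary
d_model = 4
rng = np.random.default_rng(1)
embedding_table = rng.standard_normal((len(vocab), d_model))

def embed(tokens):
    """Map a tokenized transcription text to a (len, d_model) matrix."""
    ids = [vocab[t] for t in tokens]
    return embedding_table[ids]

x = embed(["hello", "world"])
print(x.shape)   # one d_model-dimensional vector per token
```

The resulting matrix is what the second model's codec consumes in place of raw text.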

The second initial speech recognition model further includes a third speech recognition module and a fourth speech recognition module. The shared codec and the third speech recognition module jointly recognize the speech representation sequence to obtain the second translated text; the shared codec and the fourth speech recognition module jointly recognize the embedded speech transcription text to obtain the third translated text. Again taking a representation sequence in language A as an example: the foregoing embodiment established that the speech transcription text is in language A, whereas the second translated text is in language B, that is, the third speech recognition module not only converts speech into text but also translates it; and the fourth speech recognition module recognizes the language-A transcription text, obtaining the third translated text in language B. In other words, in this embodiment the inputs of the third and fourth speech recognition modules differ, but their final outputs are the same, both being translated text in language B.

It should be noted that when the above embodiment states that the outputs of the third speech recognition module and the fourth speech recognition module are the same, this means that both output translated text in language B; it does not mean that the specific content of the actually output translated texts is identical. Meanwhile, the first and second initial speech recognition models share the same speech encoder, which reduces computational cost; and the speech transcription text recognized by the first initial speech recognition model serves as an input to the second initial speech recognition model, which promotes knowledge transfer between the different speech recognition models and further reduces model training cost.

In the above embodiments, the initial multi-task speech recognition model recognizes the preset speech sample sequence and produces recognition results; those recognition results must then be used to update the initial multi-task speech recognition model, that is, to train it. Unlike the prior art, which uses only one or two loss functions to train the initial multi-task speech recognition model, the present application uses more than two loss functions. Specifically, a first loss function set containing multiple loss functions is used to train the first initial speech recognition model, and a second loss function set containing multiple loss functions is used to train the second initial speech recognition model.

In a specific embodiment, the first loss function set includes a first loss function, a second loss function, a third loss function, and a fourth loss function; the second loss function set includes a fifth loss function, a sixth loss function, a seventh loss function, and an eighth loss function. As shown in Figure 5, the steps of calculating the first model loss and the second model loss may then include:

501. Use the first loss function to calculate a third model loss of the speech transcription text.

502. Use the second loss function to calculate a fourth model loss of the first translated text.

503. Taking the first initial speech recognition model as the teacher model and the second initial speech recognition model as the student model, use the third loss function to calculate a fifth model loss between the first and second initial speech recognition models.

504. Use the fourth loss function to calculate a sixth model loss between the speech transcription text and the first translated text.

505. Use the fifth loss function to calculate a seventh model loss of the second translated text.

506. Use the sixth loss function to calculate an eighth model loss of the third translated text.

507. Taking the first initial speech recognition model as the student model and the second initial speech recognition model as the teacher model, use the seventh loss function to calculate a ninth model loss between the first and second initial speech recognition models.

508. Use the eighth loss function to calculate a tenth model loss between the second translated text and the third translated text.

Specifically, the initial multi-task speech recognition model provided by the present application produces four outputs: the speech transcription text, the first translated text, the second translated text, and the third translated text; a loss must therefore be calculated for each of the four outputs. This includes using the first loss function to calculate the third model loss of the speech transcription text, the second loss function to calculate the fourth model loss of the first translated text, the fifth loss function to calculate the seventh model loss of the second translated text, and the sixth loss function to calculate the eighth model loss of the third translated text.
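A per-output loss of this kind is commonly a token-level cross-entropy between the module's predicted distribution and the reference text (a hedged sketch under that assumption; the application does not fix the concrete loss functions):

```python
import numpy as np

def cross_entropy(probs, targets):
    """Mean negative log-likelihood of the reference tokens.

    probs:   (seq_len, vocab) predicted distributions (rows sum to 1)
    targets: (seq_len,) reference token ids
    """
    return float(-np.mean(np.log(probs[np.arange(len(targets)), targets])))

probs = np.array([[0.7, 0.2, 0.1],    # toy two-token output
                  [0.1, 0.8, 0.1]])
targets = np.array([0, 1])            # reference token at each position
loss = cross_entropy(probs, targets)
print(round(loss, 4))
```

One such loss would be evaluated per output text (transcription, first, second, and third translation) against its own reference.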

Meanwhile, since the initial multi-task speech recognition model includes the first initial speech recognition model and the second initial speech recognition model, and the output of the first model serves as an input of the second, the two models influence each other. The embodiments of the present application therefore also calculate the model loss incurred, during speech recognition, by the mutually influencing first and second initial speech recognition models. Specifically, the first initial speech recognition model may first be taken as the teacher model and the second initial speech recognition model as the student model, and the third loss function used to calculate the fifth model loss between the two models; then the first initial speech recognition model is taken as the student model and the second as the teacher model, and the seventh loss function used to calculate the ninth model loss between the two models. The present application thus additionally introduces the third loss function and the seventh loss function to calculate the model loss between the first and second initial speech recognition models, so that the mutual influence between the different models is taken into account and the multi-task speech recognition model is trained better.
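Teacher-student losses of this kind are typically a KL divergence from the teacher's output distribution to the student's (a sketch under that common distillation assumption; the application does not specify the exact form). Swapping which model plays teacher changes the value, because KL divergence is asymmetric, which is consistent with the fifth and ninth model losses being distinct quantities:

```python
import numpy as np

def kl_div(p, q):
    """KL(p || q) between two categorical distributions (p is the teacher)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

model_a = [0.7, 0.2, 0.1]   # e.g. first model's output distribution
model_b = [0.5, 0.3, 0.2]   # e.g. second model's output distribution

# fifth-model-loss analogue: first model as teacher
print(kl_div(model_a, model_b))
# ninth-model-loss analogue: second model as teacher; a different value
print(kl_div(model_b, model_a))
```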

The foregoing embodiments describe the third, fourth, fifth, seventh, eighth, and ninth model losses; the model losses calculated in the embodiments of the present application further include a sixth model loss and a tenth model loss. Specifically, the first speech recognition module and the second speech recognition module included in the first initial speech recognition model output the speech transcription text and the first translated text respectively, that is, texts in different languages, so the speech transcription text and the first translated text can lie in the same semantic representation space. To maintain the consistency of the speech transcription text and the first translated text in that semantic representation space, a sixth model loss between them must also be calculated. Generally, the smaller the sixth model loss, the better the recognition effect of the first and second speech recognition modules, that is, the better the performance of the multi-task speech recognition model. The present application therefore introduces the fourth loss function and regularizes the first and second speech recognition modules in the semantic representation space. The specific calculation process may refer to the prior art and is not limited here.

Similarly, for the second initial speech recognition model, the outputs of the third speech recognition module and the fourth speech recognition module are both translated text in the same language, so the two outputs share the output space of that language. To maintain the consistency between the second translated text and the third translated text, a tenth model loss between the third and fourth speech recognition modules must be calculated within the output space of that same language. Specifically, an eighth loss function can be introduced to regularize the third and fourth speech recognition modules within the output space of the same language. The specific calculation process may refer to the prior art and is not limited here.

It should be noted that in the above embodiments, the third loss function and the seventh loss function may be the same concrete function; however, taking the first and the second initial speech recognition model in turn as the teacher model yields different model losses even with the same loss function. In a specific embodiment, the fourth loss function may be a mean square error (MSE) loss function, and the eighth loss function may be a Kullback-Leibler (KL) divergence loss function; the other loss functions may be chosen according to actual needs and are not limited here. Meanwhile, the regularization in the foregoing embodiments is a hard constraint; in other embodiments, soft constraints may instead be applied to the different speech recognition modules, which can likewise ensure consistency between the translated texts. The specific calculation process may refer to the prior art and is not limited here.
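For instance, the MSE consistency term mentioned above could be computed between the two modules' hidden representations of the same utterance (a hypothetical sketch; shapes and values are illustrative, not taken from the application):

```python
import numpy as np

def mse_loss(h_transcribe, h_translate):
    """Mean squared error between two (seq_len, dim) representation matrices."""
    return float(np.mean((h_transcribe - h_translate) ** 2))

h1 = np.array([[0.0, 1.0], [2.0, 3.0]])   # first module's semantic representations
h2 = np.array([[0.0, 1.0], [2.0, 5.0]])   # second module's semantic representations
print(mse_loss(h1, h2))   # only one of four entries differs, by 2 -> 4 / 4 = 1.0
```

Minimizing this term pulls the two modules' representations of the same input together, which is the regularization effect described above.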

As shown in Figure 6, which is a schematic flow chart of an embodiment of obtaining the updated multi-task speech recognition model provided by the embodiments of the present application, the process may include the following steps:

601. Calculate first gradient parameters of the first initial speech recognition model according to the first model loss.

602. Calculate second gradient parameters of the second initial speech recognition model according to the second model loss.

603. Simultaneously update the model parameters of the initial multi-task speech recognition model according to the first gradient parameters and the second gradient parameters, obtaining the updated multi-task speech recognition model.

In the embodiments of the present application, after the model losses are calculated with the loss functions, they can be used to update the initial multi-task speech recognition model. Specifically, the first model loss may be back-propagated to calculate the first gradient parameters of the first initial speech recognition model, and the second model loss used to calculate the second gradient parameters of the second initial speech recognition model. However, because the first and second initial speech recognition models share one speech encoder, and the output of the first model serves as an input of the second, the first gradient parameters and the second gradient parameters must be used together to update the model parameters of the initial multi-task speech recognition model; the first gradient parameters alone cannot be used to update the first initial speech recognition model, nor the second gradient parameters alone to update the second initial speech recognition model. The specific processes of computing model gradients and updating model parameters may refer to the prior art and are not limited here.
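The joint update can be pictured on a toy shared parameter: both losses depend on the same encoder weight, so its gradient is the sum of the two contributions and a single optimizer step applies both at once (a minimal sketch with made-up scalar losses, not the application's actual objective):

```python
import numpy as np

w_shared = np.array([1.0, -2.0])   # stands in for the shared speech encoder weights
lr = 0.1

# hypothetical quadratic losses for the two sub-models
def loss1(w): return float(np.sum((w - 1.0) ** 2))
def loss2(w): return float(np.sum((w + 1.0) ** 2))

grad1 = 2 * (w_shared - 1.0)   # d(loss1)/dw: the "first gradient parameters"
grad2 = 2 * (w_shared + 1.0)   # d(loss2)/dw: the "second gradient parameters"

w_shared = w_shared - lr * (grad1 + grad2)   # one simultaneous update step
print(w_shared)
```

Updating with only `grad1` or only `grad2` would pull the shared weights toward one sub-model's optimum at the other's expense, which is why the step combines both.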

After the updated multi-task speech recognition model is obtained, it must also be judged whether the updated model satisfies the target speech recognition model criterion, for example whether it has converged. If the updated multi-task speech recognition model has converged, it is confirmed that the speech recognition model meets the requirements and training can be stopped; if the updated model has not converged, training of the multi-task speech model continues until the final multi-task speech recognition model converges or satisfies some other speech recognition model criterion.
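A convergence check of the kind described can, for example, stop training once the combined loss improves by less than a tolerance between epochs (one simple criterion among many; the application does not prescribe it):

```python
def train_until_converged(losses, tol=1e-3):
    """Return the epoch at which the loss improvement first drops below tol."""
    prev = float("inf")
    for epoch, loss in enumerate(losses):
        if prev - loss < tol:      # model has (approximately) converged
            return epoch
        prev = loss
    return len(losses)             # never converged within the given epochs

history = [2.0, 1.2, 0.9, 0.8995, 0.8994]
print(train_until_converged(history))   # improvement below 1e-3 first at epoch 3
```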

After the final target multi-task speech recognition model that meets the target speech recognition model standard has been obtained, it can be used for speech recognition. Figure 7 is a schematic flow diagram of one embodiment of speech recognition provided by an embodiment of this application, comprising:

701. Obtain the target speech sample to be recognized.

702. Recognize the target speech sample with the second speech recognition model to obtain the target speech representation sequence corresponding to the target speech sample.

703. Use the second speech recognition model to determine the predicted distribution of the target speech text corresponding to the target speech representation sequence.

704. Map the predicted distribution of the target speech text to obtain the target translated text of the target speech sample.

During actual speech recognition, only the trained second speech recognition model is used to recognize the target speech sample to be recognized; the first speech recognition model is not needed to assist recognition. Specifically, the speech encoder of the second speech recognition model can recognize the target speech sample to obtain the corresponding target speech representation sequence, which is then input to the second shared encoder. The second shared encoder and the fourth speech recognition module jointly recognize the target speech representation sequence to obtain the predicted distribution of the target speech text corresponding to the target speech representation sequence, and this predicted distribution is then mapped to obtain the target translated text of the target speech sample. The target translated text is in a different language from the target speech sample; that is, the second speech recognition model translates the target speech sample into translated text in a different language.
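A minimal sketch of steps 702–704, assuming a tiny hypothetical vocabulary and greedy mapping (the application does not fix the decoding strategy); the encoder and decoder stages are represented only by the per-step token distributions they would produce:

```python
import math

# Hypothetical vocabulary of the target (translation) language.
VOCAB = ["<pad>", "hello", "world", "bonjour", "monde"]

def softmax(logits):
    # Convert raw per-token scores into a predicted distribution (step 703).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def map_distribution_to_text(step_distributions, vocab=VOCAB):
    # "Mapping the predicted distribution" (step 704): here, greedy decoding
    # picks the most probable token at each step; beam search also applies.
    return [vocab[max(range(len(d)), key=d.__getitem__)]
            for d in step_distributions]

# Stand-in for what the encoder + shared decoder would emit for one sample.
dists = [softmax([0.1, 0.2, 0.1, 2.5, 0.3]),   # peaks at "bonjour"
         softmax([0.1, 0.2, 0.3, 0.1, 2.0])]   # peaks at "monde"
print(map_distribution_to_text(dists))  # → ['bonjour', 'monde']
```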

In the above embodiments, this application mainly improves the existing multi-task speech recognition model. Therefore, compared with an existing speech recognition model, the parameter count and decoding speed of the second speech recognition module in the multi-task speech recognition model provided by this application remain consistent with those of the fourth speech recognition module in the existing speech recognition model. No additional inference overhead is therefore introduced, and the speech recognition accuracy is improved without consuming many extra computing resources.

As shown in the table below, the multi-task speech recognition model provided by this application improves recognition accuracy over existing speech recognition models. Here, MTL-Baseline is the prior-art multi-task speech recognition model described in the preceding embodiments, and ST-MKD-ASR and ST-MKD-MT are the first speech recognition model and the second speech recognition model provided by this application, respectively. Wav2vec 2.0, HuBERT, and WavLM are different speech pre-training models used to recognize speech sequences; ConST (SOTA) is the actual speech pre-training model modified from Wav2vec 2.0, and CRESS is the actual speech pre-training model modified from HuBERT. The table shows that, compared with the prior-art MTL-Baseline model, both ST-MKD-ASR and ST-MKD-MT improve speech recognition accuracy. Moreover, using WavLM as the speech pre-training model improves the accuracy further.

Meanwhile, in the embodiments of this application the first speech recognition model and the second speech recognition model each serve in turn as the teacher model when computing the model loss. Compared with the prior art, which computes the model loss only once, this also effectively improves the accuracy of the final multi-task speech recognition model. In the table below, MKD denotes the speech recognition accuracy obtained with bidirectional knowledge distillation, that is, with the first speech recognition model and the second speech recognition model each used as the teacher model when computing the model loss; KD denotes the accuracy obtained with unidirectional knowledge distillation, that is, with only the first speech recognition model or only the second speech recognition model used as the teacher model.
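The bidirectional (MKD) versus unidirectional (KD) distinction can be sketched with a symmetric distillation term; KL divergence is assumed here as the distillation loss, a common choice that the text above does not itself specify:

```python
import math

def kl_div(p, q, eps=1e-12):
    # KL(p || q) between two discrete distributions; eps guards log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

def unidirectional_kd_loss(teacher_dist, student_dist):
    # KD: only one model acts as teacher.
    return kl_div(teacher_dist, student_dist)

def mutual_kd_loss(dist_a, dist_b):
    # MKD: each model distills from the other, so both directions contribute.
    return kl_div(dist_a, dist_b) + kl_div(dist_b, dist_a)

p = [0.7, 0.2, 0.1]  # hypothetical output distribution of model A
q = [0.6, 0.3, 0.1]  # hypothetical output distribution of model B
loss = mutual_kd_loss(p, q)
```

Unlike plain KL, the mutual term is symmetric in the two models, matching the description of each model taking the teacher role in turn.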

In other embodiments of this application, additional translated text can also be introduced to assist the speech pre-training model in speech recognition; that is, a more accurate speech pre-training model is obtained to improve the recognition accuracy of speech sequences, thereby improving the recognition accuracy of the multi-task speech recognition model.

To facilitate better implementation of the speech recognition method provided by the embodiments of this application, the embodiments of this application also provide a training apparatus for a speech recognition model based on the above speech recognition method. The terms have the same meanings as in the above speech recognition method; for specific implementation details, refer to the descriptions in the method embodiments above.

Referring to Figure 8, Figure 8 is a schematic structural diagram of the training apparatus for a speech recognition model provided by an embodiment of this application. The training apparatus may include a model acquisition module 801, a first speech recognition module 802, a model loss calculation module 803, a model update module 804, and a second speech recognition module 805, wherein:

The model acquisition module 801 is configured to acquire an initial multi-task speech recognition model, which includes a first initial speech recognition model and a second initial speech recognition model.

The first speech recognition module 802 is configured to recognize a preset speech sample sequence with the initial multi-task speech recognition model, obtaining a first speech recognition result recognized by the first initial speech recognition model and a second speech recognition result recognized by the second initial speech recognition model.

The model loss calculation module 803 is configured to calculate a first model loss of the first speech recognition result using multiple loss functions in a first loss function set, and a second model loss of the second speech recognition result using multiple loss functions in a second loss function set.

The model update module 804 is configured to update the initial multi-task speech recognition model using the first model loss and the second model loss, obtaining an updated multi-task speech recognition model.

The second speech recognition module 805 is configured to perform speech recognition using the updated multi-task speech recognition model.

The embodiments of this application provide a speech recognition apparatus that trains an initial multi-task speech recognition model comprising a first initial speech recognition model and a second initial speech recognition model, computes model losses with multiple different loss functions during training, and uses the multiple model losses to simultaneously update the model parameters of the initial multi-task speech recognition model, obtaining the trained multi-task speech recognition model. The embodiments thus provide a multi-task speech recognition model containing multiple speech recognition models, while integrating multiple loss functions to jointly enhance the speech recognition accuracy of the multi-task speech recognition model.

In some embodiments, the first initial speech recognition model includes a first speech recognition module and a second speech recognition module, and the second initial speech recognition model includes a third speech recognition module and a fourth speech recognition module. The first speech recognition module 802 is mainly configured to:

recognize the preset speech sample sequence with the first speech recognition module to obtain a speech transcription text;

recognize the preset speech sample sequence with the second speech recognition module to obtain a first translated text;

recognize the preset speech sample sequence with the third speech recognition module to obtain a second translated text;

recognize the speech transcription text with the fourth speech recognition module to obtain a third translated text;

wherein the first speech recognition result includes the speech transcription text and the first translated text, and the second speech recognition result includes the second translated text and the third translated text.

In some embodiments, the first loss function set includes a first loss function, a second loss function, a third loss function, and a fourth loss function. The model loss calculation module 803 may specifically be configured to:

calculate a third model loss of the speech transcription text using the first loss function; calculate a fourth model loss of the first translated text using the second loss function;

with the first initial speech recognition model as the teacher model and the second initial speech recognition model as the student model, calculate a fifth model loss between the first initial speech recognition model and the second initial speech recognition model using the third loss function;

calculate a sixth model loss between the speech transcription text and the first translated text using the fourth loss function;

wherein the first model loss includes the third model loss, the fourth model loss, the fifth model loss, and the sixth model loss.
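A compact sketch of how the four partial losses above could be aggregated into the first model loss; the weighted sum and the weight values are assumptions for illustration, not values specified by the application:

```python
# Hypothetical aggregation: the first model loss as a weighted sum of the
# third through sixth model losses (transcription, translation,
# distillation, and consistency terms).

def first_model_loss(l_transcription, l_translation, l_distill, l_consistency,
                     weights=(1.0, 1.0, 1.0, 1.0)):
    parts = (l_transcription, l_translation, l_distill, l_consistency)
    return sum(w * l for w, l in zip(weights, parts))
```

The second model loss (seventh through tenth losses) would be aggregated the same way, so the same helper applies to both branches.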

In some embodiments, the second loss function set includes a fifth loss function, a sixth loss function, a seventh loss function, and an eighth loss function. The model loss calculation module 803 may specifically be configured to:

calculate a seventh model loss of the second translated text using the fifth loss function; calculate an eighth model loss of the third translated text using the sixth loss function;

with the first initial speech recognition model as the student model and the second initial speech recognition model as the teacher model, calculate a ninth model loss between the first initial speech recognition model and the second initial speech recognition model using the seventh loss function;

calculate a tenth model loss between the second translated text and the third translated text using the eighth loss function;

wherein the second model loss includes the seventh model loss, the eighth model loss, the ninth model loss, and the tenth model loss.

In some embodiments, the model update module 804 may specifically be configured to:

calculate a first gradient parameter in the first initial speech recognition model according to the first model loss;

calculate a second gradient parameter in the second initial speech recognition model according to the second model loss;

simultaneously update the model parameters of the initial multi-task speech recognition model according to the first gradient parameter and the second gradient parameter, obtaining an updated multi-task speech recognition model.

In some embodiments, the second speech recognition module 805 may specifically be configured to:

determine whether the model loss of the updated multi-task speech recognition model meets the target loss standard;

if the model loss of the updated multi-task speech recognition model meets the target loss standard, perform speech recognition using the updated multi-task speech recognition model.

In some embodiments, the updated multi-task speech recognition model includes a first speech recognition model and a second speech recognition model. The second speech recognition module 805 may specifically be configured to:

obtain a target speech sample to be recognized;

recognize the target speech sample with the second speech recognition model to obtain a target speech representation sequence corresponding to the target speech sample;

determine, with the second speech recognition model, the predicted distribution of the target speech text corresponding to the target speech representation sequence; and map the predicted distribution of the target speech text to obtain the target translated text of the target speech sample.

An embodiment of this application also provides an electronic device including a memory and a processor, where the processor executes the steps of the speech recognition method provided by the embodiments by invoking the computer program stored in the memory.

Referring to Figure 9, Figure 9 is a schematic structural diagram of the electronic device provided by an embodiment of this application. The electronic device may include a processor 901 with one or more processing cores, a memory 902 of one or more computer-readable storage media, a power supply 903, an input unit 904, and other components. Those skilled in the art will understand that the structure shown in Figure 9 does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or arrange components differently. Wherein:

The processor 901 is the control center of the electronic device and connects all parts of the entire device through various interfaces and lines. By running or executing the software programs and/or modules stored in the memory 902, and invoking the data stored in the memory 902, it performs the various functions of the electronic device and processes data. Optionally, the processor 901 may include one or more processing cores; optionally, it may integrate an application processor, which mainly handles the operating system, user interface, application programs, and the like, and a modem processor, which mainly handles wireless communication. It will be understood that the modem processor need not be integrated into the processor 901.

The memory 902 can be used to store software programs and modules; the processor 901 performs various functional applications and data processing by running the software programs and modules stored in the memory 902. The memory 902 may mainly include a program storage area, which can store the operating system and the application programs required for at least one function (such as a sound playback function or an image playback function), and a data storage area, which can store data created through use of the electronic device. In addition, the memory 902 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 902 may also include a memory controller to provide the processor 901 with access to the memory 902.

The electronic device also includes a power supply 903 that supplies power to the various components. Optionally, the power supply 903 can be logically connected to the processor 901 through a power management system, so that charging, discharging, power consumption management, and similar functions are managed through the power management system. The power supply 903 may also include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, or any other such components.

The electronic device may also include an input unit 904, which can be used to receive input digital or character information and generate keyboard, mouse, joystick, optical, or trackball signal input related to user settings and function control.

Although not shown, the electronic device may also include a display unit, an image acquisition component, and the like, which are not described further here. Specifically, in this embodiment, the processor 901 of the electronic device loads the executable code corresponding to one or more computer programs into the memory 902 according to the following instructions, and the processor 901 executes the steps of the speech recognition method provided by this application, for example:

acquire an initial multi-task speech recognition model, which includes a first initial speech recognition model and a second initial speech recognition model;

recognize a preset speech sample sequence with the initial multi-task speech recognition model, obtaining a first speech recognition result recognized by the first initial speech recognition model and a second speech recognition result recognized by the second initial speech recognition model;

calculate a first model loss of the first speech recognition result using multiple loss functions in a first loss function set, and a second model loss of the second speech recognition result using multiple loss functions in a second loss function set;

update the initial multi-task speech recognition model using the first model loss and the second model loss, obtaining an updated multi-task speech recognition model;

perform speech recognition using the updated multi-task speech recognition model.

It should be noted that the electronic device provided by the embodiments of this application and the speech recognition method of the above embodiments share the same inventive concept; for the specific implementation process, see the related embodiments above, which are not repeated here.

This application also provides a computer-readable storage medium on which a computer program is stored. When the stored computer program is executed on the processor of the electronic device provided by the embodiments of this application, it causes the processor of the electronic device to execute the steps of the speech recognition method provided by this application. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

This application also provides a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the various optional implementations of the above speech recognition method.

The speech recognition method, apparatus, electronic device, and storage medium provided by this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of this application; the descriptions of the above embodiments are intended only to help in understanding the method of this application and its core ideas. Meanwhile, those skilled in the art may, following the ideas of this application, make changes in the specific implementations and scope of application. In summary, the content of this specification should not be understood as limiting this application.

It should be noted that when the above embodiments of this application are applied to specific products or technologies, the user's permission or consent must be obtained for any user-related data involved, and the collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.

Claims (10)

1. A speech recognition method, the method comprising:
acquiring an initial multi-task speech recognition model, wherein the initial multi-task speech recognition model comprises a first initial speech recognition model and a second initial speech recognition model;
recognizing a preset speech sample sequence with the initial multi-task speech recognition model to obtain a first speech recognition result recognized by the first initial speech recognition model and a second speech recognition result recognized by the second initial speech recognition model;
calculating a first model loss of the first speech recognition result using a plurality of loss functions in a first set of loss functions, and calculating a second model loss of the second speech recognition result using a plurality of loss functions in a second set of loss functions;
updating the initial multi-task speech recognition model using the first model loss and the second model loss to obtain an updated multi-task speech recognition model;
and performing speech recognition using the updated multi-task speech recognition model.
2. The method of claim 1, wherein the first initial speech recognition model comprises a first speech recognition module and a second speech recognition module, and wherein the second initial speech recognition model comprises a third speech recognition module and a fourth speech recognition module;
the recognizing of the preset speech sample sequence with the initial multi-task speech recognition model to obtain the first speech recognition result corresponding to the first initial speech recognition model and the second speech recognition result corresponding to the second initial speech recognition model comprises:
recognizing the preset speech sample sequence with the first speech recognition module to obtain a speech transcription text;
recognizing the preset speech sample sequence with the second speech recognition module to obtain a first translated text;
recognizing the preset speech sample sequence with the third speech recognition module to obtain a second translated text;
recognizing the speech transcription text with the fourth speech recognition module to obtain a third translated text;
wherein the first speech recognition result comprises the speech transcription text and the first translated text, and the second speech recognition result comprises the second translated text and the third translated text.
3. The method of claim 2, wherein the first set of loss functions comprises a first loss function, a second loss function, a third loss function, and a fourth loss function;
the calculating of the first model loss of the first speech recognition result using the first set of loss functions and of the second model loss of the second speech recognition result using the second set of loss functions comprises:
calculating a third model loss of the speech transcription text using the first loss function;
calculating a fourth model loss of the first translated text using the second loss function;
with the first initial speech recognition model as a teacher model and the second initial speech recognition model as a student model, calculating a fifth model loss between the first initial speech recognition model and the second initial speech recognition model using the third loss function;
calculating a sixth model loss between the speech transcription text and the first translated text using the fourth loss function;
wherein the first model loss comprises the third model loss, the fourth model loss, the fifth model loss, and the sixth model loss.
4. The method of claim 2, wherein the second set of loss functions comprises a fifth loss function, a sixth loss function, a seventh loss function, and an eighth loss function;
the calculating of the first model loss of the first speech recognition result using the first set of loss functions and of the second model loss of the second speech recognition result using the second set of loss functions further comprises:
calculating a seventh model loss of the second translated text using the fifth loss function;
calculating an eighth model loss of the third translated text using the sixth loss function;
with the first initial speech recognition model as a student model and the second initial speech recognition model as a teacher model, calculating a ninth model loss between the first initial speech recognition model and the second initial speech recognition model using the seventh loss function;
calculating a tenth model loss between the second translated text and the third translated text using the eighth loss function;
wherein the second model loss comprises the seventh model loss, the eighth model loss, the ninth model loss, and the tenth model loss.
5. The method of claim 1, wherein updating the initial multi-tasking speech recognition model using the first model loss and the second model loss results in an updated multi-tasking speech recognition model, comprising:
Calculating a first gradient parameter in the first initial speech recognition model according to the first model loss;
calculating a second gradient parameter in the second initial speech recognition model from the second model loss;
and simultaneously updating the model parameters of the initial multi-task voice recognition model according to the first gradient parameters and the second gradient parameters to obtain an updated multi-task voice recognition model.
6. The method of claim 1, wherein the performing speech recognition using the updated multi-tasking speech recognition model comprises:
judging whether the model loss of the updated multitasking voice recognition model meets the target loss standard or not;
and if the model loss of the updated multi-task voice recognition model meets the target loss standard, performing voice recognition by using the updated multi-task voice recognition model.
7. The method of claim 1, wherein the updated multi-task speech recognition model comprises a first speech recognition model and a second speech recognition model;
if the model loss of the updated multi-task speech recognition model meets the target loss standard, the performing speech recognition using the updated multi-task speech recognition model comprises:
acquiring a target speech sample to be recognized;
recognizing the target speech sample using the second speech recognition model to obtain a target speech representation sequence corresponding to the target speech sample;
determining a prediction distribution of a target speech text corresponding to the target speech representation sequence using the second speech recognition model;
and mapping the prediction distribution of the target speech text to obtain a target translated text of the target speech sample.
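The final mapping step of claim 7 can be sketched as follows. The claim only says "mapping"; greedy per-position argmax decoding is an assumption here, and the vocabulary and variable names are hypothetical.

```python
def decode_translation(pred_dist, id_to_token):
    # Map the prediction distribution of the target speech text to the
    # target translated text: greedily pick the most probable token at
    # each position, then join the tokens into a string.
    token_ids = [max(range(len(p)), key=p.__getitem__) for p in pred_dist]
    return "".join(id_to_token[i] for i in token_ids)
```

In practice a beam search or a subword detokenizer could replace this greedy mapping without changing the overall pipeline.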
8. A speech recognition apparatus, comprising:
a model acquisition module, configured to acquire an initial multi-task speech recognition model, wherein the initial multi-task speech recognition model comprises a first initial speech recognition model and a second initial speech recognition model;
a first speech recognition module, configured to recognize a preset speech sample sequence using the initial multi-task speech recognition model to obtain a first speech recognition result recognized by the first initial speech recognition model and a second speech recognition result recognized by the second initial speech recognition model;
a model loss calculation module, configured to calculate a first model loss of the first speech recognition result using a plurality of loss functions in a first set of loss functions, and calculate a second model loss of the second speech recognition result using a plurality of loss functions in a second set of loss functions;
a model updating module, configured to update the initial multi-task speech recognition model using the first model loss and the second model loss to obtain an updated multi-task speech recognition model;
and a second speech recognition module, configured to perform speech recognition using the updated multi-task speech recognition model.
9. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program in the memory to perform the steps of the speech recognition method of any one of claims 1 to 7.
10. A computer-readable storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor to perform the steps of the speech recognition method of any one of claims 1 to 7.
CN202311620137.2A 2023-11-28 2023-11-28 Speech recognition method, device, electronic equipment and storage medium Pending CN117636843A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311620137.2A CN117636843A (en) 2023-11-28 2023-11-28 Speech recognition method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311620137.2A CN117636843A (en) 2023-11-28 2023-11-28 Speech recognition method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117636843A true CN117636843A (en) 2024-03-01

Family

ID=90017797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311620137.2A Pending CN117636843A (en) 2023-11-28 2023-11-28 Speech recognition method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117636843A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120220661A (en) * 2025-05-23 2025-06-27 中航信移动科技有限公司 A method, medium and device for cabin voice recognition based on small samples


Similar Documents

Publication Publication Date Title
CN114840327B (en) Multi-modal multi-task processing method, device and system
WO2021047286A1 (en) Text processing model training method, and text processing method and apparatus
CN112084789B (en) Text processing method, device, equipment and storage medium
CN107193807B (en) Language conversion processing method, device and terminal based on artificial intelligence
CN111310441A (en) Text correction method, device, terminal and medium based on BERT (binary offset transcription) voice recognition
WO2020258502A1 (en) Text analysis method and apparatus, computer apparatus and computer storage medium
CN114694158A (en) Extraction method of structured information of bill and electronic equipment
CN110838288A (en) Voice interaction method and system and dialogue equipment
CN114861907B (en) Data computing method, device, storage medium and equipment
CN110210043A (en) Text translation method and device, electronic equipment and readable storage medium
CN114356540A (en) Parameter updating method and device, electronic equipment and storage medium
US20220358955A1 (en) Method for detecting voice, method for training, and electronic devices
CN118038870B (en) A command perception training method and device for large speech models
CN116258147A (en) A multi-modal comment sentiment analysis method and system based on heterogeneous graph convolution
CN108520300A (en) A method and device for implementing a deep learning network
WO2022095370A1 (en) Text matching method and apparatus, terminal device, and storage medium
CN117219101A (en) Speech encoder training method, device, equipment, medium and program product
CN117807317A (en) Interaction method and device based on intelligent agent
CN117636843A (en) Speech recognition method, device, electronic equipment and storage medium
CN114490922B (en) Natural language understanding model training method and device
CN117021114A (en) Robot control method and device
JP2021081713A (en) Method, device, apparatus, and media for processing voice signal
CN119961393A (en) Model training method, device, computer equipment and computer readable storage medium
CN119181102A (en) Short text generation image model training method, system, short text to image generation method, electronic device and storage medium
CN116957006A (en) Training method, device, equipment, medium and program product of prediction model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination