
CN115129866A - Training text generation method, model training method, device and electronic device - Google Patents

Training text generation method, model training method, device and electronic device

Info

Publication number
CN115129866A
CN115129866A (application CN202210535272.6A)
Authority
CN
China
Prior art keywords
text
training
guide
model
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210535272.6A
Other languages
Chinese (zh)
Inventor
王丽
宋有伟
张林箭
张聪
范长杰
胡志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN202210535272.6A
Publication of CN115129866A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3329: Natural language query formulation
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The present application discloses a training text generation method, a model training method, a text recognition method, an apparatus, an electronic device, and a computer-readable storage medium, where the training text is used to train a to-be-trained model to obtain a text recognition model. The training text generation method includes: acquiring a guide text, the guide text being consistent with the semantic attribute of a target text, the target text being the positive-example text recognized by the text recognition model; inputting the guide text into a guidance-based text generation model to obtain an output text consistent with the semantic attribute of the guide text; and determining the training text according to the output text. By automatically generating output text with the guidance-based text generation model and determining the training text from it, the present application obtains training text more quickly and efficiently.

Description

Training text generation method, model training method, device and electronic device

Technical Field

The present application relates to the field of computer technology, and in particular to a training text generation method, a model training method, an apparatus, and an electronic device.

Background

The Internet is an important tool in people's daily life and work. As the Internet becomes increasingly open, it is filled with large amounts of sensitive text unsuitable for display to users, such as users' online messages, users' conversations in chat software, and replies generated by dialogue robots. To maintain a healthy chat environment, such sensitive text needs to be identified and filtered out in advance.

In the related art, a text recognition model can be used to identify sensitive text. However, such a model must first be trained on a large number of sensitive texts with diverse expressions. Because it is difficult to collect large volumes of sensitive text from the Internet, and manually writing sensitive text is inefficient and limited in quantity, how to obtain training text quickly and efficiently for training a text recognition model is a problem to be solved.

Summary of the Invention

The present application provides a training text generation method, a model training method, a text recognition method, an apparatus, an electronic device, and a computer-readable storage medium, which can acquire training text more quickly and efficiently so as to facilitate the training of a text recognition model. The specific solutions are as follows.

In a first aspect, the present application provides a training text generation method, where the training text is used to train a to-be-trained model to obtain a text recognition model. The method includes:

acquiring a guide text, the guide text being consistent with the semantic attribute of a target text, the target text being the positive-example text recognized by the text recognition model;

inputting the guide text into a guidance-based text generation model to obtain an output text consistent with the semantic attribute of the guide text; and

determining the training text according to the output text.

Optionally, before the inputting of the guide text into the guidance-based text generation model, the method further includes:

acquiring a question text;

the inputting of the guide text into the guidance-based text generation model to obtain the output text consistent with the semantic attribute of the guide text includes:

inputting the question text and the guide text into a guidance-based dialogue generation model to obtain an output text that replies to the question text and is consistent with the semantic attribute of the guide text.

Optionally, the output text includes multiple pieces;

the determining of the training text according to the output text includes:

determining the training text from the multiple pieces of output text.

Optionally, the determining of the training text from the multiple pieces of output text includes:

determining the training text by a first strategy, the first strategy including: selecting, from the multiple pieces of output text, a text containing at least one preset keyword as the training text, the preset keyword being consistent with the semantic attribute of the target text;

or determining the training text by a second strategy, the second strategy including: selecting the first of the multiple pieces of output text, or randomly selecting one piece, as the training text.

Optionally, the probability of selecting the first strategy to determine the training text is a first preset probability, and the probability of selecting the second strategy to determine the training text is a second preset probability, where the first preset probability is greater than the second preset probability and the sum of the two is 1.

Optionally, the first preset probability may range from 0.7 to 0.9, and the second preset probability may range from 0.1 to 0.3.

Optionally, the guide text includes at least one guide word, each guide word being consistent with the semantic attribute of the target text;

the preset keywords include each of the guide words.

Optionally, the preset keywords further include first target words, a first target word being a word contained in any piece of the output text that is consistent with the semantic attribute of the target text and differs from all of the guide words.

Optionally, the first strategy further includes: when none of the multiple pieces of output text contains any of the preset keywords, selecting the first of the multiple pieces of output text to determine the training text.

Optionally, the semantic attribute of the positive-example text is semantically sensitive text, the semantic attribute of the target text is semantically sensitive text, and the text recognition model is used to recognize text generated by a dialogue generation model.

In a second aspect, embodiments of the present application further provide a training method for a text recognition model, including:

acquiring training samples, the training samples including positive-example samples and negative-example samples, the text corresponding to the positive-example samples including training text generated by the training text generation method of any implementation of the first aspect; and

training a to-be-trained model with the training samples to obtain a text recognition model.

Optionally, the training method further includes:

acquiring a first text, the first text being a text misrecognized by the text recognition model, where the actual semantic attribute of the misrecognized text differs from the semantic attribute that the text recognition model identified for it;

annotating the first text to obtain a first sample; and

performing optimization training on the text recognition model using the first sample.

Optionally, before the performing of the optimization training on the text recognition model using the first sample, the training method further includes:

acquiring a second text, the second text containing a second target word and expressing a semantic attribute opposite to that of the first text, the second target word being a word contained in the first text that is consistent with the semantic attribute expressed by the target text; and

annotating the second text to obtain a second sample, the annotation information of the second sample being opposite to that of the first sample;

the performing of the optimization training on the text recognition model using the first sample includes:

performing optimization training on the text recognition model using the first sample and the second sample.

Optionally, the training samples include reply samples and question-answer concatenation samples;

the text corresponding to the positive-example samples among the reply samples includes training text determined, in the manner of the first aspect, by inputting the question text and the guide text into the guidance-based dialogue generation model; and

the text corresponding to the question-answer concatenation samples is concatenated text, the concatenated text including text formed by concatenating a question text with the reply text corresponding to that question text.

In a third aspect, embodiments of the present application further provide a text recognition method, including:

acquiring a to-be-recognized text; and

inputting the to-be-recognized text into a text recognition model to obtain a recognition result for the to-be-recognized text, where the text recognition model is trained by the training method of any implementation of the second aspect.

Optionally, the to-be-recognized text is text generated by a dialogue generation model;

or the to-be-recognized text is text formed by concatenating a user's question text with text generated by the dialogue generation model, where the text recognition model is trained by the model training method of the second aspect in the case where the training samples include reply samples and question-answer concatenation samples.

In a fourth aspect, the present application further provides a training text generation apparatus, where the training text is used to train a to-be-trained model to obtain a text recognition model. The apparatus includes:

an information acquisition unit, configured to acquire a guide text, the guide text being consistent with the semantic attribute of a target text, the target text being the positive-example text recognized by the text recognition model;

a text generation unit, configured to input the guide text into a guidance-based text generation model to obtain an output text consistent with the semantic attribute of the guide text; and

a text determination unit, configured to determine the training text according to the output text.

Optionally, the apparatus further includes:

a first text acquisition unit, configured to acquire a question text;

the text generation unit being specifically configured to input the question text and the guide text into a guidance-based dialogue generation model to obtain an output text that replies to the question text and is consistent with the semantic attribute of the guide text.

Optionally, the output text includes multiple pieces;

the text determination unit being specifically configured to determine the training text from the multiple pieces of output text.

Optionally, the text determination unit is specifically configured to:

determine the training text by a first strategy, the first strategy including: selecting, from the multiple pieces of output text, a text containing at least one preset keyword as the training text, the preset keyword being consistent with the semantic attribute of the target text;

or determine the training text by a second strategy, the second strategy including: selecting the first of the multiple pieces of output text, or randomly selecting one piece, as the training text.

Optionally, the probability of selecting the first strategy to determine the training text is a first preset probability, and the probability of selecting the second strategy to determine the training text is a second preset probability, where the first preset probability is greater than the second preset probability and the sum of the two is 1.

Optionally, the first preset probability may range from 0.7 to 0.9, and the second preset probability may range from 0.1 to 0.3.

Optionally, the guide text includes at least one guide word, each guide word being consistent with the semantic attribute of the target text;

the preset keywords include each of the guide words.

Optionally, the preset keywords further include first target words, a first target word being a word contained in any piece of the output text that is consistent with the semantic attribute of the target text and differs from all of the guide words.

Optionally, the first strategy further includes: when none of the multiple pieces of output text contains any of the preset keywords, selecting the first of the multiple pieces of output text to determine the training text.

Optionally, the semantic attribute of the positive-example text is semantically sensitive text, the semantic attribute of the target text is semantically sensitive text, and the text recognition model is used to recognize text generated by a dialogue generation model.

In a fifth aspect, embodiments of the present application further provide a training apparatus for a text recognition model, including:

a sample acquisition unit, configured to acquire training samples, the training samples including positive-example samples and negative-example samples, the text corresponding to the positive-example samples including training text generated by the training text generation apparatus of any implementation of the fourth aspect; and

a model training unit, configured to train a to-be-trained model with the training samples to obtain a text recognition model.

Optionally, the training apparatus further includes:

a second text acquisition unit, configured to acquire a first text, the first text being a text misrecognized by the text recognition model, where the actual semantic attribute of the misrecognized text differs from the semantic attribute that the text recognition model identified for it;

a sample annotation unit, configured to annotate the first text to obtain a first sample; and

a model optimization unit, configured to perform optimization training on the text recognition model using the first sample.

Optionally, the second text acquisition unit is further configured to:

acquire a second text, the second text containing a second target word and expressing a semantic attribute opposite to that of the first text, the second target word being a word contained in the first text that is consistent with the semantic attribute expressed by the target text;

the sample annotation unit being further configured to annotate the second text to obtain a second sample, the annotation information of the second sample being opposite to that of the first sample; and

the model optimization unit being specifically configured to perform optimization training on the text recognition model using the first sample and the second sample.

Optionally, the training samples include reply samples and question-answer concatenation samples;

the text corresponding to the positive-example samples among the reply samples includes training text determined, in the manner of the first aspect, by inputting the question text and the guide text into the guidance-based dialogue generation model; and

the text corresponding to the question-answer concatenation samples is concatenated text, the concatenated text including text formed by concatenating a question text with the reply text corresponding to that question text.

In a sixth aspect, embodiments of the present application further provide a text recognition apparatus, including:

a third text acquisition unit, configured to acquire a to-be-recognized text; and

a text recognition unit, configured to input the to-be-recognized text into a text recognition model to obtain a recognition result for the to-be-recognized text, where the text recognition model is trained by the training apparatus of any implementation of the fifth aspect.

Optionally, the to-be-recognized text is text generated by a dialogue generation model;

or the to-be-recognized text is text formed by concatenating a user's question text with text generated by the dialogue generation model, where the text recognition model is trained by the training method of the second aspect in the case where the training samples include reply samples and question-answer concatenation samples.

In a seventh aspect, embodiments of the present application further provide an electronic device, including:

a processor; and

a memory configured to store a data processing program, where, after the electronic device is powered on and runs the program through the processor, the method of any implementation of the first aspect is executed.

In an eighth aspect, embodiments of the present application further provide an electronic device, including:

a processor; and

a memory configured to store a data processing program, where, after the electronic device is powered on and runs the program through the processor, the method of any implementation of the second aspect is executed.

In a ninth aspect, embodiments of the present application further provide an electronic device, including:

a processor; and

a memory configured to store a data processing program, where, after the electronic device is powered on and runs the program through the processor, the method of any implementation of the third aspect is executed.

In a tenth aspect, embodiments of the present application further provide a computer-readable storage medium storing a data processing program, the program being run by a processor to execute the method of any implementation of the first aspect.

In an eleventh aspect, embodiments of the present application further provide a computer-readable storage medium storing a data processing program, the program being run by a processor to execute the method of any implementation of the second aspect.

In a twelfth aspect, embodiments of the present application further provide a computer-readable storage medium storing a data processing program, the program being run by a processor to execute the method of any implementation of the third aspect.

Compared with the prior art, the present application has the following advantages.

In the training text generation method provided by the present application, after the guide text is input into the guidance-based text generation model, an output text consistent with the semantic attribute of the guide text can be obtained. Because the guide text is consistent with the semantic attribute of the target text, and the target text is the positive-example text recognized by the text recognition model, the obtained output text is also consistent with the semantic attribute of the positive-example text that the text recognition model is intended to recognize. In this way, the training text determined from the output text can serve as positive-example sample text for training the to-be-trained model.

By automatically generating output text with the guidance-based text generation model and determining the training text from it, the present application obtains training text more quickly and efficiently. Moreover, because the text generation model can generate rich and diverse output texts, the training text determined from them is also more diverse, which improves the recognition accuracy of the trained text recognition model and enables it to recognize positive-example text more accurately.

Brief Description of the Drawings

FIG. 1 is a flowchart of a training text generation method provided by an embodiment of the present application;

FIG. 2 is a flowchart of another example of the training text generation method provided by an embodiment of the present application;

FIG. 3 is a flowchart of a text recognition model training method provided by an embodiment of the present application;

FIG. 4 is a flowchart of another example of the text recognition model training method provided by an embodiment of the present application;

FIG. 5 is a unit block diagram of a training text generation apparatus provided by an embodiment of the present application;

FIG. 6 is a schematic structural diagram of an electronic device for implementing the training text generation method provided by an embodiment of the present application.

Detailed Description

Numerous specific details are set forth in the following description to facilitate a thorough understanding of the present application. The present application can, however, be implemented in many ways other than those described herein, and those skilled in the art can make similar extensions without departing from its essence; the present application is therefore not limited to the specific implementations disclosed below.

Intelligent chat technology can automatically reply to questions raised by users, and is widely applied in intelligent customer service in fields such as e-commerce and public services, chatbots, in-game small talk, and other scenarios.

For a user's question, an intelligent chat device can retrieve the corresponding reply content from a pre-stored question-and-answer database. Because the questions stored in such a database are limited, this approach cannot produce a reply for questions that are not stored, which prevents smooth question-and-answer exchanges with users; the replies are also rather monotonous, resulting in a poor user experience.

With the development of deep learning, intelligent chat scenarios have gradually begun to use dialogue generation models to reply to users' questions, which greatly improves the diversity of reply content and supports coherent multi-turn chat, providing a better user experience.

However, a dialogue generation model is usually trained on massive sample data, which inevitably contains some sensitive text, for example text involving abuse or violence. The dialogue generation model therefore learns these sensitive expressions and may generate sensitive text in its replies to users. To maintain a healthy chat environment, the generated sensitive text needs to be identified and filtered out in advance.

In the related art, sensitive text can be filtered on a character or word basis. For example, texts containing sensitive words such as 做 ("do"), 你娘 ("your mother"), or 你生的 are filtered out directly; a reply such as 我想和你做 ("I want to do it with you") from an intelligent dialogue device would thus be filtered out. Because most texts containing the character 做 are not sensitive, such as 做饭 ("cooking"), 做家务 ("doing housework"), or 做运动 ("doing sports"), a whitelist is also set up, and texts matching the whitelist are not filtered. For example, if 做家务 is placed on the whitelist, the device's reply 今天我在家做家务 ("I am doing housework at home today") will not be filtered out.
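The character/word filtering just described amounts to a simple substring check. A minimal Python sketch, using only the example words from the paragraph above (these lists are illustrative, not those of any production system), might look like this:

```python
# Illustrative sketch of character/word-based filtering with a whitelist.
BLOCKLIST = ["做", "你娘", "你生的"]      # sensitive words that trigger filtering
WHITELIST = ["做饭", "做家务", "做运动"]  # benign phrases exempt from filtering

def should_filter(reply: str) -> bool:
    """Return True if the reply should be filtered out."""
    # A whitelisted phrase exempts the whole reply.
    if any(phrase in reply for phrase in WHITELIST):
        return False
    # Otherwise any blocklisted word triggers filtering.
    return any(word in reply for word in BLOCKLIST)

print(should_filter("我想和你做"))       # True: contains the blocked word 做
print(should_filter("今天我在家做家务"))  # False: 做家务 is whitelisted
```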

However, because the number of sensitive words is enormous while any enumerated list is limited, many sensitive texts are missed, so much sensitive output text cannot be filtered. Furthermore, since the whitelist can likewise never be exhaustive, many normal texts are filtered out, degrading the quality of the dialogue. In addition, this approach can only filter out text in which sensitive words explicitly appear; it cannot filter out semantically sensitive text that contains no sensitive words, such as 我想要你 ("I want you").

In the related art, sensitive text can also be filtered with regular expressions. For example, when the persona of an intelligent dialogue device is a child, any reply in which the device expresses that it wants to have, or has, children is sensitive text. In this case, the regular expressions can be 生.*个.*孩子 ("gave birth to ... children") or 有.*个.*孩子 ("has ... children"), where "." matches any character except a line break and "*" matches zero or more occurrences of the preceding character. Replies such as 猫妈妈生了2个孩子 ("the mother cat gave birth to 2 kittens") or 你生了1个孩子 ("you gave birth to 1 child") would both be filtered out because they match the regular expression 生.*个.*孩子.
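A minimal sketch of such regex filtering, using the two patterns from the example above (in Python's `re` module, "." likewise does not match line breaks by default):

```python
# Illustrative sketch of regular-expression filtering for a child persona.
import re

SENSITIVE_PATTERNS = [
    re.compile(r"生.*个.*孩子"),  # "gave birth to ... children"
    re.compile(r"有.*个.*孩子"),  # "has ... children"
]

def should_filter(reply: str) -> bool:
    """Return True if the reply matches any sensitive pattern."""
    return any(p.search(reply) for p in SENSITIVE_PATTERNS)

print(should_filter("猫妈妈生了2个孩子"))  # True, although the text is harmless
print(should_filter("你生了1个孩子"))      # True
```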

However, because it is impossible to exhaust all sensitive regular expressions, many sensitive texts cannot be recognized. The approach also wrongly blocks much normal text; for example, the above 猫妈妈生了2个孩子 would be blocked by mistake. Moreover, regular expressions can only filter out text that matches them; they cannot filter out implicitly sensitive text that does not match any expression.

To improve the recognition accuracy for sensitive text, a recognition model, namely a deep model, can be used. However, the recognition model must first be trained on a large number of differently expressed sensitive texts as samples. Because it is difficult to collect large volumes of sensitive text from the Internet, and the quantity and diversity of manually written sensitive text are limited, how to efficiently obtain large amounts of training text (for example, sensitive text) is a problem to be solved.

To obtain large amounts of training text more quickly and efficiently, the present application provides a training text generation method, a text recognition model training method, a text recognition method, and apparatuses, electronic devices, and computer-readable storage media corresponding to these methods. Embodiments are provided below to describe the above methods, apparatuses, electronic devices, and computer-readable storage media in detail.

A first embodiment of the present application provides a training text generation method, where the training text is used to train a to-be-trained model to obtain a text recognition model. In this embodiment, the method is executed by an electronic device, which may be any device with data processing capability, such as a desktop computer, laptop, tablet, server, or mobile phone.

The to-be-trained model may include at least one of a BERT model, a convolutional neural network, a logistic regression model, a K-nearest neighbor (KNN) model, or a binary classification model, or may be any other deep learning model.

The text recognition model can determine whether a to-be-recognized text is positive-example text or negative-example text, and can therefore also be understood as a text classification model. It may be used to recognize Chinese text, as well as foreign-language text in English, French, German, and so on.

Positive-example text refers to the text that the text recognition model needs to recognize. For example, if the model is used to recognize text containing place names, text containing place names is positive-example text and text without place names is negative-example text; if the model is used to recognize semantically sensitive text, semantically sensitive text is positive-example text and non-sensitive text is negative-example text.
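As a purely illustrative sketch, not part of the claimed method, a binary text recognition model of this kind could be instantiated on a pre-trained encoder; the checkpoint name and the label convention (1 for positive-example text) are assumptions:

```python
# Minimal sketch of a binary text recognition (classification) model built
# on a pre-trained encoder. Checkpoint and label convention are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2
)

def recognize(text: str) -> int:
    """Classify a text: 1 for positive-example text, 0 for negative."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item())
```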

As shown in FIG. 1, the training text generation method provided by this embodiment of the present application includes the following steps S110 to S130.

Step S110: acquire a guide text.

The guide text is consistent with the semantic attribute of a target text, the target text being the positive-example text recognized by the text recognition model.

The guide text may be one or more paragraphs, one or more sentences, or one or more words, where a word may be a single-character word, a two-character word, or a multi-character word; the guide text may also take other forms. In this embodiment, the guide text may be composed manually and input into the electronic device, which then acquires it; alternatively, the electronic device may determine the guide text automatically from the target text, for example by identifying the semantic attribute of the target text and determining a guide text consistent with that attribute.

The guide text may be Chinese text or Chinese words, or foreign-language text or words in English, German, and so on.

The semantic attribute of the target text can be understood as the category to which the semantics of the target text belong. Specifically, it may be at least one of semantically sensitive text, confidential text, popular-science text, academic text, or medical-knowledge text, or some other specific semantic attribute. Those skilled in the art can determine the semantic attribute of the target text according to the semantic attribute of the positive-example text that the text recognition model needs to recognize; the present application does not limit the specific semantic attribute of the target text.

For example, if the text recognition model is used to recognize semantically sensitive text, the semantic attribute of the guide text may be sensitive text; if the trained text recognition model is used to recognize confidential text, the semantic attribute of the guide text may be confidential text.

Optionally, the above sensitive text may include violent text, pornographic text, verbally abusive text, or other unhealthy text.

For example, if the positive-example text recognized by the text recognition model is text with sensitive attributes such as violence, pornography, or abuse, the guide text may include one or more single-character or multi-character words with such sensitive semantic attributes.

Each single-character or multi-character word contained in the guide text may be called a guide word; that is, the guide text contains one or more guide words, each consistent with the semantic attribute of the target text. Because guide words are short, their semantics are easier to capture; when the guide text consists of guide words, it is easier for the guidance-based text generation model to generate text consistent with the semantic attribute of the guide words.

The number of guide words in the guide text may range from 5 to 15, for example 5, 8, 10, 12, or 15. The number should be neither too large nor too small: too many guide words make the computation of the guidance-based text generation model overly complex, which reduces the output efficiency and may even cause computation errors that prevent any output; too few cause the semantic attribute of the output text to diverge substantially from that of the target text.

The guide words may serve as prefix text. Prefix text refers to words that remain unchanged each time the training text generation method is executed; the semantics of prefix text are more stable and more strongly guiding. That is, when generating training samples of the same semantic attribute, the prefix text stays the same across runs of the method provided herein, so that every training text obtained is consistent with the semantic attribute of the prefix text.

Step S120: input the guide text into a guidance-based text generation model to obtain an output text consistent with the semantic attribute of the guide text.

The guidance-based text generation model is a pre-trained model. In this embodiment, a deep model such as an ELMo, OpenAI GPT, BERT, or OpenAI GPT-2 neural network model can be trained on guide samples and the text samples corresponding to them to obtain the guidance-based text generation model. The guide samples and text samples can be collected and annotated from novels, scripts, articles in magazines or journals, and similar materials; those skilled in the art can train such a model with conventional model training methods, which are not detailed here.
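For illustration only, step S120 might be realized with an off-the-shelf causal language model as sketched below; the checkpoint and decoding parameters are assumptions, not the model actually trained in this application:

```python
# Illustrative sketch of step S120: sample several candidate output texts
# from a causal language model, with the guide text as a prefix.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

def generate_outputs(guide_text: str, n: int = 5) -> list:
    """Generate n candidate texts conditioned on the guide text."""
    inputs = tokenizer(guide_text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=40,
        do_sample=True,              # sampling yields diverse candidates
        top_p=0.9,
        num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```

Sampling (rather than greedy decoding) is what lets the model produce the multiple, diverse candidate outputs that the selection strategies below operate on.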

Step S130: determine the training text according to the output text.

In step S130, the output text may be determined directly as the training text; alternatively, sentence expansion may be performed on the output text to obtain expanded text consistent with the semantic attribute of the output text, and both the output text and the expanded text may be determined as training samples; or the training text may be determined from the output text in other ways.

In this embodiment, because the guide text is consistent with the semantic attribute of the positive-example text recognized by the text recognition model, when the obtained output text is consistent with the semantic attribute of the guide text, the sample text determined from the output text is likewise consistent with the semantic attribute of the positive-example text that the model is intended to recognize.

For example, when the semantic attribute of the guide text is abusive text, the semantic attribute of the output text is also abusive text, and so is that of the training text determined from the output text. The training text can therefore serve as text corresponding to positive-example samples for training the to-be-trained model, so that the trained text recognition model can recognize abusive text, abusive text being the positive-example text here.

By automatically generating output text with the guidance-based text generation model and determining the training text from it, the present application obtains training text more quickly and efficiently. Moreover, because the text generation model can generate rich and diverse output texts, the training text determined from them is also more diverse, which improves the recognition accuracy of the trained text recognition model and enables it to recognize positive-example text more accurately.

In one implementation, as shown in FIG. 2, the following step S140 may also be included before step S120.

Step S140: acquire a question text.

Step S120 may then be implemented as the following step S121.

Step S121: input the question text and the guide text into a guidance-based dialogue generation model to obtain an output text that replies to the question text and is consistent with the semantic attribute of the guide text.

The question text can be obtained from novel excerpts, scripts, social media chat logs, and the like; it plays the role of a user's utterance. The question text may be an interrogative text such as 请问几点了 ("What time is it?") or 公司地址在哪 ("Where is the company located?"), or small-talk text such as 祝你开心 ("Wish you happiness"), 我们是好朋友 ("We are good friends"), or 天气不错 ("Nice weather"). It may be a single sentence or multiple sentences.

The question text may be text that the user inputs into the electronic device, or text that the electronic device selects from a stored text library.

In this implementation, the text generation model of step S120 is the dialogue generation model of step S121.

The guidance-based dialogue generation model can be obtained by training a deep model on question samples, guide samples, and the reply samples corresponding to the guide and question samples. The question samples, guide samples, and reply samples can be collected and annotated from novels, scripts, social media chat data, and similar materials; those skilled in the art can train such a guidance-based dialogue model with conventional model training methods, which are not detailed here.
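How the question text and guide words are combined into a single model input is model-specific; the sketch below assumes a simple separator-based format, an illustrative convention rather than the format used by the application:

```python
# Illustrative sketch of assembling the input for step S121 from a question
# text and guide words. The "[SEP]" separator is an assumed convention; a
# real model would use whatever format it was trained with.
def build_dialogue_input(question: str, guide_words: list) -> str:
    """Concatenate the guide words (as a fixed prefix) with the question."""
    prefix = " ".join(guide_words)        # guide words act as prefix text
    return prefix + " [SEP] " + question  # fed to the dialogue generation model

# Hypothetical example: sensitive guide words plus a small-talk question.
print(build_dialogue_input("天气不错", ["揍", "暴打"]))
```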

In this implementation, the output text is generated by the guidance-based dialogue generation model. Because the output text is a reply to the question text, it is more consistent with the content that an intelligent chat device would automatically reply. Thus, once the training text is determined from the output text, the text recognition model trained on that training text is better suited to recognizing chat messages automatically generated by intelligent chat devices, and in particular to recognizing reply text that such devices generate through dialogue generation models. In this way, when an intelligent chat device generates unhealthy sensitive text such as abuse or violence through its dialogue generation model, the text recognition model can identify the generated sensitive text more accurately.

Optionally, the guidance-based dialogue generation model may produce a single piece of output text; in that case, this piece of output text may be determined as the training sample.

In one implementation, the output text may include multiple pieces, and step S130 may be implemented as the following step S131: determine the training text from the multiple pieces of output text.

That the output text includes multiple pieces means that multiple pieces of output text are obtained through the guidance-based text generation model.

Optionally, as shown in FIG. 2, the training text may be determined from the multiple pieces of output text by the following step S131a.

Step S131a: from the multiple pieces of output text, select a text containing at least one preset keyword as the training text.

In this embodiment, the manner of determining the training text in step S131a is referred to as the first strategy.

The semantic attribute of the preset keywords is consistent with that of the target text. The preset keywords may be words input by the user, and there may be one or more of them. For example, when the positive-example text recognized by the text recognition model is semantically sensitive text, the preset keywords may include semantically sensitive words such as 呻吟 ("moan"), 揍 ("beat"), 交, or 暴打 ("beat up").

When more than one of the multiple output texts contains a preset keyword, every output text containing a preset keyword may be determined as training text; alternatively, the first such output text, or the output text containing the largest number of preset keywords, may be determined as the training text.

In this embodiment, because the preset keywords are consistent with the semantic attribute of the positive-example text recognized by the text recognition model, an output text containing a preset keyword is more likely to be consistent with that semantic attribute. Using output texts containing preset keywords as training text therefore enables the trained text recognition model to recognize positive-example text with higher accuracy.
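A minimal sketch of the first strategy, using the "most preset keywords" tie-breaking option mentioned above, with the fallback to the first piece described in the summary; the function and variable names are illustrative:

```python
# Illustrative sketch of the first strategy (step S131a): among multiple
# candidate output texts, keep one that contains preset keywords. The
# candidate with the most keyword hits wins; if none matches, fall back
# to the first piece, as described in the summary.
def first_strategy(outputs, keywords):
    """Return the output text with the most preset-keyword hits."""
    best_count, best_text = -1, None
    for text in outputs:
        count = sum(1 for kw in keywords if kw in text)
        if count > best_count:
            best_count, best_text = count, text
    return best_text if best_count > 0 else outputs[0]
```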

Optionally, as shown in FIG. 2, the training text may also be determined from the multiple pieces of output text by the following step S131b or step S131c.

Step S131b: select the first of the multiple pieces of output text as the training text.

In this embodiment, the manner of determining the training text in step S131b is referred to as the second strategy.

Because the first of the multiple output texts usually matches the guide text best, selecting the first output text as training text for the to-be-trained model enables the trained text recognition model to recognize positive-example text with higher accuracy.

Step S131c: randomly select one of the multiple pieces of output text as the training text.

In this implementation, every piece of output text may also be determined as training text; alternatively, the output texts may be displayed so that the user can choose among them, with the chosen text determined as the training text. The present application does not specifically limit the manner of determining the training text from multiple pieces of output text.

In this implementation, because the guidance-based text generation model yields multiple pieces of output text, an output text whose semantic attribute is more consistent with that of the target text can be flexibly chosen as training text. The determined training text is thus more consistent with the semantic attribute of the target text, so that the text recognition model obtained from the training text can recognize positive-example text more accurately.

In one implementation, as shown in FIG. 2, the probability of selecting the first strategy to determine the training text is a first preset probability and the probability of selecting the second strategy is a second preset probability, where the first preset probability is greater than the second and the two probabilities sum to 1.

That is, in this implementation the first strategy is selected with the first preset probability, and the second strategy with the second preset probability.

With the first preset probability greater than the second, the two probabilities may be 0.8 and 0.2, 0.9 and 0.1, or 0.6 and 0.4, respectively; it suffices that the first is greater than the second and that they sum to 1. The present application does not limit the specific values of the two probabilities.

在一个具体实施例中,第一预设概率的范围可以为0.7~0.9,第二预设概率的范围可以为0.1~0.3。例如,第一预设概率、第二预设概率分别为0.7、0.3,第一预设概率、第二预设概率分别为0.8、0.2,第一预设概率、第二预设概率分别为0.9、0.1。也就是说,第一预设概率相对于第二预设概率之间的差距比较大。通常来说,包含预设关键词的输出文本与目标文本语义属性一致的概率比不包含预设关键词但与目标文本的语义属性一致的概率大的比较多,因此,第一预设概率比第二预设概率大的程度比较多,这样,既可以更大程度地使得到的训练文本与目标文本的语义属性一致,也可以使得到的训练文本的多样性更好,从而使通过训练文本训练出的文本识别模型能够更准确地识别出更多样的正例文本。In a specific embodiment, the range of the first preset probability may be 0.7-0.9, and the range of the second preset probability may be 0.1-0.3. For example, the first preset probability and the second preset probability are respectively 0.7 and 0.3, the first preset probability and the second preset probability are respectively 0.8 and 0.2, and the first preset probability and the second preset probability are respectively 0.9 , 0.1. That is to say, the difference between the first preset probability and the second preset probability is relatively large. Generally speaking, the probability that the output text containing the preset keywords is consistent with the semantic attributes of the target text is much greater than the probability that the output text does not contain the preset keywords but is consistent with the semantic attributes of the target text. Therefore, the first preset probability ratio The second preset probability is relatively large, so that the obtained training text can be more consistent with the semantic attributes of the target text, and the diversity of the obtained training text can be better, so that the training text can be passed through the training text. The trained text recognition model can more accurately identify more diverse positive texts.

由于预设关键词数量有限,各预设关键词很难包含所有与目标文本的语义属性相一致的词,因此,即使输出文本不包含任何的预设关键词,该输出文本也有可能与目标文本的语义属性一致。由于第一条输出文本与目标文本语义属性相一致的概率是比较大的,因此,无论第一条输出文本是否包含预设关键词,该第一条输出文本也有可能与目标文本的语义属性相一致,本实施例以较小的第二预设概率选择第一条输出文本作为训练文本,可以增加不包含预设关键词、但与目标文本的语义属性相一致的训练文本,使得训练文本的多样性更好,从而使得训练后的文本识别模型可以识别出更多样的正例文本。Due to the limited number of preset keywords, it is difficult for each preset keyword to contain all words that are consistent with the semantic attributes of the target text. Therefore, even if the output text does not contain any preset keywords, the output text may be consistent with the target text. have the same semantic properties. Since the probability that the first output text is consistent with the semantic attributes of the target text is relatively large, no matter whether the first output text contains preset keywords or not, the first output text may also be consistent with the semantic attributes of the target text. Consistent, in this embodiment, the first output text is selected as the training text with a smaller second preset probability, and training texts that do not contain preset keywords but are consistent with the semantic attributes of the target text can be added, so that the The diversity is better, so that the trained text recognition model can recognize more diverse positive texts.

当输出文本包含预设关键词时,说明该输出文本与目标文本的语义属性相一致的概率很大,因此,以较大的第一预设概率将包含至少一个预设关键词的输出文本确定为训练文本,可以使得确定出的训练文本与目标文本的语义属性更一致,从而使得通过训练文本训练得到的文本识别模型能够更准确地识别出正例文本。When the output text contains a preset keyword, it means that the probability that the output text is consistent with the semantic attributes of the target text is very high. Therefore, the output text containing at least one preset keyword is determined with a larger first preset probability. In order to train the text, the determined training text can be made more consistent with the semantic attributes of the target text, so that the text recognition model obtained by training the training text can more accurately identify the positive example text.

The preset keywords may include each of the guide words. Since the guide words contained in the guide text are consistent with the semantics of the target text, directly determining each guide word as a preset keyword yields the preset keywords quickly.

In a specific embodiment, the preset keywords may further include first target words. A first target word is a word contained in any one of the output texts that is consistent with the semantic attribute of the target text and differs from every guide word.

In this embodiment, the plurality of output texts may be displayed so that the user can view each of them. After viewing the output texts, the user can find in them words that are consistent with the semantic attribute of the target text and differ from every guide word, and input the found words into the electronic device, which thereby obtains them.

Since the output texts are generated based on the guide text, they are very likely to be consistent with the semantic attribute of the target text, and therefore very likely to contain words consistent with that attribute. Words that are consistent with the semantic attribute of the target text yet differ from every guide word can thus very likely be picked out of the output texts.

Because the number of guide words is limited, deciding whether an output text becomes a training text solely by whether it contains a guide word may filter out texts that contain other words semantically consistent with the target text. Determining the first target words as preset keywords supplements the preset keywords and makes them richer and more varied, so that output texts consistent with the target text are less likely to be missed and the training samples become more diverse.

In a specific embodiment, the following step S131d may further be performed after step S131a.

Step S131d: when none of the plurality of output texts contains any preset keyword, determine the first of the plurality of output texts as the training sample.

In this embodiment of the present application, the manners of determining the training text in step S131a and step S131d may jointly be taken as the first strategy.

This embodiment can increase the number of obtained texts that are consistent with the semantic attribute of the target text.
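As one way to read the selection logic described above, the following minimal Python sketch combines the keyword filter with its fallback (first strategy) and the first-or-random choice (second strategy), plus the probabilistic choice between the two strategies. The function names, the 0.8/0.2 split, and the 50/50 choice inside the second strategy are illustrative assumptions rather than details fixed by the application.

```python
import random

def first_strategy(output_texts, preset_keywords):
    # Keep output texts containing at least one preset keyword.
    hits = [t for t in output_texts
            if any(kw in t for kw in preset_keywords)]
    if hits:
        return hits
    # Fallback: when no output text contains any preset keyword,
    # take the first output text.
    return [output_texts[0]]

def second_strategy(output_texts):
    # Take the first output text, or a uniformly random one
    # (an illustrative 50/50 split between the two options).
    if random.random() < 0.5:
        return [output_texts[0]]
    return [random.choice(output_texts)]

def select_training_texts(output_texts, preset_keywords, p_first=0.8):
    # Choose the first strategy with the first preset probability
    # (0.8 here) and the second strategy with the remaining 0.2.
    if random.random() < p_first:
        return first_strategy(output_texts, preset_keywords)
    return second_strategy(output_texts)
```

Calling select_training_texts(outputs, guide_words) would then yield keyword-matched texts most of the time, with the occasional first or random text preserving diversity.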

In this embodiment of the present application, the user may perform steps S110 to S130 multiple times. Each time, the guide text is consistent with the semantic attribute of the positive example text recognized by the text recognition model, but the guide texts used in different rounds are not identical; for example, they may differ partially or entirely. By executing the training text generation method multiple times, more distinct training texts can thus be obtained. A text recognition model trained on these training texts can be used to recognize abusive, pornographic, and other sensitive texts. It can be seen that the method provided by the present application can generate diverse training texts.

As shown in FIG. 3, a second embodiment of the present application provides a training method for a text recognition model, the method including the following steps S510 to S520.

Step S510: obtain training samples.

The training samples include positive example samples and negative example samples, and the texts corresponding to the positive example samples include: training texts generated by the training text generation method of any implementation of the first embodiment.

In this embodiment of the present application, the training texts generated by the training text generation method of any implementation of the first embodiment may be labeled as positive example samples, thereby obtaining the positive example samples.

The texts corresponding to the negative example samples may be obtained from novels, magazines, web pages, and other sources. The semantic attribute of the texts corresponding to the negative example samples is opposite to that of the texts corresponding to the positive example samples. For example, if the texts corresponding to the positive example samples are semantically sensitive, the texts corresponding to the negative example samples are semantically insensitive; if the texts corresponding to the positive example samples contain person names, the texts corresponding to the negative example samples do not.

When the text recognition model is used to identify whether a text generated by a dialogue generation model is sensitive, the texts corresponding to the positive example samples may further include: third texts determined from the sensitive texts filtered out of historical chat information. For example, an intelligent chat system may have been running for some time, during which part of the text traffic was filtered out by regular-expression-based filtering, character- or word-based filtering, and similar means. These filtered-out texts are very likely to be sensitive, so a large number of sensitive texts can be obtained from them quickly.

Specifically, sensitive third texts may be selected manually from the filtered-out sensitive texts; the electronic device determines the manually selected third texts as texts corresponding to positive example samples and labels them as positive example samples.

The texts corresponding to the positive example samples may further include: fourth texts determined from the texts in the historical chat information that were not filtered out. Texts that were not filtered out may also contain sensitive content, so sensitive fourth texts may be selected manually from them; the electronic device determines the manually selected fourth texts as texts corresponding to positive example samples and labels them as positive example samples.

A large number of positive example samples can be obtained quickly in the above manner.

When the text recognition model is used to identify whether a text generated by the dialogue generation model is sensitive, the texts corresponding to the negative example samples may include: texts determined from the texts in the historical chat information that were not filtered out. Since non-sensitive texts account for a larger share and a larger number of the historical chat information, texts drawn from the unfiltered portion are essentially all non-sensitive; a large number of non-sensitive texts, i.e., texts corresponding to negative example samples, can thus be obtained conveniently. The electronic device may label these texts as negative example samples.
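As a rough illustration of this harvesting step, here is a small Python sketch; the record layout with a "filtered" flag, and the assumption that a manual check of the positives follows, are mine rather than the application's.

```python
def harvest_samples(chat_log):
    positives, negatives = [], []
    for msg in chat_log:
        if msg["filtered"]:
            # Filtered-out texts are very likely sensitive; after a
            # manual check they become positive example samples.
            positives.append({"text": msg["text"], "label": 1})
        else:
            # Unfiltered texts are overwhelmingly non-sensitive and
            # serve as negative example samples.
            negatives.append({"text": msg["text"], "label": 0})
    return positives, negatives
```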

Step S520: train the model to be trained using the training samples to obtain the text recognition model.

For the specific type of the model to be trained, reference may be made to the description of the first embodiment, which is not repeated here.

In this embodiment of the present application, the texts corresponding to the positive and negative example samples may be input into the model to be trained for encoding, and a binary classifier is used to classify the encoded texts; the classification result is either positive example text (e.g., sensitive text or text containing person names) or negative example text (e.g., non-sensitive text or text not containing person names). The obtained classification results are compared with the label information of the texts corresponding to the positive and negative example samples, and the parameters of the model to be trained are adjusted accordingly.

During training, the loss function may be a standard binary cross-entropy loss.

In this embodiment of the present application, the model to be trained may include a text encoding model and a binary classification model.
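To make the encode-then-classify training loop concrete, the following is a minimal PyTorch sketch, assuming a toy hashing tokenizer, an EmbeddingBag encoder as a stand-in for the text encoding model, and two placeholder samples; a real system would substitute its own encoder and data.

```python
import torch
import torch.nn as nn

class ToyRecognizer(nn.Module):
    def __init__(self, vocab_size=10000, dim=64):
        super().__init__()
        self.encoder = nn.EmbeddingBag(vocab_size, dim)  # stand-in text encoding model
        self.classifier = nn.Linear(dim, 1)              # binary classification model

    def forward(self, token_ids, offsets):
        return self.classifier(self.encoder(token_ids, offsets)).squeeze(-1)

def encode(texts, vocab_size=10000):
    # Hash characters into ids; a real system would use a proper tokenizer.
    ids = [hash(ch) % vocab_size for t in texts for ch in t]
    offsets, pos = [], 0
    for t in texts:
        offsets.append(pos)
        pos += len(t)
    return torch.tensor(ids), torch.tensor(offsets)

texts  = ["sensitive example", "harmless example"]   # placeholder samples
labels = torch.tensor([1.0, 0.0])                    # 1 = positive example

model = ToyRecognizer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()                     # binary cross-entropy

for _ in range(100):
    ids, offsets = encode(texts)
    loss = loss_fn(model(ids, offsets), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The choice of BCEWithLogitsLoss matches the binary cross-entropy loss mentioned above while keeping the classifier head a single logit.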

Since the training method for a text recognition model provided by the present application uses the training text generation method described in the first embodiment to generate the texts corresponding to the positive example samples, it has beneficial effects corresponding to those of the first embodiment, which are not repeated here.

In one embodiment, as shown in FIG. 4, the training method may further include the following steps S530 to S540.

Step S530: obtain a first text and label the first text to obtain a first sample.

The first text is a text that the text recognition model recognized incorrectly; the actual semantic attribute of the incorrectly recognized text differs from the semantic attribute the text recognition model recognized for it. For example, the actual semantic attribute of the sentence "曹操是三国时期的历史人物" ("Cao Cao was a historical figure of the Three Kingdoms period") is non-sensitive; if the text recognition model recognizes its semantic attribute as sensitive, the sentence is a text the model recognized incorrectly.

Step S540: use the first sample to perform optimization training on the text recognition model.

Specifically, the first text is a text that the trained text recognition model recognized incorrectly while performing text recognition; in the present application, the texts recognized by the text recognition model may be checked manually to find the incorrectly recognized ones.

In the present application, the first text may be labeled manually, for example as a positive example sample or a negative example sample.

In this embodiment, the text recognition model may be optimized and retrained at a preset interval, which may be one month, two months, or another interval; alternatively, the optimization training may be performed in response to an optimization instruction triggered by the user.

In this embodiment, texts that the text recognition model recognized incorrectly are used to optimize and retrain the model, which can further improve its recognition accuracy.

In a specific embodiment, as shown in FIG. 4, before the step of using the first sample to perform optimization training on the text recognition model, the training method may further include the following step S550.

Step S550: obtain a second text and label the second text to obtain a second sample.

The second text contains a second target word, and the semantic attribute expressed by the second text is opposite to that expressed by the first text; the second target word is a word contained in the first text that is consistent with the semantic attribute expressed by the target text. The label information of the second sample is opposite to that of the first sample.

The above step S540 may be implemented as the following step S541.

Step S541: use the first sample and the second sample to perform optimization training on the text recognition model.

In this embodiment, the second sample may be determined manually and input into the electronic device, so that the electronic device obtains it.

For example, in a scenario where the text recognition model is used to recognize sensitive texts, suppose the first text the model mispredicted is "曹操是三个有名的人物" (a sentence about the historical figure Cao Cao): the model predicted it as sensitive, but it is actually non-sensitive, so it is labeled as non-sensitive text, i.e., as a negative example sample. The word contained in the first text that is consistent with the semantic attribute expressed by the target text is "操", i.e., "操" is the second target word. A second text that contains "操" and is opposite in semantic attribute to the non-sensitive first text may then be the obscene sentence "你想怎么操"; this second text is sensitive, so it is labeled as sensitive text, i.e., as a positive example sample.

Since "曹操是三个有名的人物" contains the semantically sensitive character "操", the text recognition model recognized the sentence as sensitive. When the model is optimized with "曹操是三个有名的人物" labeled as non-sensitive, it may easily come to treat any text containing "操" as non-sensitive. In this case, to keep the model from overfitting to particular words, second samples can be added that carry the opposite label to the first sample and contain the second target word "操". Optimizing the model jointly with the first and second samples better avoids overfitting to particular words and improves the recognition accuracy of the model.
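Purely to illustrate the data construction just described, the sketch below pairs the mispredicted first sample with a counter-sample that shares the target word but carries the opposite label; the dict-based record layout is an assumption, and the sentences and labels come from the example above.

```python
# First sample: the model's mistake, relabeled with its true attribute.
first_sample = {"text": "曹操是三个有名的人物", "label": 0}  # non-sensitive

# Second sample: contains the same target word "操" but carries the
# opposite label, so the model cannot key on the character alone.
second_sample = {"text": "你想怎么操", "label": 1}            # sensitive

finetune_set = [first_sample, second_sample]
```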

In one embodiment, the training samples may include reply samples and question-answer spliced samples.

The texts corresponding to the positive example samples among the reply samples include: texts generated by the guide-based dialogue generation model described above. The texts corresponding to the question-answer spliced samples are spliced texts, which include: texts formed by splicing a question text with the reply text corresponding to that question text.

In this embodiment, the texts corresponding to the positive example samples among the reply samples may further include: reply texts obtained from intelligent chat devices, reply texts determined from the dialogue of novels or scripts, and the like.

The texts corresponding to the negative example samples among the reply samples may include: texts determined from resources such as novels, scripts, and online articles; they may also include texts generated by the dialogue generation model.

The user may manually label the positive and negative example samples among the reply samples.

In this embodiment, the text corresponding to a reply sample represents an answer to a question asked by the user.

The spliced text reflects the information formed by splicing the user's question with the respondent's reply.

Dialogue information may be collected manually from resources such as novels, scripts, and online articles, and question texts together with the reply texts corresponding to them may then be extracted from this dialogue information.

Alternatively, the reply text corresponding to a question text may be generated by the dialogue generation model, and the question text together with the generated text is determined as the spliced text.

For example, if the question text is "你想我吗" ("Do you miss me?") and the reply text is "我想你" ("I miss you"), the spliced text may be "你想我吗，我想你".

Some users may deliberately steer a conversation toward sensitive topics. In that case, the text the chat device replies with may not be sensitive when judged on its own, yet the user's question and the reply of the intelligent reply device, read together, may be sensitive. For example, if the user asks "你想要我吗" ("Do you want me?") and the chat device replies "嗯，我要" ("Yes, I do"), the reply text alone is not sensitive, but the spliced text "你想要我吗，嗯，我要" is. This embodiment uses question-answer spliced samples as training samples, so that the trained text recognition model can also recognize question-answer spliced texts; the model is therefore applicable to a wider range of scenarios and can recognize more types of target texts.
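A minimal sketch of how the two sample types might be assembled, assuming a simple full-width-comma splice and a dict-based record layout:

```python
def make_samples(question, reply, reply_label, spliced_label):
    # Reply sample: the reply text judged on its own.
    reply_sample = {"text": reply, "label": reply_label}
    # Question-answer spliced sample: question and reply joined,
    # so context-dependent sensitivity becomes visible to the model.
    spliced_sample = {"text": question + "，" + reply, "label": spliced_label}
    return reply_sample, spliced_sample

# The reply alone is not sensitive (0), but the spliced text is (1).
make_samples("你想要我吗", "嗯，我要", 0, 1)
```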

The second embodiment mainly explains the parts that differ from the first embodiment; content identical or similar to the first embodiment is not repeated.

A third embodiment of the present application provides a text recognition method, including the following steps:

obtaining a text to be recognized;

inputting the text to be recognized into a text recognition model to obtain a recognition result for the text to be recognized.

The text recognition model is trained by the training method of any implementation of the second embodiment.

The text to be recognized may be a text generated by a dialogue generation model.

The text to be recognized may also be a text formed by splicing the user's question text with the text generated by the dialogue generation model. In this case, the training samples used to train the text recognition model include the reply samples and the question-answer spliced samples described above.

The text to be recognized may also be an online comment, dialogue information on a chat tool, text in an online article, and the like; the present application does not limit the specific content of the text to be recognized.

Prior-art methods include sensitive-word matching and rule-based recognition. The text recognition method provided by the present application can recognize obscure sensitive texts well, is less prone to misjudgment, and achieves higher accuracy in recognizing sensitive texts.
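At inference time the flow is simply encode and classify. The sketch below reuses the ToyRecognizer and encode helper from the training sketch above and treats sigmoid scores above 0.5 as positive (for example, sensitive) text; the 0.5 threshold is an illustrative choice.

```python
def recognize(model, text):
    # Returns True when the model judges the text to be positive
    # (e.g., sensitive) text.
    ids, offsets = encode([text])
    with torch.no_grad():
        prob = torch.sigmoid(model(ids, offsets)).item()
    return prob > 0.5

# A spliced question-answer text can be screened the same way.
recognize(model, "你想要我吗，嗯，我要")
```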

Since the text recognition model in the text recognition method of this third embodiment is trained by the method provided in the second embodiment, this embodiment has beneficial effects similar to those of the second embodiment, which are not repeated here.

The third embodiment mainly explains the parts that differ from the first and second embodiments; content identical or similar to those embodiments is not repeated.

As shown in FIG. 5, a fourth embodiment of the present application further provides a training text generation apparatus, where the training text is used for training a model to be trained to obtain a text recognition model, and the apparatus includes:

an information obtaining unit 810, configured to obtain a guide text, where the guide text is consistent with the semantic attribute of a target text, and the target text is a positive example text recognized by the text recognition model;

a text generation unit 820, configured to input the guide text into a guide-based text generation model to obtain an output text consistent with the semantic attribute of the guide text;

a text determination unit 830, configured to determine a training text according to the output text.

Optionally, the apparatus further includes:

a first text obtaining unit, configured to obtain a question text;

the text generation unit is specifically configured to: input the question text and the guide text into a guide-based dialogue generation model to obtain an output text that replies to the question text and is consistent with the semantic attribute of the guide text.

Optionally, the output text includes a plurality of pieces;

the text determination unit is specifically configured to: determine the training text from the plurality of output texts.

Optionally, the text determination unit is specifically configured to:

determine the training text by a first strategy, where the first strategy includes: selecting, from the plurality of output texts, a text containing at least one preset keyword as the training text, where the preset keyword is consistent with the semantic attribute of the target text;

or determine the training text by a second strategy, where the second strategy includes: selecting the first of the plurality of output texts, or randomly selecting one of them, as the training text.

Optionally, the probability of selecting the first strategy to determine the training text is a first preset probability, and the probability of selecting the second strategy to determine the training text is a second preset probability; the first preset probability is greater than the second preset probability, and the sum of the two probabilities is 1.

Optionally, the first preset probability may range from 0.7 to 0.9, and the second preset probability from 0.1 to 0.3.

Optionally, the guide text includes at least one guide word, each guide word being consistent with the semantic attribute of the target text;

the preset keywords include: each of the guide words.

Optionally, the preset keywords further include: first target words, where a first target word is a word contained in any one of the output texts that is consistent with the semantic attribute of the target text and differs from every guide word.

Optionally, the first strategy further includes: when none of the plurality of output texts contains any of the preset keywords, selecting the first of the plurality of output texts to determine the training text.

Optionally, the positive example text is semantically sensitive text, the target text is semantically sensitive text, and the text recognition model is used to recognize texts generated by a dialogue generation model.

A fifth embodiment of the present application further provides a training apparatus for a text recognition model, including:

a sample obtaining unit, configured to obtain training samples, where the training samples include positive example samples and negative example samples, and the texts corresponding to the positive example samples include: training texts generated by the training text generation apparatus of any implementation of the fourth embodiment;

a model training unit, configured to train a model to be trained using the training samples to obtain a text recognition model.

Optionally, the training apparatus further includes:

a second text obtaining unit, configured to obtain a first text, where the first text is a text that the text recognition model recognized incorrectly, and the actual semantic attribute of the incorrectly recognized text differs from the semantic attribute the text recognition model recognized for it;

a sample labeling unit, configured to label the first text to obtain a first sample;

a model optimization unit, configured to perform optimization training on the text recognition model using the first sample.

Optionally, the second text obtaining unit is further configured to:

obtain a second text, where the second text contains a second target word and expresses a semantic attribute opposite to that expressed by the first text, and the second target word is a word contained in the first text that is consistent with the semantic attribute expressed by the target text;

the sample labeling unit is further configured to: label the second text to obtain a second sample, where the label information of the second sample is opposite to that of the first sample;

the model optimization unit is specifically configured to: perform optimization training on the text recognition model using the first sample and the second sample.

Optionally, the training samples include reply samples and question-answer spliced samples;

the texts corresponding to the positive example samples among the reply samples include: training texts determined in the first embodiment by inputting the question text and the guide text into the guide-based dialogue generation model;

the texts corresponding to the question-answer spliced samples are spliced texts, which include: texts formed by splicing a question text with the reply text corresponding to that question text.

A sixth embodiment of the present application further provides a text recognition apparatus, including:

a third text obtaining unit, configured to obtain a text to be recognized;

a text recognition unit, configured to input the text to be recognized into a text recognition model to obtain a recognition result for the text to be recognized, where the text recognition model is trained by the training apparatus of any implementation of the fifth embodiment.

Optionally, the text to be recognized is a text generated by a dialogue generation model;

or the text to be recognized is a text formed by splicing the user's question text with the text generated by the dialogue generation model, where the text recognition model is trained by the training method described in the second embodiment for the case in which the training samples include reply samples and question-answer spliced samples.

Corresponding to the training text generation method provided in the first embodiment of the present application, a seventh embodiment of the present application further provides an electronic device for generating training text. As shown in FIG. 6, the electronic device includes: a processor 901; and a memory 902 configured to store a program of the training text generation method. After the device is powered on and the program of the training text generation method is run by the processor, the following steps are performed:

obtaining a guide text, where the guide text is consistent with the semantic attribute of a target text, and the target text is a positive example text recognized by the text recognition model;

inputting the guide text into a guide-based text generation model to obtain an output text consistent with the semantic attribute of the guide text;

determining a training text according to the output text.

Corresponding to the training method for a text recognition model provided in the second embodiment of the present application, an eighth embodiment of the present application further provides an electronic device for training a text recognition model. The electronic device includes: a processor; and a memory configured to store a program of the training method for the text recognition model. After the device is powered on and the program of the training method for the text recognition model is run by the processor, the following steps are performed:

obtaining training samples, where the training samples include positive example samples and negative example samples, and the texts corresponding to the positive example samples include: training texts generated by the training text generation method of any implementation of the first embodiment;

training a model to be trained using the training samples to obtain a text recognition model.

Corresponding to the text recognition method provided in the third embodiment of the present application, a ninth embodiment of the present application further provides an electronic device for recognizing text. The electronic device includes: a processor; and a memory configured to store a program of the text recognition method. After the device is powered on and the program of the text recognition method is run by the processor, the following steps are performed:

obtaining a text to be recognized;

inputting the text to be recognized into a text recognition model to obtain a recognition result for the text to be recognized, where the text recognition model is trained by the training method of any implementation of the second embodiment.

Corresponding to the training text generation method provided in the first embodiment of the present application, a tenth embodiment of the present application provides a computer-readable storage medium storing a program of the training text generation method; when the program is run by a processor, the following steps are performed:

obtaining a guide text, where the guide text is consistent with the semantic attribute of a target text, and the target text is a positive example text recognized by the text recognition model;

inputting the guide text into a guide-based text generation model to obtain an output text consistent with the semantic attribute of the guide text;

determining a training text according to the output text.

It should be noted that, for detailed descriptions of the apparatus, electronic device, and computer-readable storage medium embodiments provided in the fourth to tenth embodiments of the present application, reference may be made to the related descriptions of the first to third embodiments, which are not repeated here.

Although the present application is disclosed above with preferred embodiments, they are not intended to limit it. Any person skilled in the art may make possible changes and modifications without departing from the spirit and scope of the present application; therefore, the protection scope of the present application shall be subject to the scope defined by its claims.

In a typical configuration, an electronic device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.

The memory may include non-persistent storage in computer-readable media, in the form of random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

1. Computer-readable media include persistent and non-persistent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage media, or any other non-transmission media that can be used to store information accessible by a computing device. As defined herein, computer-readable media exclude transitory media such as modulated data signals and carrier waves.

2. Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.


Claims (18)

1. A training text generation method, wherein the training text is used for training a model to be trained to obtain a text recognition model, the method comprising:
acquiring a guide text, wherein a semantic attribute of the guide text is consistent with that of a target text, and the target text is a positive example text recognized by the text recognition model;
inputting the guide text into a guide-based text generation model to obtain an output text consistent with the semantic attribute of the guide text; and
determining a training text according to the output text.
2. The method of claim 1, wherein before the inputting of the guide text into the guide-based text generation model, the method further comprises:
obtaining a question text;
wherein the inputting of the guide text into the guide-based text generation model to obtain the output text consistent with the semantic attribute of the guide text comprises:
inputting the question text and the guide text into a guide-based dialogue generation model to obtain an output text that replies to the question text and is consistent with the semantic attribute of the guide text.
3. The method of claim 2, wherein the output text comprises a plurality of pieces;
wherein the determining of the training text according to the output text comprises:
determining the training text from the plurality of output texts.
4. The method of claim 3, wherein the determining of the training text from the plurality of output texts comprises:
determining the training text by a first strategy, the first strategy comprising: selecting, from the plurality of output texts, a text containing at least one preset keyword as the training text, wherein the preset keyword is consistent with the semantic attribute of the target text;
or determining the training text by a second strategy, the second strategy comprising: selecting the first of the plurality of output texts, or randomly selecting one of them, as the training text.
5. The method of claim 4, wherein a probability of selecting the first strategy to determine the training text is a first preset probability, a probability of selecting the second strategy to determine the training text is a second preset probability, the first preset probability is greater than the second preset probability, and a sum of the first preset probability and the second preset probability is 1.
6. The method of claim 4, wherein the guide text comprises at least one guide word, each guide word being consistent with the semantic attribute of the target text;
wherein the preset keywords comprise: each of the guide words.
7. The method of claim 6, wherein the preset keywords further comprise: first target words, each first target word being a word that is contained in any one of the output texts, is consistent with the semantic attribute of the target text, and differs from every guide word.
8. The method of claim 4, wherein the first strategy further comprises: when none of the plurality of output texts contains any of the preset keywords, selecting the first of the plurality of output texts to determine the training text.
9. The method of any one of claims 1 to 8, wherein the positive example text is semantically sensitive text, the target text is semantically sensitive text, and the text recognition model is configured to recognize texts generated by a dialogue generation model.
10. A training method for a text recognition model, comprising:
obtaining training samples, wherein the training samples comprise positive example samples and negative example samples, and texts corresponding to the positive example samples comprise: training texts generated by the training text generation method of any one of claims 1 to 9; and
training a model to be trained using the training samples to obtain a text recognition model.
11. The training method of claim 10, further comprising:
acquiring a first text, wherein the first text is a text recognized incorrectly by the text recognition model, and an actual semantic attribute of the incorrectly recognized text differs from the semantic attribute recognized by the text recognition model for the incorrectly recognized text;
labeling the first text to obtain a first sample; and
performing optimization training on the text recognition model using the first sample.
12. The training method of claim 11, wherein before the performing of optimization training on the text recognition model using the first sample, the training method further comprises:
acquiring a second text, wherein the second text contains a second target word, a semantic attribute of the second text is opposite to that of the first text, and the second target word is a word contained in the first text that is consistent with the semantic attribute expressed by the target text; and
labeling the second text to obtain a second sample, wherein label information of the second sample is opposite to that of the first sample;
wherein the performing of optimization training on the text recognition model using the first sample comprises:
performing optimization training on the text recognition model using the first sample and the second sample.
13. The training method of any one of claims 10 to 12, wherein the training samples comprise reply samples and question-answer spliced samples;
wherein texts corresponding to positive example samples among the reply samples comprise: texts generated by the training text generation method of any one of claims 2 to 8;
wherein texts corresponding to the question-answer spliced samples are spliced texts, the spliced texts comprising: texts formed by splicing a question text with a reply text corresponding to the question text.
14. A text recognition method, comprising:
acquiring a text to be recognized; and
inputting the text to be recognized into a text recognition model to obtain a recognition result for the text to be recognized, wherein the text recognition model is trained by the training method of any one of claims 10 to 13.
15. The text recognition method of claim 14, wherein the text to be recognized is a text generated by a dialogue generation model;
or the text to be recognized is a text formed by splicing a question text of a user with a text generated by the dialogue generation model, wherein the text recognition model is trained by the training method of claim 13.
16. A training text generation apparatus, wherein the training text is used for training a model to be trained to obtain a text recognition model, the apparatus comprising:
an information obtaining unit, configured to obtain a guide text, wherein a semantic attribute of the guide text is consistent with that of a target text, and the target text is a positive example text recognized by the text recognition model;
a text generation unit, configured to input the guide text into a guide-based text generation model to obtain an output text consistent with the semantic attribute of the guide text; and
a text determination unit, configured to determine a training text according to the output text.
17. An electronic device, comprising:
a processor; and
a memory for storing a data processing program, wherein when the device is powered on and the program is run by the processor, the method of any one of claims 1 to 15 is performed.
18. A computer-readable storage medium storing a data processing program, wherein when the program is run by a processor, the method of any one of claims 1 to 15 is performed.
CN202210535272.6A 2022-05-17 2022-05-17 Training text generation method, model training device and electronic equipment Pending CN115129866A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210535272.6A CN115129866A (en) 2022-05-17 2022-05-17 Training text generation method, model training device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210535272.6A CN115129866A (en) 2022-05-17 2022-05-17 Training text generation method, model training device and electronic equipment

Publications (1)

Publication Number Publication Date
CN115129866A true CN115129866A (en) 2022-09-30

Family

ID=83376082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210535272.6A Pending CN115129866A (en) 2022-05-17 2022-05-17 Training text generation method, model training device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115129866A (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10853580B1 (en) * 2019-10-30 2020-12-01 SparkCognition, Inc. Generation of text classifier training data
WO2021126388A1 (en) * 2019-12-18 2021-06-24 Microsoft Technology Licensing, Llc Controllable grounded text generation
CN111783455A (en) * 2020-07-13 2020-10-16 网易(杭州)网络有限公司 Training method and device of text generation model and text generation method and device
CN111897964A (en) * 2020-08-12 2020-11-06 腾讯科技(深圳)有限公司 Text classification model training method, device, equipment and storage medium
WO2022033332A1 (en) * 2020-08-14 2022-02-17 腾讯科技(深圳)有限公司 Dialogue generation method and apparatus, network training method and apparatus, storage medium, and device
WO2022095368A1 (en) * 2020-11-04 2022-05-12 平安科技(深圳)有限公司 Question-answer corpus generation method and device based on text generation model
CN112528677A (en) * 2020-12-22 2021-03-19 北京百度网讯科技有限公司 Training method and device of semantic vector extraction model and electronic equipment
CN113642305A (en) * 2021-07-22 2021-11-12 北京三快在线科技有限公司 Text generation method and device, storage medium and electronic equipment
CN113988157A (en) * 2021-09-30 2022-01-28 北京百度网讯科技有限公司 Semantic retrieval network training method and device, electronic equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116628198A (en) * 2023-05-08 2023-08-22 之江实验室 A training method, device, medium and electronic equipment for a text generation model
CN116701926A (en) * 2023-05-16 2023-09-05 阿里巴巴(中国)有限公司 Sample generation model construction, sample generation, sample detection methods
CN116542297A (en) * 2023-07-03 2023-08-04 深圳须弥云图空间科技有限公司 Method and device for generating countermeasure network based on text data training

Similar Documents

Publication Publication Date Title
US11977854B2 (en) Computer implemented methods for the automated analysis or use of data, including use of a large language model
US12353827B2 (en) Computer implemented methods for the automated analysis or use of data, including use of a large language model
US12067362B2 (en) Computer implemented methods for the automated analysis or use of data, including use of a large language model
US20190347563A1 (en) Tailoring Question Answering System Output Based on User Expertise
US9317594B2 (en) Social community identification for automatic document classification
US20160071022A1 (en) Machine Learning Model for Level-Based Categorization of Natural Language Parameters
US20130159277A1 (en) Target based indexing of micro-blog content
CN115129866A (en) Training text generation method, model training device and electronic equipment
CN109791549A (en) Machine customer interaction towards dialogue
WO2012126259A1 (en) System having information distributing and searching functions and information distribution method
US20130096910A1 (en) Method and system for adapting text content to the language behavior of an online community
Yang et al. Graphusion: Leveraging large language models for scientific knowledge graph fusion and construction in nlp education
CN118520854A (en) Text generation method, apparatus, computer device, storage medium, and program product
Setiawan et al. The Optimization of n-Gram Feature Extraction Based on Term Occurrence for Cyberbullying Classification
CN118606452A (en) Question and answer method, device, electronic device and storage medium based on sandbox analysis report
CN118013025A (en) Method, apparatus, device and medium for book search
Llorens et al. Automatic system for identifying and categorizing temporal relations in natural language
CN114328902A (en) Text labeling model construction method and device
CN113505889B (en) Processing method and device of mapping knowledge base, computer equipment and storage medium
CN118052221B (en) Text processing method, device, equipment, storage medium and product
CN115878752A (en) Text emotion analysis method, device, equipment, medium and program product
Khan et al. Identifying artificial intelligence-generated content using the DistilBERT transformer and NLP techniques
Šošić et al. Effective methods for email classification: Is it a business or personal email?
Pamila et al. Natural language processing based identification of Related Short Forum Posts Through Knowledge Based Conceptualization
Winston et al. Natural Language Processing (NLP) for Detecting Fake Profiles via Content Analysis

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination