CN115033703A

CN115033703A - A relation extraction method for aquatic animals and disease texts

Info

Publication number: CN115033703A
Application number: CN202210248668.2A
Authority: CN
Inventors: 张思佳; 姜鑫; 喻文甫; 毕甜甜; 沙明洋; 王梓铭; 刘明剑
Original assignee: Dalian Ocean University
Current assignee: Dalian Ocean University
Priority date: 2022-03-14
Filing date: 2022-03-14
Publication date: 2022-09-09

Abstract

The invention discloses a method for extracting entity semantic relations from aquatic animal disease texts, comprising: collecting aquatic animal disease texts; labeling the text data with a labeling tool; inputting the labeled data set into a BERT model to automatically acquire the semantic features of words and to represent and extract deep semantics, obtaining a second text; embedding the label information into the joint space of the words and labels of the second text and learning it jointly with each character, outputting a third text; inputting the third text into a BiLSTM model for learning to obtain the correlations and context information of long-distance words, obtaining a fourth text; sending the fourth text to an Attention layer to reduce the loss of key information in the text sequence, obtaining a fifth text; and inputting the fifth text into a CRF layer to obtain the joint entity-relation extraction result for the aquatic animal disease text. The method can effectively address the inaccurate extraction of overlapping relations in document-level relation extraction.

Description

A relation extraction method for aquatic animals and disease texts

Technical Field

The invention relates to the technical field of aquatic disease prevention and control, and more particularly to a method for extracting entity semantic relations from aquatic animal disease texts.

Background

In aquaculture, diseases of aquatic animals are a major factor affecting farmers' income. By combining domain knowledge of aquatic animal diseases with computing, a knowledge graph of aquatic animal diseases can be constructed so that farmers obtain a timely, accurate diagnosis and correct, appropriate treatment advice when a disease occurs. Relation extraction is one of the important preliminary tasks in knowledge graph construction: it converts unstructured text into relational data in a unified format and extracts the features contained in the text data.

Zheng et al. first proposed a joint entity-relation extraction method based on a new labeling strategy. The method converts a joint learning model covering the two tasks of named entity recognition and relation classification into a sequence labeling problem and achieves good results (ZHENG S, HAO Y, LU D, et al. Joint entity and relation extraction based on a hybrid neural network[J]. Neurocomputing, 2016, 257.). Working on a corpus of drug package inserts, Zhang Yukun et al. combined a convolutional neural network with a support vector machine and a conditional random field to build a joint neural network model, with good results (Zhang Yukun, Liu Maofu, Hu Huijun. Chinese medical entity classification and relation extraction based on a joint neural network model[J]. Computer Engineering and Science, 2019, 41(06): 1110-1118.). In the field of rice diseases, pests and weeds, Shen Liyan et al. designed a joint extraction algorithm for the entity relations between rice diseases, pests and weeds and pesticides, combining a bidirectional long short-term memory network with an attention mechanism under a new labeling scheme. It addresses two technical problems, namely that the text contains many entities without clear boundaries and that many multi-relations exist between pesticide entities and disease, pest and weed entities, and it achieves good results (Shen Liyan, Jiang Haiyan, Hu Bin, et al. Joint extraction algorithm for the relations between rice diseases, pests and weeds and pesticide entities[J]. Journal of Nanjing Agricultural University, 2020, 43(06): 1151-1161.). In the financial field, Tang Xiaobo et al. proposed a new sequence labeling scheme based on the characteristics of financial texts and built a BERT-based joint entity-relation extraction model for the financial domain, which identifies overlapping relations between entities in financial texts with an F value of 54.3% (Tang Xiaobo, Liu Zhiyuan. Research on text sequence labeling and joint entity-relation extraction in the financial field[J]. Information Science, 2021, 39(05): 3-11.). In the medical field, Cao Mingyu et al. proposed a neural-network-based method for the joint extraction of drug entities and relations, using a new labeling scheme to convert the joint extraction into an end-to-end sequence labeling task, with an F value of 67.3% (Cao Mingyu, Yang Zhihao, Luo Ling, et al. Joint extraction of drug entities and relations based on neural networks[J]. Journal of Computer Research and Development, 2019, 56(07): 1432-1440.). However, these methods are limited in capturing entity semantic information in long-span sentences, and they cannot extract new, effective features from document-level relation examples.

Summary of the Invention

An embodiment of the present invention provides a method for extracting entity semantic relations from aquatic animal disease texts, comprising:

collecting aquatic animal disease texts and constructing an aquatic animal disease corpus;

labeling the text data with a labeling tool;

inputting the labeled data set into a BERT model to automatically acquire the semantic features of words and to represent and extract deep semantics, obtaining a second text;

performing label embedding on the second text, in which the label information is embedded into the joint space of the words and labels of the second text and learned jointly with each character, outputting a third text;

inputting the jointly learned third text into a BiLSTM model, which further semantically encodes the output of the label embedding layer to capture the correlations and context information of long-distance words, obtaining a fourth text;

sending the fourth text to an Attention layer, which concentrates on the useful information within a large amount of information and reduces the loss of key information in the text sequence, obtaining a fifth text;

inputting the fifth text into a CRF layer to obtain the final predicted label sequence, and thereby the joint entity-relation extraction result for the aquatic animal disease text.

Further, the method also includes data preprocessing of the collected aquatic animal disease texts, comprising:

crawling data from aquatic disease websites with Python scripts;

integrating data from literature and books;

cleaning out useless data.

Further, the method also includes dividing the corpus into two parts, a training set and a test set, and labeling the text data in the training set with a labeling tool.

Further, the labeling method used to label the text data in the training set with the labeling tool comprises:

the label of a disease is a fixed label, where B-H-1 denotes the head of the entity and I-H-1 denotes the middle part of the entity;

all entity labels use HB for the head of the entity element and HI for its middle part, while O indicates that the element does not belong to any entity.

Further, the step of obtaining the final predicted label sequence comprises:

setting the input sequence X = (X₁, X₂, ..., Xₙ);

obtaining the probability matrix P output by the Attention layer;

taking the label sequence output by the CRF layer as Y = (Y₁, Y₂, ..., Yₙ);

calculating the predicted sequence score S(X, Y) according to the following formula, the sequence with the highest score being the final output sequence;

where A_{y_i, y_{i+1}} denotes the probability, in the transition matrix, of moving from label Y_i to label Y_{i+1}, and P_{i, y_i} denotes the probability that X_i is labeled Y_i.
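The formula itself is not reproduced in this text. Assuming the score takes the usual linear-chain CRF form, a reconstruction consistent with the definitions of A and P given here would be the following. The summation limits and the omission of start and end labels are assumptions, not details taken from the filing:

```latex
S(X, Y) = \sum_{i=1}^{n-1} A_{y_i,\, y_{i+1}} + \sum_{i=1}^{n} P_{i,\, y_i},
\qquad
Y^{*} = \arg\max_{\tilde{Y}} S(X, \tilde{Y})
```

Here Y^{*} denotes the highest-scoring label sequence, which is taken as the final output sequence.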

An embodiment of the present invention provides a method for extracting entity semantic relations from aquatic animal disease texts which, compared with the prior art, has the following beneficial effects:

The present invention addresses inaccurate results in document-level text relation extraction in the aquaculture domain by proposing a method for extracting entity semantic relations from aquatic animal disease texts, which can effectively resolve the inaccurate extraction of overlapping relations in document-level relation extraction. Compared with existing relation extraction methods, the invention achieves the best performance on the aquatic disease text data set, improving both precision and recall and thereby effectively raising the F1 value of relation extraction on aquatic disease texts.

Brief Description of the Drawings

Fig. 1 is a schematic flowchart of the method for extracting entity semantic relations from aquatic animal disease texts provided by an embodiment of the present invention;

Fig. 2 shows the labeling results of the method for extracting entity semantic relations from aquatic animal disease texts provided by an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

Referring to Figs. 1-2, an embodiment of the present invention provides a method for extracting entity semantic relations from aquatic animal disease texts, the method comprising:

In the method of the present invention, the aquatic animal disease text to be processed is fed into a network for extraction, characterized in that the network is constructed according to the steps shown in Fig. 1:

Step 1. Collect aquatic animal disease texts, then clean and preprocess them to form an aquatic animal disease corpus. Data are crawled from aquatic disease websites with Python scripts, data from literature and books are integrated, and useless data are cleaned out. For corpus material obtained from public accounts and websites, the crawled pages contain characters and symbols unrelated to aquatic animal diseases, as well as noise words such as "previous page" (上一页) and "next page" (下一页). Text converted from downloaded PDF documents into Word format contains wrongly recognized characters. These noise characters and noise words must be removed and the wrong characters corrected. After integration, the aquatic animal disease corpus is formed.
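The cleaning part of Step 1 can be sketched as follows. This is a minimal illustration, not the procedure used in the filing: the function name `clean_page_text`, the `NOISE_WORDS` tuple, and the set of characters kept by the regular expression are all assumptions chosen for the example. Only the two noise words come from the text.

```python
import re

# The two noise words named in Step 1; a real list would be longer (assumption).
NOISE_WORDS = ("上一页", "下一页")

# Assumption: keep CJK characters, ASCII letters and digits, and common punctuation;
# everything else is treated as a stray symbol.
_DISALLOWED = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9，。、；：？！（）％%.,;:()\-\s]")


def clean_page_text(raw: str) -> str:
    """Remove noise words and stray symbols from the text of one crawled page."""
    text = raw
    for word in NOISE_WORDS:
        text = text.replace(word, "")
    text = _DISALLOWED.sub("", text)
    # Collapse the whitespace runs left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()
```

Correcting wrongly recognized characters from PDF conversion is not covered by this sketch, since it generally needs manual review or a domain dictionary.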

Step 2. Divide the corpus into two parts, a training set and a test set, and label the text data in the training set with a labeling tool. The labeling method is as follows. This work studies relations in aquatic diseases, so every relation holds between a disease and another entity element. The disease label is a fixed label: B-H-1 denotes the head of the entity and I-H-1 denotes its middle part. All other entity labels use HB for the head of the entity element and HI for its middle part, while O indicates that the element does not belong to any entity. After analysis of the aquatic disease corpus, the main relations are divided into 5 categories, and in actual labeling the capitalized initials of each relation's Chinese pinyin serve as the corresponding entity label. Entity role labels are the digits "1" and "2", which give the order of the entity in the extracted triple. Since only triples are extracted here, the role labels are assigned as follows: in the aquatic disease corpus used, the disease entity takes role "1" and the entities of all other relations take role "2". H+B/I denotes the overlapping relation between role 1 and role 2. The labeling results are shown in Fig. 2.
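To make the fixed disease label concrete, the sketch below assigns one tag per character, using B-H-1 for the first character of a disease mention, I-H-1 for the remaining characters, and O elsewhere. It covers only the disease label described in Step 2. The function name `tag_disease_spans` and the dictionary-lookup approach are assumptions for illustration, since the filing labels the data with an annotation tool rather than by string matching.

```python
from typing import List


def tag_disease_spans(sentence: str, disease_names: List[str]) -> List[str]:
    """Return one tag per character: B-H-1 / I-H-1 for disease mentions, else O."""
    tags = ["O"] * len(sentence)
    for name in disease_names:
        if not name:
            continue
        start = sentence.find(name)
        while start != -1:
            end = start + len(name)
            # Tag only spans that are still free, so mentions never overwrite each other.
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = "B-H-1"
                for i in range(start + 1, end):
                    tags[i] = "I-H-1"
            start = sentence.find(name, end)
    return tags
```

For example, tagging the sentence "草鱼出血病危害大" with the disease name "出血病" marks the third character as B-H-1 and the next two as I-H-1.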

Step 3: Pre-train on the labeled data set with the BERT model to automatically acquire the semantic features of words and to represent and extract deep semantics.

Step 4: Perform label embedding on the pre-trained text, in which the label information is embedded into the joint space of words and labels and learned together with each character. Label information directly carries relation-related information, so making full use of it yields better sentence representations and indirectly improves the accuracy of relation extraction.
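The filing does not specify how the joint word-label space of Step 4 is computed. One common reading, offered here only as an assumption, is that label vectors live in the same space as token vectors and each token is augmented with its cosine similarity to every label. The names `cosine` and `label_aware_tokens` are hypothetical, and in a trained model the label vectors would be learned parameters rather than fixed lists.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_aware_tokens(tokens: List[Vector], labels: List[Vector]) -> List[Vector]:
    """Append to each token vector its similarity to every label vector."""
    return [tok + [cosine(tok, lab) for lab in labels] for tok in tokens]
```

Each output vector has the original token dimensions followed by one compatibility score per label, which a downstream encoder can then consume.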

Step 5: Feed the jointly learned data into the BiLSTM model, which further semantically encodes the output of the label embedding layer and better captures the correlations and context information of long-distance words.

Step 6: Feed the text data learned by the BiLSTM model into the Attention layer, which concentrates on the useful information within a large amount of information and reduces the loss of key information in the text sequence.
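The form of the Attention layer in Step 6 is not stated. As an illustration only, the sketch below assumes scaled dot-product self-attention in which the BiLSTM outputs serve as queries, keys and values without learned projections. The function names `softmax` and `self_attention` are assumptions for this example.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(states: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with queries = keys = values = states."""
    dim = len(states[0])
    attended = []
    for query in states:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in states]
        weights = softmax(scores)
        # Each output is a weighted average of all states in the sequence.
        attended.append([sum(w * v[d] for w, v in zip(weights, states)) for d in range(dim)])
    return attended
```

Because every output position is a weighted average over the whole sequence, information from distant positions can reach each character directly, which is the effect Step 6 describes.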

Step 7: Finally, feed the text data into the CRF model. The CRF layer improves the accuracy of label prediction by learning the constraints between labels, yielding the final predicted label sequence.

Let the input sequence be X = (X₁, X₂, ..., Xₙ), the probability matrix output by the Attention layer be P, and the output label sequence be Y = (Y₁, Y₂, ..., Yₙ). The predicted sequence score S(X, Y) is calculated according to formula (1), and the sequence with the highest score is the final output sequence.

where A_{y_i, y_{i+1}} denotes the probability, in the transition matrix, of moving from label Y_i to label Y_{i+1}, and P_{i, y_i} denotes the probability that X_i is labeled Y_i.
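The scoring and decoding described in Step 7 can be sketched as follows, under the assumption that S(X, Y) is the sum of the emission terms P[i][y_i] and the transition terms A[y_i][y_{i+1}], with no separate start or end labels. The function names `sequence_score` and `viterbi` and the list-of-lists matrices are illustrative choices, not taken from the filing.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def sequence_score(P: Matrix, A: Matrix, y: List[int]) -> float:
    """S(X, Y): emission scores P[i][y_i] plus transition scores A[y_i][y_{i+1}]."""
    emission = sum(P[i][tag] for i, tag in enumerate(y))
    transition = sum(A[a][b] for a, b in zip(y, y[1:]))
    return emission + transition


def viterbi(P: Matrix, A: Matrix) -> Tuple[List[int], float]:
    """Return the highest-scoring label sequence and its score."""
    n_tags = len(P[0])
    best = list(P[0])            # best score of any path ending in each tag
    back: List[List[int]] = []   # back-pointers for positions 1..n-1
    for i in range(1, len(P)):
        prev_best, best, pointers = best, [], []
        for t in range(n_tags):
            p = max(range(n_tags), key=lambda s: prev_best[s] + A[s][t])
            pointers.append(p)
            best.append(prev_best[p] + A[p][t] + P[i][t])
        back.append(pointers)
    last = max(range(n_tags), key=lambda t: best[t])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]
```

Viterbi decoding finds the maximizing sequence without enumerating every candidate, and the score it returns equals `sequence_score` evaluated on the returned path.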

Table 1 compares the joint entity-relation extraction results of the method of the present invention with those of the prior art. From top to bottom, Table 1 lists the joint entity-relation extraction results of the BiLSTM+CRF model, the BERT+BiGRU+CRF model, the BERT+BiLSTM+CRF model, and the present model. The comparison shows that methods such as BiLSTM+CRF perform inadequately on document-level relation extraction. The model proposed by the invention is more accurate and addresses both document-level relation extraction and the large number of overlapping relations.

Table 1. Extraction results
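The values of Table 1 are not reproduced in this text. For reference, the precision, recall and F1 measures named in the description can be computed from extracted triples as sketched below. Exact matching of (head entity, relation, tail entity) triples is an assumption, as the filing does not state its matching criterion, and the function name `precision_recall_f1` is hypothetical.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def precision_recall_f1(predicted: Set[Triple], gold: Set[Triple]) -> Tuple[float, float, float]:
    """Score predicted triples against gold triples by exact match."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

F1 is the harmonic mean of precision and recall, so it rises only when both improve together.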

The above discloses only a few specific embodiments of the present invention. Those skilled in the art may make various changes and modifications to the embodiments without departing from the spirit and scope of the invention. However, the embodiments of the invention are not limited thereto, and any variation that a person skilled in the art can conceive shall fall within the protection scope of the invention.

Claims

1. A method for extracting entity semantic relations from aquatic animal disease texts, characterized in that it comprises:

collecting aquatic animal disease texts;

labeling the text data with a labeling tool;

inputting the labeled data set into a BERT model to automatically acquire the semantic features of words and to represent and extract deep semantics, obtaining a second text;

performing label embedding on the second text, in which the label information is embedded into the joint space of the words and labels of the second text and learned jointly with each character, outputting a third text;

inputting the jointly learned third text into a BiLSTM model, which further semantically encodes the output of the label embedding layer to capture the correlations and context information of long-distance words, obtaining a fourth text;

sending the fourth text to an Attention layer, which concentrates on the useful information within a large amount of information and reduces the loss of key information in the text sequence, obtaining a fifth text;

inputting the fifth text into a CRF layer to obtain the final predicted label sequence, and thereby the joint extraction result for the entity semantic relations of the aquatic animal disease text.

2. The method for extracting entity semantic relations from aquatic animal disease texts as claimed in claim 1, characterized in that it further comprises data preprocessing of the collected aquatic animal disease texts, comprising:

crawling data from aquatic disease websites with Python scripts;

integrating data from literature and books;

cleaning out useless data.

3. The method for extracting entity semantic relations from aquatic animal disease texts as claimed in claim 1, characterized in that it further comprises dividing the corpus into two parts, a training set and a test set, and labeling the text data in the training set with a labeling tool.

4. The method for extracting entity semantic relations from aquatic animal disease texts as claimed in claim 3, characterized in that the labeling method used to label the text data in the training set with the labeling tool comprises:

the label of a disease is a fixed label, where B-H-1 denotes the head of the entity and I-H-1 denotes the middle part of the entity;

all entity labels use HB for the head of the entity element and HI for its middle part, while O indicates that the element does not belong to any entity.

5. The method for extracting entity semantic relations from aquatic animal disease texts as claimed in claim 1, characterized in that the step of obtaining the final predicted label sequence comprises:

setting the input sequence X = (X₁, X₂, ..., Xₙ);

obtaining the probability matrix P output by the Attention layer;

taking the label sequence output by the CRF layer as Y = (Y₁, Y₂, ..., Yₙ);

calculating the predicted sequence score S(X, Y) according to the following formula, the sequence with the highest score being the final output sequence;

where A_{y_i, y_{i+1}} denotes the probability, in the transition matrix, of moving from label Y_i to label Y_{i+1}, and P_{i, y_i} denotes the probability that X_i is labeled Y_i.