
CN115033703A - A relation extraction method for aquatic animals and disease texts - Google Patents

A relation extraction method for aquatic animals and disease texts

Info

Publication number
CN115033703A
Authority
CN
China
Prior art keywords
text
entity
label
aquatic animal
animal disease
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210248668.2A
Other languages
Chinese (zh)
Inventor
张思佳
姜鑫
喻文甫
毕甜甜
沙明洋
王梓铭
刘明剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Ocean University
Original Assignee
Dalian Ocean University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Ocean University
Priority to CN202210248668.2A
Publication of CN115033703A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A40/00 Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
    • Y02A40/80 Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in fisheries management
    • Y02A40/81 Aquaculture, e.g. of fish

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for extracting entity semantic relations from aquatic animal disease texts. The method comprises: collecting aquatic animal disease texts; annotating the text data with an annotation tool; inputting the annotated data set into a BERT model, which automatically acquires word-level semantic features and represents and extracts deep semantics, to obtain a second text; embedding the label information into the joint space of the words and labels of the second text and learning it jointly with each character, to output a third text; inputting the third text into a BiLSTM model to capture long-distance word dependencies and context information, to obtain a fourth text; feeding the fourth text into an attention layer to reduce the loss of key information in the text sequence, to obtain a fifth text; and inputting the fifth text into a CRF layer to obtain the joint entity-relation extraction result for the aquatic animal disease text. The method effectively addresses the inaccurate extraction of overlapping relations in document-level relation extraction.

Figure 202210248668

Description

A relation extraction method for aquatic animals and disease texts

Technical field

The invention relates to the technical field of aquatic disease prevention and control, and more particularly to a method for extracting entity semantic relations from aquatic animal disease texts.

Background

In aquaculture, disease in aquatic animals is a major factor affecting farmers' income. Combining domain knowledge of aquatic animal diseases with computing to build an aquatic animal disease knowledge graph lets farmers obtain a timely, accurate diagnosis and appropriate treatment advice when a disease breaks out. Relation extraction is one of the key preliminary tasks in knowledge graph construction: it converts unstructured text into relational data in a uniform format and extracts the features contained in the text data.

Zheng et al. were the first to propose a joint entity-relation extraction method based on a new tagging strategy. The method turns the joint learning model, which covers both named entity recognition and relation classification, into a sequence labeling problem and achieves good results (ZHENG S, HAO Y, LU D, et al. Joint entity and relation extraction based on a hybrid neural network[J]. Neurocomputing, 2016, 257.). Working on a corpus of drug package inserts, Zhang Yukun et al. combined a convolutional neural network with a support vector machine and a conditional random field to build a joint neural network model, with good results (Zhang Yukun, Liu Maofu, Hu Huijun. Chinese medical entity classification and relation extraction based on a joint neural network model[J]. Computer Engineering and Science, 2019, 41(06): 1110-1118.). In the field of rice diseases, pests and weeds, Shen Liyan et al. designed a joint extraction algorithm for the entity relations between rice diseases, pests and weeds and pesticides, combining a bidirectional long short-term memory (BiLSTM) network with an attention mechanism under a new tagging scheme. It addressed two technical problems, namely that the text contains many entities without clear boundaries and that many multi-relation cases exist between pesticide entities and disease, pest and weed entities, and it obtained good results (Shen Liyan, Jiang Haiyan, Hu Bin, et al. Joint extraction algorithm for the relationship between rice diseases, pests and weeds and pesticide entities[J]. Journal of Nanjing Agricultural University, 2020, 43(06): 1151-1161.). In finance, Tang Xiaobo et al. proposed a new sequence tagging scheme based on the characteristics of financial texts and built a BERT-based joint entity-relation extraction model for the financial domain, which identifies overlapping relations between entities in financial texts and reaches an F value of 54.3% (Tang Xiaobo, Liu Zhiyuan. Research on text sequence labeling and joint extraction of entity relationships in the financial field[J]. Information Science, 2021, 39(05): 3-11.). In medicine, Cao Mingyu et al. proposed a neural-network-based method for the joint extraction of drug entities and relations, using a new tagging scheme to turn the joint extraction into an end-to-end sequence labeling task, with an F value of 67.3% (Cao Mingyu, Yang Zhihao, Luo Ling, et al. Joint extraction of drug entities and relationships based on neural network[J]. Computer Research and Development, 2019, 56(07): 1432-1440.). However, all of these methods are limited in capturing entity semantic information in long-span sentences, and they cannot extract new, effective features from document-level relation instances.

Summary of the invention

An embodiment of the present invention provides a method for extracting entity semantic relations from aquatic animal disease texts, comprising:

collecting aquatic animal disease texts and building an aquatic animal disease corpus;

annotating the text data set with an annotation tool;

inputting the annotated data set into a BERT model, which automatically acquires word-level semantic features and represents and extracts deep semantics, to obtain a second text;

applying label embedding to the second text, embedding the label information into the joint space of the words and labels of the second text and learning it jointly with each character, to output a third text;

inputting the jointly learned third text into a BiLSTM model, which further semantically encodes the output of the label embedding layer and captures long-distance word dependencies and context information, to obtain a fourth text;

feeding the fourth text into an attention layer, which concentrates on useful information among a large amount of information and reduces the loss of key information in the text sequence, to obtain a fifth text;

inputting the fifth text into a CRF layer to obtain the final predicted label sequence, and from it the joint entity-relation extraction result for the aquatic animal disease text.

Further, the method also includes preprocessing the collected aquatic animal disease texts, which comprises:

crawling data from aquatic disease websites with Python scripts;

integrating data from the literature and books;

cleaning out useless data.
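The cleaning step can be illustrated with a minimal Python sketch. It is an assumption about how such cleaning might look, not the inventors' code: the function name `clean_text`, the noise-word tuple and the character whitelist are all hypothetical. The two noise words are the pagination labels ("previous page", "next page") that the detailed description gives as examples of residue in crawled pages.

```python
import re

# Hypothetical noise-word list; the description names these two pagination
# labels ("previous page", "next page") as examples of crawl residue.
NOISE_WORDS = ("上一页", "下一页")

# Assumed whitelist: CJK ideographs, ASCII letters and digits, common
# punctuation and whitespace. Everything else is treated as a stray symbol.
_DISALLOWED = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9,。;:、!?%.()()\s-]")


def clean_text(raw: str) -> str:
    """Remove noise words and stray symbols, then collapse whitespace."""
    text = raw
    for word in NOISE_WORDS:
        text = text.replace(word, "")
    text = _DISALLOWED.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(clean_text("上一页 水产动物疾病文本★ 示例。 下一页"))
```

Typo correction after PDF-to-Word conversion, which the description also mentions, is not covered here; it would likely need manual review or a domain dictionary.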

Further, the method also includes dividing the corpus into two parts, a training set and a test set, and annotating the text data in the training set with an annotation tool.
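A possible way to perform this split is sketched below. The source does not state the split ratio or whether the corpus is shuffled, so the 80/20 ratio, the fixed seed and the name `split_corpus` are assumptions made only for illustration.

```python
import random


def split_corpus(sentences, test_ratio=0.2, seed=42):
    """Shuffle a copy of the corpus and split it into (train, test) lists.

    test_ratio and seed are illustrative defaults, not values from the source.
    """
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    train, test = split_corpus([f"sentence {i}" for i in range(10)])
    print(len(train), len(test))
```

Using a private `random.Random(seed)` instance keeps the split reproducible without touching the global random state.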

Further, the method for annotating the text data in the training set with an annotation tool comprises:

setting the disease label as a fixed label, where B-H-1 marks the head of the entity and I-H-1 marks the middle part of the entity;

for entity labels, using HB to mark the head of the entity element, HI to mark its middle part, and O to mark an element that does not belong to any entity.
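The fixed disease labels can be sketched as a character-level tagger. This is a simplified illustration under stated assumptions: only the disease labels B-H-1 and I-H-1 and the outside label O are produced, every non-initial character of a span receives I-H-1, and the labels for other entity types are left out because the text does not fully specify their format. The function name `tag_disease_spans` and the sample sentence are hypothetical.

```python
def tag_disease_spans(chars, disease_spans):
    """Return one tag per character.

    chars: list of characters in the sentence.
    disease_spans: list of (start, end) index pairs, end exclusive.
    Only disease entities are tagged; everything else stays "O".
    """
    tags = ["O"] * len(chars)
    for start, end in disease_spans:
        if not 0 <= start < end <= len(chars):
            raise ValueError(f"invalid span: {(start, end)}")
        tags[start] = "B-H-1"          # head of the disease entity
        for i in range(start + 1, end):
            tags[i] = "I-H-1"          # remaining characters of the entity
    return tags


if __name__ == "__main__":
    sentence = list("草鱼患烂鳃病")   # illustrative sentence, not from the corpus
    for ch, tag in zip(sentence, tag_disease_spans(sentence, [(3, 6)])):
        print(ch, tag)
```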

Further, the step of obtaining the final predicted label sequence comprises:

setting the input sequence X = (X_1, X_2, ..., X_n);

obtaining the probability matrix P output by the attention layer;

denoting the label sequence output by the CRF layer as Y = (Y_1, Y_2, ..., Y_n);

computing the predicted sequence score S(X, Y) according to the following formula, the highest-scoring sequence being the final output sequence;

Figure BDA0003545949920000031

where A_{y_i, y_{i+1}} is the transition-matrix probability of moving from label Y_i to label Y_{i+1}, and P_{i, y_i} is the probability that X_i is labeled Y_i.
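The formula itself survives here only as an image reference. Given the definitions of A and P, it is presumably the standard linear-chain CRF score used in BiLSTM-CRF taggers; the following is a reconstruction under that assumption, not a transcription of the patent figure. The start and end labels y_0 and y_{n+1} are part of the assumed standard form and are not mentioned in the text.

```latex
S(X, Y) = \sum_{i=0}^{n} A_{y_i,\, y_{i+1}} + \sum_{i=1}^{n} P_{i,\, y_i}
```

The first sum scores the label-to-label transitions and the second sums the per-position scores, so a sequence scores highly only when both its individual labels and its label transitions are plausible.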

Compared with the prior art, the method for extracting entity semantic relations from aquatic animal disease texts provided by the embodiment of the present invention has the following beneficial effects:

The invention is proposed to solve the problem of inaccurate results in document-level relation extraction for aquaculture texts. It effectively addresses the inaccurate extraction of overlapping relations in document-level relation extraction. Compared with existing relation extraction methods, the invention achieves the best performance on the aquatic disease text data set, improving both precision and recall and thereby raising the F1 value of relation extraction on aquatic disease texts.

Description of drawings

Figure 1 is a schematic flowchart of the method for extracting entity semantic relations from aquatic animal disease texts provided by an embodiment of the present invention;

Figure 2 shows the annotation result of the method for extracting entity semantic relations from aquatic animal disease texts provided by an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the scope of protection of the present invention.

Referring to Figures 1 and 2, an embodiment of the present invention provides a method for extracting entity semantic relations from aquatic animal disease texts, the method comprising:

The method of the invention feeds the aquatic animal disease text to be processed into a network for extraction, characterized in that the network is built according to the steps shown in Figure 1:

Step 1: Aquatic animal disease texts are collected, then cleaned and preprocessed to form the aquatic animal disease corpus. Data is crawled from aquatic disease websites with Python scripts, data from the literature and books is integrated, and useless data is cleaned out. Aquatic animal disease material obtained from public accounts and websites contains characters and symbols unrelated to aquatic animal diseases in the crawled pages, as well as noise words such as "上一页" (previous page) and "下一页" (next page). Text converted from downloaded PDF documents to Word format also contains typos. These noise characters, noise words and typos must be removed or corrected. Once integrated, the material forms the aquatic animal disease corpus.

Step 2: The corpus is divided into two parts, a training set and a test set, and the text data in the training set is annotated with an annotation tool. The annotation scheme is as follows. Because this work studies relations involving aquatic diseases, every relation holds between a disease and another entity element. The disease label is therefore fixed: B-H-1 marks the head of the entity and I-H-1 marks its middle part. For all other entity labels, HB marks the head of the entity element, HI marks its middle part, and O marks an element that does not belong to any entity. After analysis of the aquatic disease corpus, the main relations are divided into 5 classes; in the actual annotation, the capitalized pinyin initials of each class serve as the corresponding entity label. The entity role label is the digit "1" or "2" and indicates the order of the entity within the extracted triple. For the triples extracted here, the role labels are assigned as follows in the aquatic disease corpus: the disease entity takes role "1", and the entities of all other relations take role "2". H+B/I denotes the overlapping relation between role 1 and role 2. The annotation result is shown in Figure 2.

Step 3: The annotated data set is passed through the BERT model for pre-training, which automatically acquires word-level semantic features and represents and extracts deep semantics;

Step 4: Label embedding is applied to the text after pre-training: the label information is embedded into the joint space of words and labels and learned together with each character. Label information directly carries relation-related information, so making full use of it yields better sentence representations and indirectly improves relation extraction accuracy;

Step 5: The jointly learned data is fed into the BiLSTM model, which further semantically encodes the output of the label embedding layer and better captures long-distance word dependencies and context information;

Step 6: The text data learned by the BiLSTM model is fed into the attention layer, which concentrates on useful information among a large amount of information and reduces the loss of key information in the text sequence;

Step 7: Finally, the text data is fed into the CRF model. The CRF layer improves the accuracy of label prediction by learning the constraints between labels, yielding the final predicted label sequence.
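The attention step can be illustrated with a small, dependency-free sketch. The source does not say which attention variant is used, so scaled dot-product self-attention is assumed here, with queries, keys and values all equal to the BiLSTM outputs and no learned projection matrices. The function names and the toy vectors are hypothetical. The output keeps one vector per position, so it can still feed a sequence-labeling CRF layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(hidden):
    """Scaled dot-product self-attention with queries = keys = values = hidden.

    hidden: list of n vectors (lists of floats), e.g. BiLSTM outputs.
    Returns (weights, context): an n x n weight matrix and n context vectors.
    """
    dim = len(hidden[0])
    scale = math.sqrt(dim)
    weights, context = [], []
    for query in hidden:
        # Similarity of this position to every position in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in hidden]
        row = softmax(scores)
        weights.append(row)
        # Weighted average of all position vectors.
        context.append(
            [sum(w * vec[d] for w, vec in zip(row, hidden)) for d in range(dim)]
        )
    return weights, context


if __name__ == "__main__":
    toy_hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up vectors
    attn, ctx = self_attention(toy_hidden)
    print(attn[0], ctx[0])
```

Each context vector is a weighted average over the whole sequence, which is one way distant but relevant positions can contribute to a token's representation.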

Let the input sequence be X = (X_1, X_2, ..., X_n), let P be the probability matrix output by the attention layer, and let Y = (Y_1, Y_2, ..., Y_n) be the output label sequence. The predicted sequence score S(X, Y) is computed according to formula (1), and the highest-scoring sequence is the final output sequence.

Figure BDA0003545949920000051

where A_{y_i, y_{i+1}} is the transition-matrix probability of moving from label Y_i to label Y_{i+1}, and P_{i, y_i} is the probability that X_i is labeled Y_i.
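Selecting the highest-scoring sequence is usually done with Viterbi decoding. The sketch below assumes the score is the sum of per-position scores P[i][y_i] and transition scores A[y_i][y_{i+1}], which matches the definitions of P and A above; start and end transitions are left out for brevity, and the toy numbers in the usage example are made up.

```python
def sequence_score(emissions, transitions, tags):
    """Sum of emission scores P[i][y_i] and transition scores A[y_i][y_{i+1}]."""
    score = sum(emissions[i][t] for i, t in enumerate(tags))
    score += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return score


def viterbi(emissions, transitions):
    """Return (best_score, best_tags) maximizing sequence_score.

    emissions: n rows, one score per tag for each position.
    transitions: square matrix, transitions[a][b] scores tag a followed by b.
    """
    n_tags = len(transitions)
    best = list(emissions[0])      # best score of any path ending in each tag
    backptrs = []
    for row in emissions[1:]:
        step_best, step_ptr = [], []
        for cur in range(n_tags):
            prev = max(range(n_tags), key=lambda p: best[p] + transitions[p][cur])
            step_ptr.append(prev)
            step_best.append(best[prev] + transitions[prev][cur] + row[cur])
        best = step_best
        backptrs.append(step_ptr)
    last = max(range(n_tags), key=lambda t: best[t])
    path = [last]
    for ptrs in reversed(backptrs):   # follow back-pointers to the start
        path.append(ptrs[path[-1]])
    path.reverse()
    return best[last], path


if __name__ == "__main__":
    P = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.5]]   # toy emission scores
    A = [[0.5, -1.0], [-1.0, 0.5]]             # toy transition scores
    print(viterbi(P, A))
```

In this toy example the second position on its own prefers tag 1, but the transition penalties make the all-zeros path the overall best, which illustrates how the CRF layer uses constraints between labels.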

Table 1 compares the joint entity-relation extraction results of the method of the invention with those of the prior art. From top to bottom, Table 1 lists the results of the BiLSTM+CRF model, the BERT+BiGRU+CRF model, the BERT+BiLSTM+CRF model and the present model. The comparison shows that methods such as BiLSTM+CRF perform poorly on document-level relation extraction. The model proposed by the invention is more accurate and addresses both document-level relation extraction and the large number of overlapping relations.

Table 1. Extraction results

Figure BDA0003545949920000061
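The table is only available as an image, so its numbers are not reproduced here. The precision, recall and F1 values that such a comparison reports can be computed as in the sketch below, which assumes exact-match scoring over sets of (head, relation, tail) triples. The scoring rule, the function name and the sample triples are illustrative assumptions, not details taken from the patent.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F1 over sets of (head, relation, tail) triples.

    A predicted triple counts as correct only if it matches a gold triple exactly.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical placeholder triples, not data from the patent.
    gold = {("disease_A", "REL_1", "entity_B"), ("disease_A", "REL_2", "entity_C")}
    pred = {("disease_A", "REL_1", "entity_B"), ("disease_A", "REL_2", "entity_D")}
    print(precision_recall_f1(pred, gold))
```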

The above discloses only several specific embodiments of the present invention. Those skilled in the art can make various changes and modifications to the embodiments without departing from the spirit and scope of the invention. The embodiments of the present invention are not limited to those disclosed, and any variation that a person skilled in the art can conceive of shall fall within the scope of protection of the present invention.

Claims (5)

1. A method for extracting entity semantic relations from aquatic animal disease texts, characterized by comprising: collecting aquatic animal disease texts; annotating the text data with an annotation tool; inputting the annotated data set into a BERT model, which automatically acquires word-level semantic features and represents and extracts deep semantics, to obtain a second text; applying label embedding to the second text, embedding the label information into the joint space of the words and labels of the second text and learning it jointly with each character, to output a third text; inputting the jointly learned third text into a BiLSTM model, which further semantically encodes the output of the label embedding layer and captures long-distance word dependencies and context information, to obtain a fourth text; feeding the fourth text into an attention layer, which concentrates on useful information among a large amount of information and reduces the loss of key information in the text sequence, to obtain a fifth text; and inputting the fifth text into a CRF layer to obtain the final predicted label sequence, and from it the joint entity semantic relation extraction result for the aquatic animal disease text.

2. The method for extracting entity semantic relations from aquatic animal disease texts according to claim 1, characterized by further comprising preprocessing the collected aquatic animal disease texts, which comprises: crawling data from aquatic disease websites with Python scripts; integrating data from the literature and books; and cleaning out useless data.

3. The method for extracting entity semantic relations from aquatic animal disease texts according to claim 1, characterized by further comprising dividing the corpus into two parts, a training set and a test set, and annotating the text data in the training set with an annotation tool.

4. The method for extracting entity semantic relations from aquatic animal disease texts according to claim 3, characterized in that the method for annotating the text data in the training set with an annotation tool comprises: setting the disease label as a fixed label, where B-H-1 marks the head of the entity and I-H-1 marks the middle part of the entity; and, for entity labels, using HB to mark the head of the entity element, HI to mark its middle part, and O to mark an entity element that does not belong to any entity.

5. The method for extracting entity semantic relations from aquatic animal disease texts according to claim 1, characterized in that the step of obtaining the final predicted label sequence comprises: setting the input sequence X = (X_1, X_2, ..., X_n); obtaining the probability matrix P output by the attention layer; denoting the label sequence output by the CRF layer as Y = (Y_1, Y_2, ..., Y_n); and computing the predicted sequence score S(X, Y) according to the following formula, the highest-scoring sequence being the final output sequence;

Figure FDA0003545949910000021

where A_{y_i, y_{i+1}} is the transition-matrix probability of moving from label Y_i to label Y_{i+1}, and P_{i, y_i} is the probability that X_i is labeled Y_i.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210248668.2A CN115033703A (en) 2022-03-14 2022-03-14 A relation extraction method for aquatic animals and disease texts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210248668.2A CN115033703A (en) 2022-03-14 2022-03-14 A relation extraction method for aquatic animals and disease texts

Publications (1)

Publication Number Publication Date
CN115033703A (en) 2022-09-09

Family

ID=83118620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210248668.2A Pending CN115033703A (en) 2022-03-14 2022-03-14 A relation extraction method for aquatic animals and disease texts

Country Status (1)

Country Link
CN (1) CN115033703A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021190236A1 (en) * 2020-03-23 2021-09-30 浙江大学 Entity relation mining method based on biomedical literature
CN112818676A (en) * 2021-02-02 2021-05-18 东北大学 Medical entity relationship joint extraction method
CN112989831A (en) * 2021-03-29 2021-06-18 华南理工大学 Entity extraction method applied to network security field
CN113434700A (en) * 2021-07-09 2021-09-24 大连海洋大学 Method for constructing knowledge graph for diagnosing, preventing and treating aquatic animal diseases

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
刘子晴: "中医门诊电子病历关键临床信息抽取方法研究", 《中国博士论文全文数据库医药卫生科技辑》, no. 2, 15 February 2022 (2022-02-15), pages 37 - 39 *
李灵芳;杨佳琦;李宝山;杜永兴;胡伟健;: "基于BERT的中文电子病历命名实体识别", 内蒙古科技大学学报, no. 01, 15 March 2020 (2020-03-15) *
杨鹤 等: "基于双重注意力机制的渔业标准实体关系抽取", 《农业工程学报》, vol. 37, no. 14, 16 July 2021 (2021-07-16), pages 204 - 212 *
王仁武;孟现茹;孔琦;: "实体―属性抽取的GRU+CRF方法", 现代情报, no. 10, 15 October 2018 (2018-10-15) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination