
CN112699662B - False information early detection method based on text structure algorithm - Google Patents


Info

Publication number
CN112699662B
CN112699662B (application CN202011632799.8A)
Authority
CN
China
Prior art keywords
document
representation
language
unit
node
Prior art date
Legal status
Active
Application number
CN202011632799.8A
Other languages
Chinese (zh)
Other versions
CN112699662A (en)
Inventor
王莉 (Wang Li)
王宇航 (Wang Yuhang)
杨延杰 (Yang Yanjie)
Current Assignee
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date
Filing date
Publication date
Application filed by Taiyuan University of Technology
Priority to CN202011632799.8A
Publication of CN112699662A
Application granted
Publication of CN112699662B
Legal status: Active


Classifications

    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/216: Parsing using statistical methods
    • G06F40/30: Semantic analysis
    • G06N3/045: Combinations of networks
    • G06N3/047: Probabilistic or stochastic networks
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08: Learning methods


Abstract

The invention provides a method for early detection of false information based on a text structure algorithm, and belongs to the technical field of false information detection based on text structure algorithms. The technical problem to be solved is to provide an improved early detection method for false information based on a text structure algorithm. The technical scheme adopted is: acquire the discourse units of a document; learn representations of the discourse units, where structural representation learning constructs a document discourse structure graph based on Rhetorical Structure Theory and obtains the global structural representation of the discourse units through a multi-relational graph neural network, and contextual representation learning takes the positional adjacency of discourse units within the document as the computational object to obtain their local context representations; fuse all discourse units of the document into a document representation using a gated recurrent unit combined with a global attention mechanism; and use the generated document representation for false information detection, obtaining the probability that the input document is false information. The invention is applied to false information detection.

Description

Early detection method for false information based on a text structure algorithm

Technical Field

The invention relates to a method for early detection of false information based on a text structure algorithm, and belongs to the technical field of false information detection based on text structure algorithms.

Background

Existing false-information detection algorithms focus mainly on exploiting the textual content of information together with other external information. Fake-news detection algorithms based on external information study the derivative features that information generates as it propagates through a social network, typically considering features such as reposts and comments, user profiles, propagation time, and article source. Although these algorithms have achieved some success, they have shortcomings: because they rely on external information, collecting the data costs substantial time and human resources, and some of the data are sensitive (e.g., user profiles), touching on privacy protection. In addition, the collected unstructured data are heterogeneous and contain a great deal of noise and missing values, requiring further cleaning and preprocessing, which undoubtedly increases the difficulty and workload of the detection task. Consequently, such algorithms cannot judge the veracity of information directly without external auxiliary information; they face problems of difficult data collection, missing data, and noise, leading to low detection efficiency and poor timeliness, and making it hard to contain false information and stop losses in the early stage of an outbreak.

Text is the primary carrier of information, so automatic false-information detection that relies on text content alone is the most direct and convenient approach. Most current research on text-based false-information detection exploits differences in linguistic features between false and real information, using machine learning or deep learning. Machine-learning-based detection algorithms first construct features through feature engineering, such as n-grams, punctuation, psycholinguistic words, and sentiment polarity, and feed the extracted feature set into models such as support vector machines (SVM) and logistic regression (LR). These algorithms require hand-crafted discrete linguistic features, which is tedious and time-consuming; the optimal feature combination is hard to find, detection efficiency is low, and they cannot adapt to the evolution of false information.

Deep learning is considered better than classical machine learning at discovering latent information in text and shows clear advantages in exploiting semantic knowledge. Deep-learning-based false-information detection algorithms apply models such as recurrent neural networks (RNN) and convolutional neural networks (CNN), which can encode contextual information and long-distance dependencies between words and automatically learn deep semantic representations of the text content. However, existing deep-learning approaches focus on simple representation learning for documents or sentences and neglect effective modeling of a document's high-level text structure and key contextual information. Moreover, because these works take words or sentences as the computational units, they suffer from noise and from insufficient representation of overly long sentences.

Summary of the Invention

To overcome the deficiencies of the prior art, the technical problem to be solved by the present invention is to provide an improved method for early detection of false information based on a text structure algorithm.

To solve the above technical problem, the invention adopts the following technical scheme: a method for early detection of false information based on a text structure algorithm, comprising the following detection steps:

Step 1: Build computation module 1 for acquiring document discourse units; this module segments the document to be detected to obtain its elementary discourse units (EDUs);

Step 2: Build computation module 2 for representation learning of discourse units;

Step 2.1: Learn structural representations of discourse units from the discourse rhetorical structure: construct a document discourse structure graph based on Rhetorical Structure Theory (RST), and obtain the global structural representation of the discourse units through a multi-relational graph neural network;

Construct the document discourse structure graph using the dependency structure:

Step 2.1.1: Let the discourse structure graph of the document D_i to be detected be G_i = (V, E), where the node set is expressed as:

V = {EDU_1, EDU_2, ..., EDU_|U|},

representing the |U| nodes of the graph; each node is an elementary discourse unit (EDU);

Discourse units are defined to be connected pairwise according to specific rhetorical relations, each relation belonging to a set R′. The edge set E of links between discourse units is then expressed as:

E = {(EDU_u, r, EDU_v) | EDU_u ∈ V, EDU_v ∈ V, r ∈ R′};

Step 2.1.2: Define the adjacency matrix and the feature matrix of the discourse structure graph for computing and learning the structural representation, where the adjacency matrix describes the topology between nodes and the feature matrix describes the feature representation of each node;

Define the adjacency matrix as A ∈ R^{|U|×|U|}; any element A_uv of the matrix is expressed as:

A_uv = 1 if (EDU_u, r, EDU_v) ∈ E for some r ∈ R′, and A_uv = 0 otherwise;

Define the feature matrix as:

X^(0) ∈ R^{|U|×m};

In the above, the vector representation of any discourse-unit node u is x_u ∈ R^m, where m is the dimension of the discourse-unit vectors. Assuming |R′| types of edges are connected to u, the set of all neighbors of u under any one edge type r is denoted N_u^r = {v | (EDU_u, r, EDU_v) ∈ E};
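The adjacency structure of steps 2.1.1 and 2.1.2 can be sketched as one 0/1 matrix per rhetorical relation. The relation names and the helper below are hypothetical illustrations, not the patent's relation inventory or implementation:

```python
def build_adjacency(num_edus, edges):
    """Build one |U| x |U| 0/1 adjacency matrix per rhetorical relation,
    following A_uv = 1 iff (EDU_u, r, EDU_v) is in the edge set E."""
    relations = sorted({r for _, r, _ in edges})
    adj = {r: [[0] * num_edus for _ in range(num_edus)] for r in relations}
    for u, r, v in edges:
        adj[r][u][v] = 1
    return adj

# Toy document with 3 EDUs and two (hypothetical) rhetorical relations.
edges = [(0, "elaboration", 1), (1, "contrast", 2)]
A = build_adjacency(3, edges)
print(A["elaboration"][0][1], A["contrast"][1][2], A["elaboration"][1][2])
# expected: 1 1 0
```

Keeping a separate matrix per relation type is what lets the multi-relational network below aggregate neighbors relation by relation.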

Step 2.1.3: Use a multi-relational graph attention network (RGAT) to highlight the information of key neighbors with graph attention while propagating and aggregating neighbor information along the text structure. The process by which the RGAT updates the feature representation of discourse-unit node u is:

First, obtain the relative importance of all neighbors of the node connected through a relation r and assign them different attention weights; then aggregate the neighbor information according to these weights to obtain the node's feature representation under that relation; finally, integrate all relations adjacent to node u to obtain a feature representation containing rich multi-relational structural information;

Step 2.2: Learn contextual representations from the contextual order relation: take the positional adjacency of discourse units within the document as the computational object, and obtain the local context representation of the discourse units with a TextCNN;

The context representation learning module arranges the original feature representations into a feature matrix X^(0) ∈ R^{|U|×m} according to the order of the discourse units in the document, and uses a 1-D TextCNN to convolve over consecutive discourse units (EDUs) with a sliding window w, capturing correlations in local contextual semantics;

Step 3: Generate the document representation: based on a gated recurrent unit fused with a global attention mechanism, fuse all discourse units of the document into a document representation;

Step 4: Detect false information: use the generated document representation for false-information detection, obtaining the probability that the input document is false information.

The specific procedure for building computation module 1 in Step 1 is:

Use the Stanford CoreNLP tool to split the document to be detected into sentences and words, then feed the segmented text into the DPLP tool to obtain all elementary discourse units (EDUs) of the document.
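Stanford CoreNLP and DPLP are external tools and are not reproduced here. Purely as a stand-in to show what EDU segmentation yields, a crude regex clause splitter (an assumption, far simpler than DPLP's trained parser) might look like:

```python
import re

def split_edus(document):
    """Crude stand-in for DPLP EDU segmentation: split into sentences,
    then split each sentence at commas/semicolons that often mark clauses."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    edus = []
    for sentence in sentences:
        for clause in re.split(r"[;,]\s+", sentence):
            clause = clause.strip()
            if clause:
                edus.append(clause)
    return edus

doc = "The claim spread quickly, because it was sensational; few readers checked the source."
print(split_edus(doc))
# expected: ['The claim spread quickly', 'because it was sensational', 'few readers checked the source.']
```

Each returned clause plays the role of one EDU node in the discourse structure graph.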

The specific procedure in Step 2.1.3 for updating the feature representation of discourse-unit node u with the multi-relational graph attention network RGAT is:

Suppose the RGAT consists of l layers, each layer passing its resulting feature matrix to the next. Then the feature representation x_u^(l) of node u at layer l is expressed as:

x_u^(l) = ReLU( Σ_{r∈R′} Σ_{v∈N_u^r} α_uv^r W_r x_v^(l-1) );

where x_v^(l-1) is the feature vector, at layer (l-1), of the neighbor v adjacent to node u through relation r; W_r is the parameter matrix for the specific relation type r, learned and optimized during network training; ReLU is the activation function, and LeakyReLU is used to improve convergence speed and prevent vanishing gradients;

In the above formula, α_uv^r measures the importance, relative to node u, of the neighbor v connected through relation r; α_uv^r is computed as:

α_uv^r = exp(LeakyReLU(a_r^T [W_r x_u ∥ W_r x_v])) / Σ_{k∈N_u^r} exp(LeakyReLU(a_r^T [W_r x_u ∥ W_r x_k]));

In the formula above, the attention parameters are learned during training. To reduce the impact of an excessive number of parameters, bias decomposition is used to reduce the model parameters, yielding the global structural representation of the discourse units X_G ∈ R^{|U|×m′}, where m′ is the dimensionality of the node representations after the update by the multi-relational graph attention network.
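The attention-then-aggregate update described above can be sketched for a single node. The scalar relation weights and the plain dot-product attention score below are simplifications standing in for the matrices W_r and the learned attention parameters; they are assumptions for illustration, not the patented parameterization:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rgat_node_update(u, x, neighbors_by_rel, w_rel):
    """One simplified RGAT-style update for node u: per relation r, weight
    the neighbors with attention, aggregate them, scale by a relation weight
    (a scalar stand-in for W_r), sum over relations, then apply ReLU."""
    dim = len(x[u])
    out = [0.0] * dim
    for r, nbrs in neighbors_by_rel.items():
        # attention score: dot product between node u and each neighbor
        scores = [sum(a * b for a, b in zip(x[u], x[v])) for v in nbrs]
        alphas = softmax(scores)
        for alpha, v in zip(alphas, nbrs):
            for i in range(dim):
                out[i] += alpha * w_rel[r] * x[v][i]
    return [max(0.0, o) for o in out]  # ReLU

x = {0: [1.0, 0.0], 1: [0.5, 0.5], 2: [0.0, 1.0]}
h0 = rgat_node_update(0, x, {"elaboration": [1], "contrast": [2]},
                      {"elaboration": 1.0, "contrast": 1.0})
print(h0)
# expected: [0.5, 1.5]
```

Summing the per-relation aggregates is what lets the node representation carry information from all relation types at once.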

The specific procedure for capturing local contextual semantic correlations in Step 2.2 is: let the TextCNN have a filter w ∈ R^{k×m} with window size k, meaning k EDUs are in the window at the same time, each EDU vector having dimension m;

Define m′ filters w ∈ R^{k×m} and set a padding operation (pad) to prevent data loss during convolution; then apply each filter to the window, sliding from the first EDU to the last. After the m′ convolutions, the local context representation X_C ∈ R^{|U|×m′} of the discourse units is obtained;
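The sliding-window convolution can be illustrated with a single filter over a handful of toy EDU vectors; the zero-padding layout below (chosen so the output keeps one value per EDU) is one plausible reading of the pad operation, not necessarily the patent's:

```python
def conv1d_edus(X, filt):
    """Slide a k-EDU window over the sequence and take the sum of the
    elementwise products with a k x m filter (one TextCNN feature map).
    Zero-pads so the output has exactly one value per EDU."""
    k, m = len(filt), len(filt[0])
    left = k // 2
    padded = [[0.0] * m] * left + X + [[0.0] * m] * (k - 1 - left)
    out = []
    for i in range(len(X)):
        window = padded[i:i + k]
        out.append(sum(filt[j][d] * window[j][d]
                       for j in range(k) for d in range(m)))
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three 2-dim EDU vectors
filt = [[1.0, 1.0], [1.0, 1.0]]            # one k=2 filter
print(conv1d_edus(X, filt))
# expected: [1.0, 2.0, 3.0]
```

Running m′ such filters and stacking their outputs column-wise gives the |U| × m′ matrix X_C described above.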

Concatenate the local context representation X_C with the global structural representation X_G obtained by the structure representation learning module, giving a high-level discourse-unit representation X_GC ∈ R^{|U|×m′} that contains both kinds of text structure information.

The specific procedure for generating the document representation in Step 3 is:

Define a gated recurrent unit fused with a global attention mechanism, and build the document-representation generation module from this unit;

The gated recurrent unit (GRU) sequential model first re-learns the high-level discourse-unit representations of the whole document in top-to-bottom global reading order, expressing word-order information from the global perspective of the document;

Suppose the GRU network has T time steps in total, with one discourse-unit representation input at each step. At time step t, the hidden state h_t of the GRU is computed as:

r_t = σ(W_r · [h_{t-1}, x_t])
z_t = σ(W_z · [h_{t-1}, x_t])
h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, x_t])
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

where x_t is the input at the current time step and h_{t-1} is the hidden-layer output at time t−1; the reset gate r_t indicates how important the hidden-layer memory of the previous moment is to the current memory, and the update gate z_t controls the proportion between the current memory increment h̃_t and the past memory h_{t-1}; W_r, W_z, and W_h are the parameters to be learned;

After T time steps, the hidden states of all time steps are obtained: h_1, ..., h_T;

Global attention is used to fuse all hidden states and generate the document representation. Global attention considers the importance weight of every hidden state: for the s-th hidden state, its weight a_t(s) within the document is computed from the state h_s at time s and the state h_T output at the final time step T, as follows:

a_t(s) = exp(h_s^T h_T) / Σ_{s′=1}^{T} exp(h_{s′}^T h_T);

After weighting, the document representation is obtained as:

Z = Σ_{s=1}^{T} a_t(s) h_s;
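The global-attention fusion of Step 3 can be sketched as follows; the dot-product score between h_s and h_T is an assumed score function, since the patent gives the weight formula only as an image:

```python
import math

def global_attention_fuse(H):
    """Weight each hidden state h_s by its softmaxed similarity to the
    final state h_T, then return the weights and the weighted sum Z."""
    hT = H[-1]
    scores = [sum(a * b for a, b in zip(h, hT)) for h in H]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(hT)
    Z = [sum(al * h[d] for al, h in zip(alphas, H)) for d in range(dim)]
    return alphas, Z

H = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # hidden states h_1..h_T
alphas, Z = global_attention_fuse(H)
print(round(sum(alphas), 6), len(Z))
# expected: 1.0 2
```

The softmax guarantees the weights sum to one, so Z is a convex combination of the hidden states rather than just the last state.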

The specific procedure for detecting false information in Step 4 is:

Using a fully connected layer with a softmax activation function, map the representation Z of the document under test to the probability that it is false information; the probability is computed as:

p̂ = softmax(W_f Z + b_f);

where p̂ denotes the predicted label probability that the news is true or false, W_f is the weight matrix, and b_f is the bias term;

Define the cross-entropy loss function as:

L(θ) = −[Y log p̂ + (1 − Y) log(1 − p̂)];

where θ denotes the parameters of the entire algorithm network and Y ∈ {0, 1} is the ground-truth label.
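Step 4's classification head and loss can be sketched in a few lines; the two-way output layout (index 1 taken as "false information") and the toy weights are assumptions for illustration only:

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def detect(Z, Wf, bf):
    """Fully connected layer + softmax: returns [P(real), P(fake)]."""
    logits = [sum(w * z for w, z in zip(row, Z)) + b
              for row, b in zip(Wf, bf)]
    return softmax(logits)

def cross_entropy(p_fake, Y):
    """Binary cross-entropy against the true label Y in {0, 1}."""
    return -(Y * math.log(p_fake) + (1 - Y) * math.log(1.0 - p_fake))

Z = [0.2, -0.4, 0.7]                        # toy document representation
Wf = [[0.1, 0.3, -0.2], [-0.1, 0.2, 0.4]]   # one weight row per class
bf = [0.0, 0.0]
probs = detect(Z, Wf, bf)
loss = cross_entropy(probs[1], Y=1)
print(round(sum(probs), 6), loss > 0)
# expected: 1.0 True
```

During training, the loss would be minimized over θ (here W_f and b_f, plus all upstream parameters) with gradient descent.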

Compared with the prior art, the present invention has the following beneficial effects:

1. The invention relies only on text content to detect false information effectively. The method is direct and convenient, data collection is easy, and false information is judged directly without auxiliary information, which favors timely judgment of veracity in the early stage of an outbreak; false information can thus be contained at its source, minimizing the losses it causes;

2. The invention explores, from the perspective of text structure, the implicit functional structural relations and contextual order relations inside a document, and uses these structural properties to improve the document representation; such structure is hard for information forgers to discover and counter, giving the method security and reliability. The invention relies only on the information-carrying text and offers early recognition with low complexity. At the same time, by taking the discourse units of the text as the computational object, it avoids the noise problem of word-based document representations and the insufficiency of sentence-based document representations, improving the effectiveness of automatic false-information detection based on text content;

3. The detection method proposed by the invention is advanced, stable, and practical; it has a wide range of applications, high accuracy in identifying false information, and strong generalization ability, achieving an F1 improvement of 8.55% on public datasets. Based on text structure features, the invention realizes a deep learning model with high generalization performance for early detection of false information.

Description of the Drawings

The invention is described further below with reference to the accompanying drawings:

Fig. 1 is the overall flow chart of the false-information detection algorithm of the invention;

Fig. 2 is the overall model diagram of the false-information detection algorithm of the invention.

Detailed Description

As shown in Figs. 1 and 2, since the purpose of the invention is to detect whether information is true or false, the specific task can be summarized as the binary classification problem of deciding whether a document to be detected is false information. An embodiment of the present invention is as follows:

Build computation module 1 for acquiring document discourse units. This module segments the document to be detected to obtain its elementary discourse units.

The invention takes discourse units as the computational object, avoiding the problems that arise during model learning when words or sentences are used. A scheme is therefore first designed to obtain the elementary discourse units. An elementary discourse unit (EDU) is the basic linguistic unit of a document, generally a clause and, at the shortest, a phrase. According to Rhetorical Structure Theory (RST), the relations between these discourse units are determined, for example conditional, explanatory, and evidential relations. To divide EDUs accurately, the invention follows the pre-training method DPLP (Discourse Parsing from Linear Projection) proposed by Ji et al., which uses gold EDU segmentation. The invention segments the document to be detected to obtain its elementary discourse units. The process is as follows: first, the Stanford CoreNLP tool splits the document to be detected into sentences and words; then the segmented text is fed into the DPLP tool to obtain all the document's EDUs.

Build computation module 2 for representation learning of discourse units. The process comprises two steps: structural representation learning and contextual representation learning.

Structural features can reflect the latent writing style and intent of false information and describe linguistic properties of the text such as logic, coherence, and complexity. The invention therefore realizes early detection of false information on the basis of text structure. The text structure used includes the discourse rhetorical structure and the contextual order relation, corresponding respectively to the structural representation learning and contextual representation learning of discourse units:

Step 1. Structural representation learning. Construct the document discourse structure graph based on Rhetorical Structure Theory (RST), and obtain the global structural representation of the discourse units through a multi-relational graph neural network.

Rhetorical Structure Theory is the foundational theory for studying discourse structure; it can capture the coherence of a story from the functional relations between different discourse units. The theory uses rhetorical relations with precise meanings to build two tree structures: the RST structure tree and the RST dependency tree. Because the RST structure tree cannot effectively express the direct relations between discourse units, the invention uses the RST discourse parser proposed by Li et al. to generate a dependency structure and construct the document discourse structure graph. Based on the dependency graph structure, the rhetorical relations between discourse units can be represented directly, without the complex hierarchical relations of the structure tree, and ambiguity is less likely. The constructed document discourse structure graph is as follows:

The discourse structure graph of the document D_i to be detected is G_i = (V, E), where the node set is:

V = {EDU_1, EDU_2, ..., EDU_|U|},

representing the |U| nodes of the graph, each being an elementary discourse unit (EDU). Discourse units are connected pairwise according to specific rhetorical relations; the rhetorical relations belong to the set R′ (containing attribution, purpose, contrast, and other relations). The edge set E is the set of links between discourse units: E = {(EDU_u, r, EDU_v) | EDU_u ∈ V, EDU_v ∈ V, r ∈ R′}.

Then define the adjacency matrix and feature matrix of the discourse structure graph for computing and learning the structural representation. The adjacency matrix describes the topology between nodes; the feature matrix describes the node features. Define the adjacency matrix A ∈ R^{|U|×|U|} as:

A_uv = 1 if (EDU_u, r, EDU_v) ∈ E for some r ∈ R′, and A_uv = 0 otherwise.

The feature matrix is defined as X^(0) ∈ R^{|U|×m}. The vector of an arbitrary discourse unit node u is denoted x_u ∈ R^m. Assuming |R′| types of edges are connected to it, the set of all neighbors of node u under any single edge type r is

N_r(u) = {v | (EDU_u, r, EDU_v) ∈ E}.
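As a minimal sketch of the data structures defined above, the typed edge list, adjacency matrix, and per-relation neighbor sets might be built as follows (the relation names and the toy edge list are invented for illustration, not taken from the patent):

```python
import numpy as np

# Hypothetical mini-document: 4 EDUs linked by typed rhetorical relations.
edges = [
    (0, "elaboration", 1),
    (0, "attribution", 2),
    (2, "contrast", 3),
]
num_edus = 4

# Adjacency matrix A in R^{|U| x |U|}: A[u, v] = 1 iff some (EDU_u, r, EDU_v) is in E.
A = np.zeros((num_edus, num_edus), dtype=int)
for u, _, v in edges:
    A[u, v] = 1

# Per-relation neighbor sets N_r(u) = {v | (EDU_u, r, EDU_v) in E}.
neighbors = {}
for u, r, v in edges:
    neighbors.setdefault((u, r), set()).add(v)
```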

The multi-relational graph attention network (RGAT) uses graph attention to highlight the information of key neighbors while propagating and aggregating neighbor information according to the text structure. Compared with the graph attention network (GAT), RGAT has an advantage in handling multi-relational data: GAT only attends to the attention differences among a node's local neighbors and cannot simultaneously account for the homogeneity of information from neighbors connected by the same relation and the global differences among information from different relations. RGAT updates the feature representation of a discourse unit node u as follows: first, the relative importance of all neighbors connected to the node by a relation r is obtained, and different attention weights are assigned to them; the neighbor information is then aggregated according to these weights to obtain the feature representation of node u under that relation; finally, all relations adjacent to node u are integrated to obtain a feature representation containing rich multi-relational structure information.

Assuming the RGAT consists of l layers, each layer passes its resulting feature matrix to the next. The feature representation x_u^(l) of node u at layer l can then be written as:

x_u^(l) = ReLU( Σ_{r∈R′} Σ_{v∈N_r(u)} α_{uv}^r · W_r · x_v^(l−1) )

where x_v^(l−1) is the feature vector of node u's neighbor v under relation r at layer (l−1); W_r is the parameter matrix of the specific relation type r, learned and optimized during network training; and ReLU is the activation function (LeakyReLU is used here to speed up convergence and prevent vanishing gradients). The coefficient α_{uv}^r measures the importance, relative to node u, of the neighbor v connected under relation r, and is computed as:

α_{uv}^r = exp( LeakyReLU( a_r^T [W_r x_u ∥ W_r x_v] ) ) / Σ_{v′∈N_r(u)} exp( LeakyReLU( a_r^T [W_r x_u ∥ W_r x_v′] ) )

where a_r is the learnable attention vector for relation type r. To reduce the impact of an excessive number of parameters, basis decomposition can be used to shrink the model parameters, yielding the global structure representation of the discourse units X_G ∈ R^{|U|×m′}, where m′ is the dimensionality of the node representations after the multi-relational graph attention network update.
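The RGAT update described above can be sketched in NumPy as a single layer. All dimensions, relation types, neighbor sets, and random weights here are illustrative assumptions, and the GAT-style attention score is one standard choice; a real implementation would use a trained graph library layer:

```python
import numpy as np

rng = np.random.default_rng(0)
num_edus, m, m_prime = 4, 8, 6
relations = ["elaboration", "attribution"]          # illustrative relation types

X = rng.normal(size=(num_edus, m))                   # X^(0): initial EDU features
W = {r: rng.normal(size=(m, m_prime)) for r in relations}  # W_r per relation
a = {r: rng.normal(size=2 * m_prime) for r in relations}   # a_r attention vectors
neighbors = {("elaboration", 0): [1, 2], ("attribution", 2): [3]}

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

X_G = np.zeros((num_edus, m_prime))
for u in range(num_edus):
    agg = np.zeros(m_prime)
    for r in relations:
        nbrs = neighbors.get((r, u), [])
        if not nbrs:
            continue
        # e_uv = LeakyReLU(a_r^T [W_r x_u || W_r x_v]), softmax-normalized over N_r(u)
        e = np.array([leaky_relu(a[r] @ np.concatenate([X[u] @ W[r], X[v] @ W[r]]))
                      for v in nbrs])
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()
        # weighted aggregation of transformed neighbor features
        agg += sum(w * (X[v] @ W[r]) for w, v in zip(alpha, nbrs))
    X_G[u] = np.maximum(agg, 0)                      # final ReLU of the update rule
```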

Step two: context representation learning. Taking the positional adjacency of discourse units within the document as the object of computation, the local context representation of the discourse units is obtained with a TextCNN.

Context is the key to understanding semantics. How well a discourse unit is understood may depend on the meaning of the previous discourse unit, and in turn affects the understanding of the next one. Adjacent discourse units usually stand in some logical relation, and swapping their order may cause ambiguity. Whether a document is coherent in its language and sound in its logic is reflected in the continuity of adjacent discourse units. The present invention considers the local context relations between consecutive adjacent discourse units in an article and uses a fixed-size sliding window to model the coherence and ordering of locally co-occurring discourse units.

The context representation learning module arranges the original feature representations of the discourse units into the feature matrix X^(0) ∈ R^{|U|×m} according to their order in the document. The present invention uses a 1-D TextCNN, capturing the local contextual semantic correlations by convolving a sliding window w over consecutive discourse units (EDUs). The procedure is as follows: suppose the TextCNN has a filter w ∈ R^{k×m} with window size k, meaning that k EDUs are in the window at the same time. Define m′ such filters, and set a padding operation pad (defaulting to 1) to prevent data loss during convolution. Each filter is then applied to the window and slid from the first EDU to the last; after the m′ convolutions, the local context representation of the discourse units X_C ∈ R^{|U|×m′} is obtained.
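A minimal NumPy sketch of this sliding-window convolution over the EDU sequence (dimensions and random filters are illustrative; with window k = 3, padding of 1 on each side keeps |U| output rows):

```python
import numpy as np

rng = np.random.default_rng(1)
num_edus, m, m_prime, k = 6, 8, 5, 3        # |U| EDUs, feature dim m, m' filters, window k

X0 = rng.normal(size=(num_edus, m))          # EDU features in document order
filters = rng.normal(size=(m_prime, k, m))   # m' filters, each w in R^{k x m}

pad = (k - 1) // 2                           # pad = 1 for k = 3, keeping |U| rows
Xp = np.pad(X0, ((pad, pad), (0, 0)))

# Slide each filter over k consecutive EDUs; one output column per filter.
X_C = np.zeros((num_edus, m_prime))
for j in range(m_prime):
    for i in range(num_edus):
        X_C[i, j] = np.sum(Xp[i:i + k] * filters[j])
```

Concatenating this X_C with the global structure representation X_G along the feature axis (e.g. `np.concatenate([X_G, X_C], axis=1)`) then gives the high-level representation X_GC.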

This local context representation X_C is concatenated with the global structure representation X_G obtained by the structure representation learning module, yielding a high-level representation of the discourse units X_GC ∈ R^{|U|×m′} that contains both kinds of text structure information.

Module three: document representation generation. A gated recurrent unit fused with a global attention mechanism (GRU-GlobalAttention) is designed to fuse all the discourse units of a document into a document representation.

The document representation generation module consists of a gated recurrent unit fused with a global attention mechanism. The gated recurrent unit (GRU) sequential model first re-learns the high-level discourse unit representations of the whole document in top-to-bottom global reading order, expressing word-order information from the global perspective of the document. Because the network has an update gate and a reset gate, it retains historical information well and prevents information loss during learning. Assuming the GRU network has T time steps in total and one discourse unit representation is input at each time step, the hidden state h_t of the GRU at time step t is computed as follows:

r_t = σ(W_r · [h_{t−1}, x_t])
z_t = σ(W_z · [h_{t−1}, x_t])
h̃_t = tanh(W_h · [r_t ⊙ h_{t−1}, x_t])
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

where x_t is the input at the current time step and h_{t−1} is the hidden-layer output at time t−1. In these formulas the reset gate r_t indicates how important the previous hidden-layer memory is to the current memory, and the update gate z_t sets the proportion between the current memory increment h̃_t and the past memory h_{t−1}. W_r, W_z, and W_h are parameters to be learned. After T time steps, the hidden states h_1, ..., h_T of all time steps are obtained. Since not all hidden states are equally important to the final document representation, the present invention uses global attention to fuse all hidden states and generate the document representation.
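The GRU recurrence can be sketched as follows (a plain NumPy cell with randomly initialized, untrained weights; all dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
T, d_in, d_h = 5, 6, 4                  # T time steps, one EDU representation per step

X = rng.normal(size=(T, d_in))
Wr, Wz, Wh = (rng.normal(size=(d_h, d_in + d_h), scale=0.1) for _ in range(3))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h = np.zeros(d_h)
hidden = []
for t in range(T):
    concat = np.concatenate([h, X[t]])
    r = sigmoid(Wr @ concat)                               # reset gate r_t
    z = sigmoid(Wz @ concat)                               # update gate z_t
    h_tilde = np.tanh(Wh @ np.concatenate([r * h, X[t]]))  # candidate memory
    h = (1 - z) * h + z * h_tilde                          # blend past memory and increment
    hidden.append(h)
hidden = np.stack(hidden)               # hidden states h_1 ... h_T
```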

Global attention considers the importance weights of all hidden states. For the s-th hidden state, its weight a_t(s) in the document is computed from the state h_s at time s and the state h_T output at the final time step T, as follows:

a_t(s) = exp( score(h_s, h_T) ) / Σ_{s′=1}^{T} exp( score(h_{s′}, h_T) )

After weighting, the document representation is obtained as:

Z = Σ_{s=1}^{T} a_t(s) · h_s
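A sketch of the global-attention fusion, assuming a dot-product score between each hidden state and the final state h_T (the exact score function is an assumption, since the source does not spell it out):

```python
import numpy as np

rng = np.random.default_rng(3)
T, d_h = 5, 4
hidden = rng.normal(size=(T, d_h))     # hidden states h_1 ... h_T from the GRU

# Dot-product score of each h_s against the final state h_T, softmax-normalized.
scores = hidden @ hidden[-1]
a = np.exp(scores - scores.max())
a /= a.sum()

Z = a @ hidden                         # weighted sum -> document representation Z
```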

Module four: false information detection. The generated document representation is used for false information detection, producing the probability that the input document is false information.

The present invention uses a fully connected layer with a softmax activation function to map the representation Z of the document under test to the probability that it is false information.

ŷ = softmax(W_f · Z + b_f)

where ŷ is the predicted label probability that the news is true or false, W_f is the weight matrix, and b_f is the bias term. The cross-entropy loss function is defined as:

L(θ) = −[ Y log ŷ + (1 − Y) log(1 − ŷ) ]

where θ denotes the parameters of the whole algorithm network and Y ∈ {0, 1} is the ground-truth label.
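The classification head and loss can be sketched as follows (weights are random and untrained; treating the two output classes as real vs. fake is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(4)
d_h = 4
Z = rng.normal(size=d_h)               # document representation from module three

Wf = rng.normal(size=(2, d_h))         # fully connected layer, two classes
bf = np.zeros(2)

logits = Wf @ Z + bf
y_hat = np.exp(logits - logits.max())
y_hat /= y_hat.sum()                   # softmax probability vector

Y = 1                                  # ground-truth label in {0, 1}
loss = -np.log(y_hat[Y])               # cross-entropy for the true class
```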

The present invention takes the minimal discourse units of a document as the objects of computation and achieves effective early detection of false information based on the structural information of the text. A document discourse structure graph is constructed based on Rhetorical Structure Theory (RST), and the global structure representation of the discourse units is obtained through a multi-relational graph neural network. Taking the positional adjacency of discourse units within the document as the object of computation, the local context representation of the discourse units is obtained with a TextCNN, and by designing a gated recurrent unit fused with a global attention mechanism, all the discourse units of the document are fused into a document representation.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, and some or all of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A false information early detection method based on a text structure algorithm, characterized in that the method comprises the following detection steps:
step one: establishing a first calculation module for acquiring the discourse units of the document, the first calculation module segmenting the document to be detected to obtain the minimal elementary discourse units (EDUs) of the document;
step two: establishing a second calculation module for representation learning of the discourse units;
step 2.1: carrying out structure representation learning of the discourse units based on the rhetorical structure, constructing a document discourse structure graph based on Rhetorical Structure Theory (RST), and obtaining the global structure representation of the discourse units through a multi-relational graph neural network;
constructing the document discourse structure graph with a dependency structure:
step 2.1.1: letting the discourse structure graph of the document D_i to be detected be G_i = (V, E), where the node set, expressed as
V = {EDU_1, EDU_2, ..., EDU_|U|},
represents the |U| nodes in the graph, each node being a minimal elementary discourse unit EDU;
defining the links between every two discourse units according to specific rhetorical relations, wherein each rhetorical relation belongs to a set R′, and the edge set E of links between discourse units is expressed as:
E = {(EDU_u, r, EDU_v) | EDU_u ∈ V, EDU_v ∈ V, r ∈ R′};
step 2.1.2: defining an adjacency matrix and a feature matrix of the discourse structure graph for computing and learning the structure representation, wherein the adjacency matrix describes the topological structure between nodes and the feature matrix describes the feature representation of the nodes;
defining the adjacency matrix as A ∈ R^{|U|×|U|}, with any element A_uv of the matrix expressed as:
A_uv = 1 if (EDU_u, r, EDU_v) ∈ E, and A_uv = 0 otherwise;
defining the feature matrix as:
X^(0) ∈ R^{|U|×m},
in which the vector of any discourse unit node u is denoted x_u ∈ R^m, m being the dimension of the discourse unit vector, and, assuming that edges of |R′| types are connected to it, the neighbor set of node u based on any edge type r is denoted
N_r(u) = {v | (EDU_u, r, EDU_v) ∈ E};
step 2.1.3: adopting the multi-relational graph attention network RGAT to highlight the information of key neighbors with graph attention while propagating and aggregating neighbor information according to the text structure, the process of updating the feature representation of a discourse unit node u with the RGAT being as follows:
first obtaining the relative importance of all neighbors connected to the node by a relation r and assigning them different attention weights, then aggregating the neighbor information according to the weights to obtain the feature representation of node u under that relation, and finally integrating all relations adjacent to node u to obtain a feature representation containing rich multi-relational structure information;
step 2.2: performing context representation learning based on the contextual order relation, taking the positional adjacency of discourse units within the document as the object of computation, and obtaining the local context representation of the discourse units based on a TextCNN;
the context representation learning module arranging the original feature representations into a feature matrix X^(0) ∈ R^{|U|×m} according to the order of the discourse units in the document, and capturing the local contextual semantic correlations by convolving a 1-D TextCNN over consecutive discourse units (EDUs) with a sliding window w;
step three: generating a document representation, fusing all discourse units of the document into the document representation based on a gated recurrent unit fused with a global attention mechanism;
step four: detecting false information, using the generated document representation for false information detection to obtain the probability that the input document is false information.
2. The false information early detection method based on a text structure algorithm according to claim 1, characterized in that the specific process of establishing the first calculation module in step one is:
performing sentence segmentation and word segmentation on the document to be detected with the StanfordCoreNLP tool, and inputting the segmented text into the DPLP tool to obtain all elementary discourse units (EDUs) of the document.
3. The false information early detection method based on a text structure algorithm according to claim 2, characterized in that the specific process of updating the feature representation of the discourse unit node u with the multi-relational graph attention network RGAT in step 2.1.3 is:
if the RGAT is composed of l layers, each layer passing its resulting feature matrix to the next, the feature representation x_u^(l) of node u at layer l is expressed as:
x_u^(l) = ReLU( Σ_{r∈R′} Σ_{v∈N_r(u)} α_{uv}^r · W_r · x_v^(l−1) )
in which x_v^(l−1) is the feature vector of node u's neighbor v under relation r at layer (l−1); W_r is the parameter matrix of the specific relation type r, learned and optimized during network training; ReLU is the activation function, LeakyReLU being used to increase convergence speed and prevent vanishing gradients;
in the above formula α_{uv}^r is used to measure the importance, relative to node u, of the neighbor v connected under relation r, and is calculated as:
α_{uv}^r = exp( LeakyReLU( a_r^T [W_r x_u ∥ W_r x_v] ) ) / Σ_{v′∈N_r(u)} exp( LeakyReLU( a_r^T [W_r x_u ∥ W_r x_v′] ) )
in which a_r is the learnable attention vector for relation type r; in order to reduce the influence of excessive parameters, basis decomposition is used to reduce the model parameters, obtaining the global structure representation of the discourse units X_G ∈ R^{|U|×m′}, where m′ is the dimension of the node representations after the multi-relational graph attention network update.
4. The method according to claim 3, characterized in that the specific process of capturing the local contextual semantic correlations in step 2.2 is:
letting the TextCNN have a filter w ∈ R^{k×m} with window size k, indicating that k EDUs are in the window at the same time, the vector dimension of each EDU being m;
defining m′ such filters w ∈ R^{k×m} and setting a padding operation pad to prevent data loss during convolution, then applying each filter to the window, sliding from the first EDU to the last in turn, and obtaining the local context representation of the discourse units X_C ∈ R^{|U|×m′} after the m′ convolutions;
concatenating the local context representation X_C with the global structure representation X_G obtained by the structure representation learning module to obtain the high-level representation of the discourse units X_GC ∈ R^{|U|×m′} containing both kinds of text structure information.
5. The method according to claim 4, characterized in that step three, generating the document representation, specifically comprises:
defining a gated recurrent unit fused with a global attention mechanism, and forming the document representation generation module from this unit;
the gated recurrent unit (GRU) sequence model first re-learning the high-level representations of the discourse units of the whole document in top-to-bottom global reading order, expressing word-order information from the global perspective of the document;
setting the GRU network to have T time steps, one discourse unit representation being input at each time step, the hidden state h_t of the GRU at time step t being calculated as follows:
r_t = σ(W_r · [h_{t−1}, x_t])
z_t = σ(W_z · [h_{t−1}, x_t])
h̃_t = tanh(W_h · [r_t ⊙ h_{t−1}, x_t])
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t
in which x_t is the input at the current time step and h_{t−1} is the hidden-layer output at time t−1; the reset gate r_t indicates how important the previous hidden-layer memory is to the current memory; the update gate z_t sets the proportion between the current memory increment h̃_t and the past memory h_{t−1}; W_r, W_z, and W_h are parameters to be learned;
after T time steps, the hidden states h_1, ..., h_T of all time steps are obtained;
fusing all hidden states and generating the document representation with global attention, which considers the importance weights of all hidden states: for the s-th hidden state, its weight a_t(s) in the document is calculated from the state h_s at time s and the state h_T output at the final time step T according to the following formula:
a_t(s) = exp( score(h_s, h_T) ) / Σ_{s′=1}^{T} exp( score(h_{s′}, h_T) )
the weighted document representation being:
Z = Σ_{s=1}^{T} a_t(s) · h_s.
6. The method according to claim 5, characterized in that the specific process of detecting false information in step four is:
using a fully connected layer with a softmax activation function to map the representation Z of the document under test to the probability that it is false information, the probability being calculated as:
ŷ = softmax(W_f · Z + b_f)
in which ŷ is the predicted label probability that the news is true or false, W_f is the weight, and b_f is the bias term;
defining the cross-entropy loss function as:
L(θ) = −[ Y log ŷ + (1 − Y) log(1 − ŷ) ]
where θ denotes the parameters of the whole algorithm network and Y ∈ {0, 1} is the ground-truth label.
CN202011632799.8A 2020-12-31 2020-12-31 False information early detection method based on text structure algorithm Active CN112699662B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011632799.8A CN112699662B (en) 2020-12-31 2020-12-31 False information early detection method based on text structure algorithm


Publications (2)

Publication Number Publication Date
CN112699662A CN112699662A (en) 2021-04-23
CN112699662B true CN112699662B (en) 2022-08-16

Family

ID=75513614


Country Status (1)

Country Link
CN (1) CN112699662B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076720B (en) * 2021-04-29 2022-01-28 新声科技(深圳)有限公司 Long text segmentation method and device, storage medium and electronic device
CN113158082B (en) * 2021-05-13 2023-01-17 和鸿广科技(上海)有限公司 Artificial intelligence-based media content reality degree analysis method
CN113761154A (en) * 2021-05-25 2021-12-07 腾讯科技(深圳)有限公司 Intelligent question answering method, device, equipment and computer readable storage medium
CN113672731B (en) * 2021-08-02 2024-02-23 北京中科闻歌科技股份有限公司 Emotion analysis method, device, equipment and storage medium based on field information
CN114880428B (en) * 2022-03-07 2022-11-18 中国人民解放军国防科技大学 A Discourse Component Recognition Method Based on Graph Neural Network
CN115081524A (en) * 2022-06-15 2022-09-20 杭州鲁尔物联科技有限公司 Sensor false data detection method and device, computer equipment and storage medium
CN116737874A (en) * 2023-05-31 2023-09-12 上海工程技术大学 Industrial text emotion analysis method based on asymmetric position weighting strategy

Citations (6)

Publication number Priority date Publication date Assignee Title
CN108984526A (en) * 2018-07-10 2018-12-11 北京理工大学 A kind of document subject matter vector abstracting method based on deep learning
CN110472045A (en) * 2019-07-11 2019-11-19 中山大学 A kind of short text falseness Question Classification prediction technique and device based on document insertion
CN110870020A (en) * 2017-10-16 2020-03-06 因美纳有限公司 Aberrant splicing detection using Convolutional Neural Network (CNNS)
CN111340509A (en) * 2020-05-22 2020-06-26 支付宝(杭州)信息技术有限公司 False transaction identification method, device and electronic device
CN111488739A (en) * 2020-03-17 2020-08-04 天津大学 An Implicit Discourse Recognition Approach Based on Multi-granularity Generated Image Enhancement Representation
CN112035669A (en) * 2020-09-09 2020-12-04 中国科学技术大学 Social media multi-modal rumor detection method based on propagation heterogeneous graph modeling


Non-Patent Citations (2)

Title
"Natural Language COntents Evaluation System for Detecting Fake News using deep learning";Ye-Chan Ahn 等;《16th ICSSE》;20190712;289-292 *
"SemSeq4FD:Integrating global semantic relationship and local sequential order to enhance text representation for fake news detection";Yuhang Wang 等;《Expert Systems with Applications》;20201003;1-12 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant