WO2024066041A1 - Electronic letter of guarantee automatic generation method and apparatus based on sequence adversary and priori reasoning - Google Patents
Electronic letter of guarantee automatic generation method and apparatus based on sequence adversary and priori reasoning
- Publication number
- WO2024066041A1 (application PCT/CN2022/137058)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- guarantee
- electronic
- generator
- sequence
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
Definitions
- the present invention relates to the field of electronic letters of guarantee, and in particular to a method and device for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning.
- a guarantee letter refers to a written credit guarantee certificate issued by a bank, insurance company, guarantee company or individual to a third party at the request of an applicant.
- tracking the validity period of a guarantee letter and querying its classified information still require manual intervention.
- the traditional electronic guarantee letter changes only the form in which the guarantee is presented.
- the management of the guarantee letter and its related business information differs little from the original offline paper guarantee letter.
- the entire guarantee issuance process consumes substantial time and labor, making it inefficient.
- the embodiments of the present invention provide a method and device for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning, so as to at least solve the technical problem of low efficiency in issuing existing electronic letters of guarantee.
- a method for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning, comprising the following steps:
- S101 Input the acquired initial training data into the generator G in the letter of guarantee generation model;
- S102 The generator G receives the whole-sentence input text in the initial training data through its self-attention encoder and outputs the encoder latent vector, which is then fed into the decoder in the generator G;
- S103 In the decoding stage, the decoder predicts the next word from each word that is input, and finally outputs a complete sentence for the electronic letter of guarantee.
- after step S103, the method further includes:
- S104 The generator G inputs the output result to the discriminator D in the guarantee generation model to train the guarantee generation model.
- step S104 specifically includes:
- the discriminator D adopts the ELMo model.
- the ELMo model receives the serialized text generated by the generator G and outputs a vector whose dimension equals the number of existing topic types.
- the probability of each topic type is calculated through Softmax, and the topic type with the highest probability is selected and compared with the actual topic-type label.
- the discriminator D and the generator G are updated simultaneously through cross-entropy loss back-propagation, training the guarantee generation model.
- when calculating the loss function, the discriminator D adds a penalty term indicating whether the sentence to be judged contains the a priori high-frequency words; if it does, the discriminator further checks whether they appear in the expected order and imposes a penalty proportional to the difference in order, assigning a higher loss to generated sentences that do not conform to the empirical rules.
- the generator G uses the ALBERT model, and the input is a text sentence and its topic-type label;
- the ALBERT model is configured as a lightweight, optimized version of the classic NLP model BERT, which reduces the number of model parameters by linearly decoupling the input matrix, i.e., factorizing the input matrix of the original BERT model into the product of two smaller matrices.
- step S102 specifically includes:
- the generator G first uses pre-trained word vectors to convert the words in the text sentence, according to the word-segmentation results, into semantic numerical vectors, which are then fed into the ALBERT model; the model receives the whole-sentence input text through its self-attention encoder and outputs the encoder latent vector, which is then fed into its decoder.
- before step S101, the method further includes:
- S100 Based on experience in construction-project safety risk management, sort out the topic types of current electronic guarantees for construction projects; collect existing electronic guarantee documents of the different topic types, extract their text and pre-process it using open-source Chinese natural language processing tools, and annotate each text with the topic category to which it belongs, as the initial training data.
- an apparatus for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning, comprising:
- a data input unit, used to input the acquired initial training data into the generator G in the letter of guarantee generation model;
- a data receiving unit, used for the generator G to receive the whole-sentence input text in the initial training data through its self-attention encoder and to output the encoder latent vector, which is then fed into the decoder in the generator G;
- a decoding unit, used for the decoder to predict, in the decoding stage, the next word from each word that is input, and finally to output a complete sentence for the electronic letter of guarantee.
- the device also includes:
- the training unit is used for the generator G to input the output result into the discriminator D in the guarantee generation model to train the guarantee generation model.
- the device also includes:
- the data acquisition unit is used to sort out the subject types of current electronic guarantees for construction projects based on the experience of safety risk management of construction projects; collect existing electronic guarantee documents of different subject types, extract the text of the electronic guarantee documents and use open source Chinese natural language processing tools for pre-processing, and annotate the text according to the subject category to which it belongs as initial training data.
- a storage medium storing a program file capable of implementing any one of the above-mentioned methods for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning.
- a processor is used to run a program, wherein, when the program runs, any one of the above-mentioned methods for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning is executed.
- the method and device for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning in the embodiments of the present invention build on a sequence generative adversarial network and the prior knowledge in a knowledge graph; by learning the form, style, and structure of electronic guarantee texts written in natural language, they realize automatic generation of a letter of guarantee for a given insurance type, saving insurance managers time.
- FIG. 1 is a flow chart of the method for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning according to the present invention;
- FIG. 2 is a preferred flow chart of the method for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning according to the present invention;
- FIG. 3 is another preferred flow chart of the method for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning according to the present invention;
- FIG. 4 is a module diagram of the apparatus for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning according to the present invention;
- FIG. 5 is a preferred module diagram of the apparatus for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning according to the present invention;
- FIG. 6 is another preferred module diagram of the apparatus for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning according to the present invention.
- a method for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning is provided, as shown in FIG. 1, and includes the following steps:
- S101 Input the acquired initial training data into the generator G in the letter of guarantee generation model;
- S102 The generator G receives the whole-sentence input text in the initial training data through its self-attention encoder and outputs the encoder latent vector, which is then fed into the decoder in the generator G;
- S103 In the decoding stage, the decoder predicts the next word from each word that is input, and finally outputs a complete sentence for the electronic letter of guarantee.
- the method for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning in the embodiment of the present invention builds on a sequence generative adversarial network and the prior knowledge in a knowledge graph; by learning the form, style, and structure of electronic guarantee texts written in natural language, it realizes automatic generation of a letter of guarantee for a given insurance type, saving insurance managers time.
- after step S103, referring to FIG. 2, the method further includes:
- S104 The generator G inputs the output result to the discriminator D in the guarantee generation model to train the guarantee generation model.
- step S104 specifically includes:
- the discriminator D adopts the ELMo model.
- the ELMo model receives the serialized text generated by the generator G and outputs a vector whose dimension equals the number of existing topic types.
- the probability of each topic type is calculated through Softmax, and the topic type with the highest probability is selected and compared with the actual topic-type label.
- the discriminator D and the generator G are updated simultaneously through cross-entropy loss back-propagation, training the guarantee generation model.
- when calculating the loss function, the discriminator D adds a penalty term indicating whether the sentence to be judged contains the a priori high-frequency words; if it does, the discriminator further checks whether they appear in the expected order and imposes a penalty proportional to the difference in order, assigning a higher loss to generated sentences that do not conform to the empirical rules.
- the generator G uses the ALBERT model, and the input is a text sentence and its topic-type label; the ALBERT model is configured as a lightweight, optimized version of the classic NLP model BERT, which reduces the number of model parameters by linearly decoupling the input matrix, i.e., factorizing the input matrix of the original BERT model into the product of two smaller matrices.
- step S102 specifically includes:
- the generator G first uses pre-trained word vectors to convert the words in the text sentence, according to the word-segmentation results, into semantic numerical vectors, which are then fed into the ALBERT model; the model receives the whole-sentence input text through its self-attention encoder and outputs the encoder latent vector, which is then fed into its decoder.
- before step S101, referring to FIG. 3, the method further includes:
- S100 Based on experience in construction-project safety risk management, sort out the topic types of current electronic guarantees for construction projects; collect existing electronic guarantee documents of the different topic types, extract their text and pre-process it using open-source Chinese natural language processing tools, and annotate each text with the topic category to which it belongs, as the initial training data.
- the method for automatically generating electronic letters of guarantee for construction projects based on sequence adversarial and a priori reasoning builds on a sequence generative adversarial network and the prior knowledge in a knowledge graph; by learning the form, style, and structure of electronic guarantee texts written in natural language, it realizes automatic guarantee generation for a given insurance type (i.e., topic), saving insurance managers time. It specifically includes the following steps:
- based on experience in construction-project safety risk management, the topic types of current electronic guarantees for construction projects are sorted out. For example, by severity they can be divided into death, serious injury, and minor injury, and by accident type into falling from height, being struck by objects, electric shock, etc.; existing electronic guarantee documents of the different topic types are collected, their text is extracted and pre-processed using open-source Chinese natural language processing tools such as HanLP and jieba.
- the pre-processing mainly includes sentence segmentation, word segmentation, and stop-word removal.
- each text is annotated with the topic category to which it belongs, forming the initial training data.
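The S100 pipeline above can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the patent's implementation: jieba stands in for the unnamed segmentation tool, and the stop-word list is a placeholder.

```python
import re

import jieba  # open-source Chinese word segmentation

# Placeholder stop-word list; a real pipeline would load a published one.
STOPWORDS = {"的", "了", "在", "是", "和"}

def preprocess(document: str) -> list[list[str]]:
    """Step S100 as sketched here: split a guarantee document into
    sentences, segment each sentence into words, and drop stop words."""
    sentences = re.split(r"[。！？；\n]", document)
    tokenized = []
    for sent in sentences:
        words = [w for w in jieba.lcut(sent.strip()) if w and w not in STOPWORDS]
        if words:
            tokenized.append(words)
    return tokenized

# Each preprocessed document is then paired with its topic-category label,
# e.g. ("falling from height", tokens), to form the initial training data.
```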
- the generator G uses the ALBERT model, and the input is the text sentence and its topic type label.
- the ALBERT model is a lightweight, optimized version of the classic NLP model BERT. It reduces the number of model parameters by linearly decoupling the input matrix, i.e., factorizing the input matrix of the original BERT model into the product of two smaller matrices. The degree of optimization depends on the total number of words the BERT model has to process: the larger the total, the more pronounced the acceleration, generally an 8-10x increase in computation speed.
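The factorization described above can be sketched in PyTorch as follows; the vocabulary and dimension sizes are illustrative assumptions, not values from the patent.

```python
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 30000, 128, 768  # illustrative sizes

# Original BERT input matrix: one (vocab_size x hidden_dim) embedding.
bert_embedding = nn.Embedding(vocab_size, hidden_dim)

# ALBERT-style factorization: the same mapping as the product of a
# (vocab_size x embed_dim) matrix and an (embed_dim x hidden_dim) matrix.
albert_embedding = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, hidden_dim, bias=False),
)

# Parameter count: 30000*768 = 23.0M vs 30000*128 + 128*768 = 3.9M.
```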
- pre-trained word vectors (word2vec or GloVe) are first used to convert the words in the text sentence, according to the word-segmentation results, into semantic numerical vectors, which are then fed into the ALBERT model.
- apart from the decoupling optimization at the input end, this model is identical to the BERT model. It receives the whole-sentence input text through its self-attention encoder and outputs the encoder latent vector, which is then fed into its decoder.
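A minimal sketch of this word-to-vector conversion using gensim; "zh_word2vec.txt" is a hypothetical placeholder for any pre-trained Chinese word2vec/GloVe file stored in word2vec text format.

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical file name; any pre-trained vectors in this format would do.
kv = KeyedVectors.load_word2vec_format("zh_word2vec.txt", binary=False)

def sentence_to_vectors(words: list[str]) -> np.ndarray:
    """Map segmented words to pre-trained semantic vectors; words missing
    from the vocabulary fall back to a zero vector."""
    dim = kv.vector_size
    if not words:
        return np.zeros((0, dim))
    return np.stack([kv[w] if w in kv else np.zeros(dim) for w in words])
```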
- in the decoding stage, one word is input at a time; the guarantee generation model computes, for the current state, the probability of every word in the corpus and makes a selection, and the decoder then predicts the next word, achieving continuous text generation. The decoder finally outputs a complete sentence for the construction-project electronic letter of guarantee.
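The word-by-word decoding loop can be sketched as follows; `model` is assumed to return next-word logits over the corpus vocabulary, and greedy selection stands in for whatever selection rule the patent leaves unspecified.

```python
import torch

@torch.no_grad()
def generate_sentence(model, start_ids: list[int], end_id: int, max_len: int = 64) -> list[int]:
    """Word-by-word decoding: at each step the model scores every word in
    the corpus vocabulary for the current state, and the decoder emits the
    chosen next word until the sentence is complete."""
    ids = list(start_ids)
    for _ in range(max_len):
        logits = model(torch.tensor([ids]))    # (1, seq_len, vocab_size)
        next_id = int(logits[0, -1].argmax())  # greedy pick of the next word
        ids.append(next_id)
        if next_id == end_id:                  # end-of-sentence token emitted
            break
    return ids
```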
- the generator G inputs the output result into the discriminator D, whose purpose is to determine whether the generated sequence comes from the real data set, thereby guiding the improvement and optimization of the generator G;
- the discriminator D adopts the ELMo model, which contains a multi-layer bidirectional long short-term memory network (Bi-LSTM) that captures the semantics of sequential text well without losing information about words at earlier positions in the sequence;
- the ELMo model receives the serialized text generated by the generator G and outputs a vector whose dimension equals the number of existing topic types; the probability of each topic type is calculated through Softmax, the topic type with the highest probability is selected and compared with the actual topic-type label, and the discriminator D and the generator G are updated simultaneously through cross-entropy loss back-propagation, thereby training the guarantee generation model.
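A minimal PyTorch sketch of the discriminator's classification and update step. The sizes are illustrative assumptions; a real ELMo also uses character-level convolutions, and the patent does not specify how the gradient reaches the generator through discrete text, so only the discriminator's update is shown.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Topic classifier in the spirit of the patent's discriminator D:
    a multi-layer bidirectional LSTM over the generated sequence with a
    projection to one score per topic type."""
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int, num_topics: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden_dim, num_topics)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        out, _ = self.bilstm(self.embed(token_ids))
        return self.head(out[:, -1])  # logits over topic types

disc = Discriminator(vocab_size=30000, embed_dim=128, hidden_dim=256, num_topics=6)
opt = torch.optim.Adam(disc.parameters())
loss_fn = nn.CrossEntropyLoss()  # Softmax + cross-entropy over topic types

tokens = torch.randint(0, 30000, (4, 32))  # a batch of generated word ids
labels = torch.randint(0, 6, (4,))         # the actual topic-type labels
loss = loss_fn(disc(tokens), labels)       # compare prediction with label
opt.zero_grad(); loss.backward(); opt.step()
```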
- the guarantee generation model of the present invention takes into account the general structure of an engineering electronic guarantee, including its various fill-in-the-blank fields; for example, blanks such as the insured's name and age, together with the way the blanks are expressed, are included in the training data, so the model can learn them and generate them automatically.
- the model of the present invention adds a penalty term when the discriminator D calculates the loss function, indicating whether the sentence to be judged contains the a priori high-frequency words; if it does, it further checks whether they appear in the expected order and imposes a penalty proportional to the difference in order, assigning a higher loss to generated sentences that do not conform to the empirical rules.
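The order-sensitive penalty can be sketched as follows; the inversion count is one plausible reading of "penalty based on the difference in order", since the patent does not give the exact functional form.

```python
def order_penalty(sentence: list[str], prior_words: list[str], weight: float = 1.0) -> float:
    """Penalty term over a priori high-frequency words: if the sentence
    contains them, count the pairs that appear out of their expected
    order; more inversions mean a higher added loss."""
    positions = [sentence.index(w) for w in prior_words if w in sentence]
    inversions = sum(
        1
        for i in range(len(positions))
        for j in range(i + 1, len(positions))
        if positions[i] > positions[j]
    )
    return weight * inversions

# Added to the discriminator's loss so that generated sentences violating
# the empirical word order receive a higher total loss:
# total_loss = cross_entropy + order_penalty(generated_words, PRIOR_WORDS)
```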
- an apparatus for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning, comprising:
- a data input unit 201, used to input the acquired initial training data into the generator G in the letter of guarantee generation model;
- a data receiving unit 202, used for the generator G to receive the whole-sentence input text in the initial training data through its self-attention encoder and to output the encoder latent vector, which is then fed into the decoder in the generator G;
- a decoding unit 203, used for the decoder to predict, in the decoding stage, the next word from each word that is input, and finally to output a complete sentence for the electronic letter of guarantee.
- the apparatus for automatically generating electronic letters of guarantee based on sequence adversarial and a priori reasoning in the embodiment of the present invention builds on a sequence generative adversarial network and the prior knowledge in a knowledge graph; by learning the form, style, and structure of electronic guarantee texts written in natural language, it realizes automatic generation of a letter of guarantee for a given insurance type, saving insurance managers time.
- the device further includes:
- the training unit 204 is used for the generator G to input the output result to the discriminator D in the guarantee generation model to train the guarantee generation model.
- the device further includes:
- the data acquisition unit 200 is used to sort out the subject types of current electronic guarantees for construction projects based on the experience of safety risk management of construction projects; collect existing electronic guarantee documents of different subject types, extract the text of the electronic guarantee documents and use open source Chinese natural language processing tools for pre-processing, and annotate the text according to the subject category to which it belongs as initial training data.
- the apparatus for automatically generating electronic guarantees for construction projects based on sequence adversarial and a priori reasoning builds on a sequence generative adversarial network and the prior knowledge in a knowledge graph; by learning the form, style, and structure of electronic guarantee texts written in natural language, it realizes automatic guarantee generation for a given insurance type (i.e., topic), saving insurance managers time. It specifically operates as follows:
- Data acquisition unit 200: based on experience in construction-project safety risk management, sort out the topic types of current electronic guarantees for construction projects, for example death, serious injury, and minor injury by severity, and falling from height, being struck by objects, electric shock, etc. by accident type; collect existing electronic guarantee documents of the different topic types, extract their text, and pre-process it using open-source Chinese natural language processing tools such as HanLP and jieba.
- the pre-processing mainly includes sentence segmentation, word segmentation, and stop-word removal.
- each text is annotated with the topic category to which it belongs, forming the initial training data.
- Data input unit 201: given the initial training data, use it to train the generator G in the guarantee generation model; the generator G uses the ALBERT model, and the input is a text sentence and its topic-type label.
- the ALBERT model is a lightweight, optimized version of the classic NLP model BERT. It reduces the number of model parameters by linearly decoupling the input matrix, i.e., factorizing the input matrix of the original BERT model into the product of two smaller matrices. The degree of optimization depends on the total number of words the BERT model has to process: the larger the total, the more pronounced the acceleration, generally an 8-10x increase in computation speed.
- Data receiving unit 202: first use pre-trained word vectors (word2vec or GloVe) to convert the words in the text sentence, according to the word-segmentation results, into semantic numerical vectors, and then feed them into the ALBERT model.
- apart from the decoupling optimization at the input end, this model is identical to the BERT model. It receives the whole-sentence input text through its self-attention encoder and outputs the encoder latent vector, which is then fed into its decoder.
- Decoding unit 203: in the decoding stage, one word is input at a time; the guarantee generation model computes, for the current state, the probability of every word in the corpus and makes a selection, and the decoder then predicts the next word, achieving continuous text generation; the decoder finally outputs a complete sentence for the construction-project electronic letter of guarantee.
- Training unit 204: the generator G inputs the output result into the discriminator D, whose purpose is to determine whether the generated sequence comes from a real data set, thereby guiding the improvement and optimization of the generator G; the discriminator D adopts the ELMo model, which contains a multi-layer bidirectional long short-term memory network (Bi-LSTM) that captures the semantics of sequential text well without losing information about words at earlier positions in the sequence; the ELMo model receives the serialized text generated by the generator G and outputs a vector whose dimension equals the number of existing topic types, the probability of each topic type is calculated through Softmax, the topic type with the highest probability is selected and compared with the actual topic-type label, and the discriminator D and the generator G are updated simultaneously through cross-entropy loss back-propagation, thereby training the guarantee generation model.
- the guarantee generation model of the present invention takes into account the general structure of an engineering electronic guarantee, including its various fill-in-the-blank fields; for example, blanks such as the insured's name and age, together with the way the blanks are expressed, are included in the training data, so the model can learn them and generate them automatically.
- the model of the present invention adds a penalty term when the discriminator D calculates the loss function, indicating whether the sentence to be judged contains the a priori high-frequency words; if it does, it further checks whether they appear in the expected order and imposes a penalty proportional to the difference in order, assigning a higher loss to generated sentences that do not conform to the empirical rules.
- a storage medium storing a program file capable of implementing any one of the above-mentioned methods for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning.
- a processor is used to run a program, wherein, when the program runs, any one of the above-mentioned methods for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning is executed.
- the disclosed technical content can be implemented in other ways.
- the system embodiments described above are merely illustrative.
- the division into units may be merely a division by logical function; other divisions are possible in actual implementation.
- multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
- in addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be realized through certain interfaces, and the indirect coupling or communication connection between units or modules may be electrical or take other forms.
- the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of this embodiment.
- each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
- the above-mentioned integrated unit may be implemented in the form of hardware or in the form of software functional units.
- if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
- the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present invention.
- the aforementioned storage medium includes media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Business, Economics & Management (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Databases & Information Systems (AREA)
- Accounting & Taxation (AREA)
- Development Economics (AREA)
- Economics (AREA)
- Finance (AREA)
- Marketing (AREA)
- Strategic Management (AREA)
- Technology Law (AREA)
- General Business, Economics & Management (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)
- Machine Translation (AREA)
Abstract
Description
The present invention relates to the field of electronic letters of guarantee, and in particular to a method and device for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning.

A letter of guarantee is a written credit guarantee certificate issued by a bank, insurance company, guarantee company, or individual to a third party at the request of an applicant. However, as the number of electronic letters of guarantee grows, the business information related to them is difficult to display centrally, and tracking a guarantee's validity period and querying its classified information still require manual intervention. The traditional electronic letter of guarantee changes only the form in which the guarantee is presented; the management of the guarantee and its related business information differs little from the original offline paper guarantee, and the entire issuance process consumes substantial time and labor, making it inefficient.

The embodiments of the present invention provide a method and device for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning, so as to at least solve the technical problem of the low efficiency with which existing electronic letters of guarantee are issued.
According to an embodiment of the present invention, a method for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning is provided, comprising the following steps:

S101: input the acquired initial training data into the generator G in the guarantee generation model;

S102: the generator G receives the whole-sentence input text in the initial training data through its self-attention encoder and outputs the encoder latent vector, which is then fed into the decoder in the generator G;

S103: in the decoding stage, the decoder predicts the next word from each word that is input, and finally outputs a complete sentence for the electronic letter of guarantee.

Further, after step S103, the method further includes:

S104: the generator G inputs its output to the discriminator D in the guarantee generation model to train the guarantee generation model.

Further, step S104 specifically includes:

the discriminator D adopts the ELMo model; the ELMo model receives the serialized text generated by the generator G and outputs a vector whose dimension equals the number of existing topic types; the probability of each topic type is calculated through Softmax, the topic type with the highest probability is selected and compared with the actual topic-type label, and the discriminator D and the generator G are updated simultaneously through cross-entropy loss back-propagation, training the guarantee generation model.

Further, when calculating the loss function, the discriminator D adds a penalty term indicating whether the sentence to be judged contains the a priori high-frequency words; if it does, the discriminator further checks whether they appear in the expected order and imposes a penalty proportional to the difference in order, assigning a higher loss to generated sentences that do not conform to the empirical rules.

Further, the generator G uses the ALBERT model, and the input is a text sentence and its topic-type label; the ALBERT model is configured as a lightweight, optimized version of the classic NLP model BERT, which reduces the number of model parameters by linearly decoupling the input matrix, i.e., factorizing the input matrix of the original BERT model into the product of two smaller matrices.

Further, step S102 specifically includes:

the generator G first uses pre-trained word vectors to convert the words in the text sentence, according to the word-segmentation results, into semantic numerical vectors, which are then fed into the ALBERT model; the model receives the whole-sentence input text through its self-attention encoder and outputs the encoder latent vector, which is then fed into its decoder.

Further, before step S101, the method further includes:

S100: based on experience in construction-project safety risk management, sort out the topic types of current electronic guarantees for construction projects; collect existing electronic guarantee documents of the different topic types, extract their text and pre-process it using open-source Chinese natural language processing tools, and annotate each text with the topic category to which it belongs, as the initial training data.
According to another embodiment of the present invention, an apparatus for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning is provided, comprising:

a data input unit, used to input the acquired initial training data into the generator G in the guarantee generation model;

a data receiving unit, used for the generator G to receive the whole-sentence input text in the initial training data through its self-attention encoder and to output the encoder latent vector, which is then fed into the decoder in the generator G;

a decoding unit, used for the decoder to predict, in the decoding stage, the next word from each word that is input, and finally to output a complete sentence for the electronic letter of guarantee.

Further, the apparatus also includes:

a training unit, used for the generator G to input its output to the discriminator D in the guarantee generation model to train the guarantee generation model.

Further, the apparatus also includes:

a data acquisition unit, used to sort out the topic types of current electronic guarantees for construction projects based on experience in construction-project safety risk management, to collect existing electronic guarantee documents of the different topic types, extract their text and pre-process it using open-source Chinese natural language processing tools, and to annotate each text with the topic category to which it belongs, as the initial training data.

A storage medium stores a program file capable of implementing any one of the above methods for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning.

A processor is used to run a program, wherein, when the program runs, any one of the above methods for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning is executed.
The method and device for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning in the embodiments of the present invention build on a sequence generative adversarial network and the prior knowledge in a knowledge graph. By learning the form, style, and structure of electronic guarantee texts written in natural language, they realize automatic generation of a letter of guarantee for a given insurance type, saving insurance managers time.

The drawings described here are provided for a further understanding of the present invention and constitute a part of this application; the exemplary embodiments of the present invention and their descriptions serve to explain the present invention and do not unduly limit it. In the drawings:

FIG. 1 is a flow chart of the method for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning according to the present invention;

FIG. 2 is a preferred flow chart of the method for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning according to the present invention;

FIG. 3 is another preferred flow chart of the method for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning according to the present invention;

FIG. 4 is a module diagram of the apparatus for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning according to the present invention;

FIG. 5 is a preferred module diagram of the apparatus for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning according to the present invention;

FIG. 6 is another preferred module diagram of the apparatus for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning according to the present invention.
To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the scope of protection of the present invention.

It should be noted that the terms "first", "second", and the like in the specification, claims, and drawings of the present invention are used to distinguish similar objects and need not describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in orders other than those illustrated or described here. In addition, the terms "including" and "having" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.
Example 1

According to an embodiment of the present invention, a method for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning is provided; referring to FIG. 1, it includes the following steps:

S101: input the acquired initial training data into the generator G in the guarantee generation model;

S102: the generator G receives the whole-sentence input text in the initial training data through its self-attention encoder and outputs the encoder latent vector, which is then fed into the decoder in the generator G;

S103: in the decoding stage, the decoder predicts the next word from each word that is input, and finally outputs a complete sentence for the electronic letter of guarantee.

The method for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning in the embodiment of the present invention builds on a sequence generative adversarial network and the prior knowledge in a knowledge graph. By learning the form, style, and structure of electronic guarantee texts written in natural language, it realizes automatic generation of a letter of guarantee for a given insurance type, saving insurance managers time.

After step S103, referring to FIG. 2, the method further includes:

S104: the generator G inputs its output to the discriminator D in the guarantee generation model to train the guarantee generation model.

Step S104 specifically includes:

the discriminator D adopts the ELMo model; the ELMo model receives the serialized text generated by the generator G and outputs a vector whose dimension equals the number of existing topic types; the probability of each topic type is calculated through Softmax, the topic type with the highest probability is selected and compared with the actual topic-type label, and the discriminator D and the generator G are updated simultaneously through cross-entropy loss back-propagation, training the guarantee generation model.

When calculating the loss function, the discriminator D adds a penalty term indicating whether the sentence to be judged contains the a priori high-frequency words; if it does, the discriminator further checks whether they appear in the expected order and imposes a penalty proportional to the difference in order, assigning a higher loss to generated sentences that do not conform to the empirical rules.

The generator G uses the ALBERT model, and the input is a text sentence and its topic-type label; the ALBERT model is configured as a lightweight, optimized version of the classic NLP model BERT, which reduces the number of model parameters by linearly decoupling the input matrix, i.e., factorizing the input matrix of the original BERT model into the product of two smaller matrices.

Step S102 specifically includes:

the generator G first uses pre-trained word vectors to convert the words in the text sentence, according to the word-segmentation results, into semantic numerical vectors, which are then fed into the ALBERT model; the model receives the whole-sentence input text through its self-attention encoder and outputs the encoder latent vector, which is then fed into its decoder.

Before step S101, referring to FIG. 3, the method further includes:

S100: based on experience in construction-project safety risk management, sort out the topic types of current electronic guarantees for construction projects; collect existing electronic guarantee documents of the different topic types, extract their text and pre-process it using open-source Chinese natural language processing tools, and annotate each text with the topic category to which it belongs, as the initial training data.
The method for automatically generating an electronic letter of guarantee based on sequence adversarial and a priori reasoning of the present invention is described in detail below through a specific embodiment:

The method for automatically generating electronic letters of guarantee for construction projects based on sequence adversarial and a priori reasoning builds on a sequence generative adversarial network and the prior knowledge in a knowledge graph. By learning the form, style, and structure of electronic guarantee texts written in natural language, it realizes automatic guarantee generation for a given insurance type (i.e., topic), saving insurance managers time. It specifically includes the following steps:

Based on experience in construction-project safety risk management, sort out the topic types of current electronic guarantees for construction projects; for example, by severity they can be divided into death, serious injury, and minor injury, and by accident type into falling from height, being struck by objects, electric shock, etc. Collect existing electronic guarantee documents of the different topic types, extract their text, and pre-process it using open-source Chinese natural language processing tools such as HanLP and jieba; the pre-processing mainly includes sentence segmentation, word segmentation, and stop-word removal. Annotate each text with the topic category to which it belongs, as the initial training data.

Given the initial training data, use it to train the generator G in the guarantee generation model. The generator G uses the ALBERT model, and the input is a text sentence and its topic-type label. The ALBERT model is a lightweight, optimized version of the classic NLP model BERT; it reduces the number of model parameters by linearly decoupling the input matrix, i.e., factorizing the input matrix of the original BERT model into the product of two smaller matrices. The degree of optimization depends on the total number of words the BERT model has to process: the larger the total, the more pronounced the acceleration, generally an 8-10x increase in computation speed.

First, pre-trained word vectors (word2vec or GloVe) are used to convert the words in the text sentence, according to the word-segmentation results, into semantic numerical vectors, which are then fed into the ALBERT model; apart from the decoupling optimization at the input end, this model is identical to the BERT model. It receives the whole-sentence input text through its self-attention encoder and outputs the encoder latent vector, which is then fed into its decoder. In the decoding stage, one word is input at a time; the guarantee generation model computes, for the current state, the probability of every word in the corpus and makes a selection, and the decoder then predicts the next word, achieving continuous text generation. The decoder finally outputs a complete sentence for the construction-project electronic letter of guarantee. The generator G inputs its output into the discriminator D, whose purpose is to judge whether the generated sequence comes from the real data set, thereby guiding the improvement and optimization of the generator G. The discriminator D adopts the ELMo model, which contains a multi-layer bidirectional long short-term memory network (Bi-LSTM) that captures the semantics of sequential text well without losing information about words at earlier positions in the sequence. The ELMo model receives the serialized text generated by the generator G and outputs a vector whose dimension equals the number of existing topic types; the probability of each topic type is calculated through Softmax, the topic type with the highest probability is selected and compared with the actual topic-type label, and the discriminator D and the generator G are updated simultaneously through cross-entropy loss back-propagation, thereby training the guarantee generation model.
The innovations and beneficial effects of the present invention are at least the following:

1) Unlike general text generation models based on generative adversarial networks, the guarantee generation model of the present invention takes into account the general structure of an engineering electronic guarantee, including its various fill-in-the-blank fields; for example, blanks such as the insured's name _____ and age _____, together with the way the blanks are expressed, are included in the training data, so the model can learn them and generate them automatically.

2) From prior knowledge it is known that, in the specific field of construction-project insurance, certain high-frequency words appearing in a fixed order indicate that a sentence is reasonable, while violating this regularity most likely means the sentence does not conform to the empirical rules of the field. The model of the present invention therefore adds a penalty term when the discriminator D calculates the loss function, indicating whether the sentence to be judged contains the a priori high-frequency words; if it does, the model further checks whether they appear in the expected order and imposes a penalty proportional to the difference in order, assigning a higher loss to generated sentences that do not conform to the empirical rules.
实施例2Example 2
根据本发明的另一实施例,提供了一种基于序列对抗和先验推理的电子保函自动生成装置,参见图4,包括:According to another embodiment of the present invention, there is provided an apparatus for automatically generating an electronic letter of guarantee based on sequence confrontation and a priori reasoning, as shown in FIG4 , comprising:
数据输入单元201,用于将获取到的初始训练数据,输入至保函生成模型中的生成器G;The data input unit 201 is used to input the acquired initial training data into the generator G in the letter of guarantee generation model;
数据接收单元202,用于生成器G通过其自注意力编码器接收初始训练数据中的整句输入文本,输出编码器隐向量,隐向量再输入至生成器G内的解码器;A data receiving unit 202 is used for the generator G to receive the whole sentence input text in the initial training data through its self-attention encoder, output the encoder latent vector, and the latent vector is then input to the decoder in the generator G;
解码单元203,用于解码器在解码阶段中,根据每次输入的一个字词就预测下一个字词,最终输出为一句完整的用于电子保函部分的语句。The decoding unit 203 is used for the decoder to predict the next word according to each word input during the decoding stage, and finally output a complete sentence for the electronic guarantee part.
本发明实施例中的基于序列对抗和先验推理的电子保函自动生成装置,基于序列对抗生成网络和知识图谱中的先验知识,通过学习电子自然语言书写的电子保函文本内容的形式、风格、结构,实现给定特定保险种类的自动保函生成,节约保险管理人员时间成本。The automatic generation device for electronic letters of guarantee based on sequence adversarial and prior reasoning in the embodiment of the present invention is based on the prior knowledge in the sequence adversarial generation network and the knowledge graph. By learning the form, style and structure of the electronic letter of guarantee text content written in electronic natural language, it can realize automatic generation of letters of guarantee for a given specific insurance type, thereby saving the time cost of insurance managers.
其中,参见图5,装置还包括:Wherein, referring to FIG5, the device further includes:
训练单元204,用于生成器G将输出结果输入至保函生成模型中的判别器D,对保函生成模型进行训练。The training unit 204 is used for the generator G to input the output result to the discriminator D in the guarantee generation model to train the guarantee generation model.
其中,参见图6,装置还包括:Wherein, referring to FIG6 , the device further includes:
数据获取单元200,用于按建设项目安全风险管理经验,梳理当前建设工程电子保函的主题类型;采集不同主题类型的已有电子保函文件,提取电子保函文件的文本并使用开源中文自然语言处理工具进行预处理,按文本所属主题类别进行标注,作为初始训练数据。The data acquisition unit 200 is used to sort out the subject types of current electronic guarantees for construction projects based on the experience of safety risk management of construction projects; collect existing electronic guarantee documents of different subject types, extract the text of the electronic guarantee documents and use open source Chinese natural language processing tools for pre-processing, and annotate the text according to the subject category to which it belongs as initial training data.
下面以具体实施例,对本发明的基于序列对抗和先验推理的电子保函自动生成装置进行详细说明:The following is a detailed description of the electronic guarantee automatic generation device based on sequence confrontation and prior reasoning of the present invention with a specific embodiment:
本发明的基于序列对抗和先验推理的建设工程电子保函自动生成装置基于序列对抗生成网络和知识图谱中的先验知识,通过学习电子自然语言书写的电子保函文本内容的形式、风格、结构,实现给定特定保险种类(即主题)的自动保函生成,节约保险管理人员时间成本。具体包括如下步骤:The automatic generation device of electronic guarantee for construction projects based on sequence adversarial and prior reasoning is based on the prior knowledge in the sequence adversarial generation network and the knowledge graph. By learning the form, style and structure of the electronic guarantee text content written in electronic natural language, it can realize the automatic generation of guarantee for a given specific insurance type (i.e., subject), saving the time cost of insurance managers. It specifically includes the following steps:
数据获取单元200:按建设项目安全风险管理经验,梳理当前建设工程电子保函的主题类型,如按严重程度可以分为死亡、重伤、轻伤,按事故类型可以分为高空坠落、物体打击、触电等;采集不同主题类型的已有电子保函文件,提取电子保函文件的文本并使用开源中文自然语言处理工具如hanNLP和Jeba等进行预处理,预处理主要包括分句、分词、停用词去除等,按文本所属主题类别进行标注,作为初始训练数据。Data acquisition unit 200: Based on the experience of construction project safety risk management, sort out the subject types of current electronic guarantees for construction projects, such as death, serious injury, and minor injury according to severity, and fall from height, object impact, electric shock, etc. according to accident type; collect existing electronic guarantee documents of different subject types, extract the text of the electronic guarantee documents and use open source Chinese natural language processing tools such as hanNLP and Jeba for preprocessing. The preprocessing mainly includes sentence segmentation, word segmentation, stop word removal, etc. The text is annotated according to the subject category to which it belongs as initial training data.
Data input unit 201: Given the initial training data, use it to train the generator G of the guarantee generation model. The generator G uses the ALBERT model, whose input is a text sentence together with its subject-type label. ALBERT is a lightweight optimization of the classic NLP model BERT: it reduces the number of parameters by factorizing the input embedding matrix, i.e., replacing the original BERT embedding matrix with the product of two smaller matrices. The benefit grows with the total number of tokens the model must handle; the higher the total, the more pronounced the speed-up, which can generally raise computation speed by a factor of 8 to 10.
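The parameter saving from this factorization can be shown in a short sketch; the dimensions below are the ones commonly quoted for ALBERT and are illustrative, not taken from the patent.

```python
# Sketch of ALBERT-style factorized embeddings (PyTorch); dimensions illustrative.
import torch.nn as nn

V, E, H = 30000, 128, 768  # vocabulary size, small embedding dim, hidden dim

# BERT-style: one V x H table -> 30000 * 768 = about 23.0M parameters
bert_embedding = nn.Embedding(V, H)

# ALBERT-style: a V x E table followed by an E x H projection
# -> 30000 * 128 + 128 * 768 = about 3.9M parameters for the same output shape
albert_embedding = nn.Sequential(nn.Embedding(V, E), nn.Linear(E, H, bias=False))
```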
Data receiving unit 202: First, use pretrained word vectors (word2vec or GloVe) to map the words of the segmented sentence to semantic numerical vectors, then feed them into the ALBERT model. Apart from the decoupled embedding at the input end, the model is identical to BERT: its self-attention encoder receives the full input sentence and outputs the encoder hidden vector, which is then passed to the decoder.
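A sketch of this lookup step, assuming gensim KeyedVectors for the pretrained word2vec or GloVe vectors; the vector file name is an illustrative placeholder.

```python
# Sketch: mapping segmented words to pretrained vectors before the encoder.
# Assumption: gensim KeyedVectors; the vector file name is a placeholder.
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load("zh_word_vectors.kv")  # pretrained word2vec or GloVe

def embed_sentence(tokens):
    """Return a (seq_len, dim) matrix of word vectors; OOV words map to zeros."""
    dim = wv.vector_size
    return np.stack([wv[w] if w in wv else np.zeros(dim) for w in tokens])
```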
Decoding unit 203: At each decoding step one word is input; the guarantee generation model computes the probability of every word in the corpus given the current state and selects among them, after which the decoder predicts the next word. This yields continuous text generation, and the decoder finally outputs a complete sentence for a section of the construction-project electronic guarantee.

Training unit 204: The generator G feeds its output into the discriminator D, whose purpose is to judge whether the generated sequence comes from the real data set, thereby guiding the improvement and optimization of G. The discriminator D adopts the ELMo model, which contains a multi-layer bidirectional long short-term memory network (Bi-LSTM) that captures the semantics of sequential text well without losing information about words early in the sequence. The ELMo model receives the serialized text produced by G and outputs a vector whose dimension equals the number of subject types; Softmax yields the probability of each subject, the highest-probability subject is compared with the actual subject label, and cross-entropy loss is backpropagated to update the discriminator D and generator G simultaneously, thereby training the guarantee generation model.
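The adversarial loop can be sketched as a minimal PyTorch stand-in: the Bi-LSTM classifier approximates the ELMo discriminator, `generator.sample()` is an assumed interface, and the discrete-sampling details that make generated text differentiable (e.g. Gumbel-softmax or policy gradients) are omitted.

```python
# Minimal sketch of one adversarial training step (PyTorch). The Bi-LSTM
# classifier stands in for the ELMo discriminator; `generator.sample()` is an
# assumed interface returning differentiable embeddings of a generated sentence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicDiscriminator(nn.Module):
    """Bi-LSTM over the generated sequence with topic-type logits on top."""
    def __init__(self, emb_dim=128, hidden=256, n_topics=6):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_topics)

    def forward(self, seq_emb):              # (batch, seq_len, emb_dim)
        out, _ = self.lstm(seq_emb)
        return self.head(out[:, -1, :])      # logits over subject types

def train_step(generator, discriminator, optimizer, topic_labels):
    seq_emb = generator.sample()             # generated guarantee sentence
    logits = discriminator(seq_emb)
    loss = F.cross_entropy(logits, topic_labels)
    optimizer.zero_grad()
    loss.backward()                          # gradients reach both D and G
    optimizer.step()
    return loss.item()
```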
The innovations and beneficial effects of the present invention include at least the following:
1) Unlike generic text generation models based on generative adversarial networks, the guarantee generation model of the present invention takes into account the general structure of an engineering electronic guarantee, including its various fill-in-the-blank fields. For example, blanks such as the insured's name _____ and age _____, together with the way such blanks are expressed, are all included in the training data, so the model can learn them and generate them automatically.
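One simple way to let the model treat blanks as learnable structure is to normalize every underscore run to a single literal token during preprocessing; the marker string below is an illustrative choice, not the patent's notation.

```python
# Sketch: normalizing fill-in-the-blank markers to one literal token so the
# generator can learn to emit them. The BLANK string is an illustrative choice.
import re

BLANK = "［空］"  # single stand-in token for any underscore run

def normalize_blanks(text: str) -> str:
    return re.sub(r"[_＿]{2,}", BLANK, text)

# normalize_blanks("投保人姓名_____，年龄_____")
# -> "投保人姓名［空］，年龄［空］"
```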
2) From prior knowledge it is known that, in the specific field of construction-project insurance, certain high-frequency words appearing in a fixed order indicate that a sentence is reasonable, while violating this pattern most likely means the sentence does not conform to the field's empirical rules. The model of the present invention therefore adds a penalty term to the loss function computed by the discriminator D: it checks whether the sentence under judgment contains the prior high-frequency words and, if so, whether they appear in the expected order, penalizing according to the order discrepancy so that generated sentences violating the empirical rules incur a higher loss.
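This penalty can be sketched as an inversion count over the expected word order; the high-frequency word list and weight are illustrative assumptions, not taken from the patent.

```python
# Sketch of the prior-knowledge penalty: sentences whose prior high-frequency
# words appear out of the expected order incur extra loss. The word list and
# weight are illustrative assumptions.
PRIOR_ORDER = ["担保人", "受益人", "担保金额", "有效期"]  # expected order

def prior_penalty(tokens, weight=1.0):
    """Zero if the prior words present appear in the expected relative order;
    otherwise grows with the number of inverted pairs."""
    positions = [tokens.index(w) for w in PRIOR_ORDER if w in tokens]
    inversions = sum(1 for i in range(len(positions))
                     for j in range(i + 1, len(positions))
                     if positions[i] > positions[j])
    return weight * inversions
```

During training this term would simply be added to the discriminator's cross-entropy loss, so that rule-violating generations incur the higher loss described above.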
Example 3
A storage medium storing a program file capable of implementing any one of the above methods for automatically generating an electronic letter of guarantee based on sequence adversarial and prior reasoning.
Example 4
A processor for running a program, wherein, when the program runs, any one of the above methods for automatically generating an electronic letter of guarantee based on sequence adversarial and prior reasoning is executed.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merit of the embodiments.
In the above embodiments of the present invention, each embodiment is described with its own emphasis; for parts not detailed in one embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The system embodiments described above are merely illustrative; for example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual coupling, direct coupling, or communication connections shown or discussed may be indirect coupling or communication connections through interfaces, units, or modules, and may be electrical or take other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may each exist physically as a separate unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware or as a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention, in essence or in the part contributing over the prior art, in whole or in part, may be embodied as a software product stored on a storage medium and comprising a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes media capable of storing program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, or optical disc.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make further improvements and refinements without departing from the principle of the present invention, and such improvements and refinements shall also fall within the scope of protection of the present invention.
Claims (10)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211178562.6A CN115481630A (en) | 2022-09-27 | 2022-09-27 | Method and device for automatic generation of electronic letter of guarantee based on sequence confrontation and prior reasoning |
| CN202211178562.6 | 2022-09-27 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024066041A1 true WO2024066041A1 (en) | 2024-04-04 |
Family
ID=84394732
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2022/137058 Ceased WO2024066041A1 (en) | 2022-09-27 | 2022-12-06 | Electronic letter of guarantee automatic generation method and apparatus based on sequence adversary and priori reasoning |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN115481630A (en) |
| WO (1) | WO2024066041A1 (en) |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2019179100A1 (en) * | 2018-03-20 | 2019-09-26 | 苏州大学张家港工业技术研究院 | Medical text generation method based on generative adversarial network technology |
| CN111046178A (en) * | 2019-11-29 | 2020-04-21 | 北京邮电大学 | Text sequence generation method and system |
| US20200134415A1 (en) * | 2018-10-30 | 2020-04-30 | Huawei Technologies Co., Ltd. | Autoencoder-Based Generative Adversarial Networks for Text Generation |
| US20200134463A1 (en) * | 2018-10-30 | 2020-04-30 | Huawei Technologies Co., Ltd. | Latent Space and Text-Based Generative Adversarial Networks (LATEXT-GANs) for Text Generation |
| CN111858931A (en) * | 2020-07-08 | 2020-10-30 | 华中师范大学 | A text generation method based on deep learning |
| WO2021174827A1 (en) * | 2020-03-02 | 2021-09-10 | 平安科技(深圳)有限公司 | Text generation method and appartus, computer device and readable storage medium |
| WO2021223287A1 (en) * | 2020-05-06 | 2021-11-11 | 首都师范大学 | Focalgan-based short text automatic generation method, apparatus, and device, and storage medium |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108334497A (en) * | 2018-02-06 | 2018-07-27 | 北京航空航天大学 | The method and apparatus for automatically generating text |
| CN113159568A (en) * | 2021-04-19 | 2021-07-23 | 福建万川信息科技股份有限公司 | System and method for estimating insurance risk |
| CN114297473B (en) * | 2021-11-25 | 2024-10-15 | 北京邮电大学 | News event searching method and system based on multistage image-text semantic alignment model |
- 2022-09-27: Priority application CN202211178562.6A filed in China; published as CN115481630A (status: pending)
- 2022-12-06: PCT application PCT/CN2022/137058 filed; published as WO2024066041A1 (status: ceased)
Patent Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2019179100A1 (en) * | 2018-03-20 | 2019-09-26 | 苏州大学张家港工业技术研究院 | Medical text generation method based on generative adversarial network technology |
| US20200134415A1 (en) * | 2018-10-30 | 2020-04-30 | Huawei Technologies Co., Ltd. | Autoencoder-Based Generative Adversarial Networks for Text Generation |
| US20200134463A1 (en) * | 2018-10-30 | 2020-04-30 | Huawei Technologies Co., Ltd. | Latent Space and Text-Based Generative Adversarial Networks (LATEXT-GANs) for Text Generation |
| CN111046178A (en) * | 2019-11-29 | 2020-04-21 | 北京邮电大学 | Text sequence generation method and system |
| WO2021174827A1 (en) * | 2020-03-02 | 2021-09-10 | 平安科技(深圳)有限公司 | Text generation method and appartus, computer device and readable storage medium |
| WO2021223287A1 (en) * | 2020-05-06 | 2021-11-11 | 首都师范大学 | Focalgan-based short text automatic generation method, apparatus, and device, and storage medium |
| CN111858931A (en) * | 2020-07-08 | 2020-10-30 | 华中师范大学 | A text generation method based on deep learning |
Non-Patent Citations (1)
| Title |
|---|
| ZHAO JUNBO, MATHIEU MICHAEL, LECUN YANN: "ENERGY-BASED GENERATIVE ADVERSARIAL NETWORKS", ICLR 2017, 6 March 2017 (2017-03-06), pages 1 - 17, XP093151613, Retrieved from the Internet <URL:https://arxiv.org/pdf/1609.03126.pdf> [retrieved on 20240415] * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN115481630A (en) | 2022-12-16 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Afsharizadeh et al. | Query-oriented text summarization using sentence extraction technique | |
| Zolotareva et al. | Abstractive Text Summarization using Transfer Learning. | |
| Fang et al. | Domain adaptation for sentiment classification in light of multiple sources | |
| CN114818717B (en) | Chinese named entity recognition method and system integrating vocabulary and syntax information | |
| CN108920599B (en) | Question-answering system answer accurate positioning and extraction method based on knowledge ontology base | |
| WO2024036840A1 (en) | Open-domain dialogue reply method and system based on topic enhancement | |
| Sahu et al. | Prashnottar: a Hindi question answering system | |
| CN110555206A (en) | named entity identification method, device, equipment and storage medium | |
| CN114756663A (en) | A kind of intelligent question answering method, system, device and computer readable storage medium | |
| CN109918627A (en) | Document creation method, device, electronic equipment and storage medium | |
| Fern et al. | Text counterfactuals via latent optimization and shapley-guided search | |
| CN117436457B (en) | Irony identification method, irony identification device, computing equipment and storage medium | |
| Li et al. | Abstractive financial news summarization via transformer-BiLSTM encoder and graph attention-based decoder | |
| Li et al. | Conundrums in cross-prompt automated essay scoring: Making sense of the state of the art | |
| Priyadharshan et al. | Text summarization for Tamil online sports news using NLP | |
| Alshaina et al. | Multi-document abstractive summarization based on predicate argument structure | |
| Chen et al. | Dynamic transformers provide a false sense of efficiency | |
| Płonka et al. | A comparative evaluation of the effectiveness of document splitters for large language models in legal contexts | |
| Atwan et al. | The use of stemming in the Arabic text and its impact on the accuracy of classification | |
| Xu et al. | Short text classification of chinese with label information assisting | |
| Kumar et al. | Transformer-based models for language identification: A comparative study | |
| Charitha et al. | Extractive document summarization using a supervised learning approach | |
| CN116226323A (en) | Scoring function construction method and related device for semantic retrieval | |
| CN115905535A (en) | Contract classification method, system and related equipment based on deep learning | |
| Spiccia et al. | A word prediction methodology for automatic sentence completion |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22960636; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 22960636; Country of ref document: EP; Kind code of ref document: A1 |