CN116433799A

CN116433799A - A flow chart generation method and device based on semantic similarity and subgraph matching

Info

Publication number: CN116433799A
Application number: CN202310698508.2A
Authority: CN
Inventors: 袁水平; 董丙冰; 高元鑫; 吴信东
Original assignee: Anhui Sigao Intelligent Technology Co ltd
Current assignee: Anhui Sigao Intelligent Technology Co ltd
Priority date: 2023-06-14
Filing date: 2023-06-14
Publication date: 2023-07-14
Anticipated expiration: 2043-06-14
Also published as: CN116433799B

Abstract

The invention discloses a flowchart generation method based on semantic similarity and subgraph matching, comprising: obtaining a user requirement document and RPA project asset library documents; calculating the semantic similarity between the user requirement document and the RPA project asset library documents, and ranking by semantic similarity from high to low to obtain the top-k subgraphs to be matched; constructing a query knowledge graph from the user requirement document and setting a start node; and searching and matching the subgraphs to be matched against the query knowledge graph to obtain the final best-matching graph and return the corresponding flowchart. The technical solution reduces the search and traversal time of subgraph matching, applies constraints on both semantics and structure, and generates the flowchart required by the current user more accurately.

Description

A Flowchart Generation Method and Device Based on Semantic Similarity and Subgraph Matching

Technical Field

The invention belongs to the technical field of data processing, and in particular relates to a flowchart generation method and device based on semantic similarity and subgraph matching.

Background

When executing business processes, employees currently spend a great deal of time working with enterprise resource planning (ERP) systems, customer relationship management (CRM) systems, spreadsheets, and legacy systems, performing manual, repetitive tasks such as entering, copying, pasting, extracting, merging, and moving large amounts of data from one system to another. Some of these highly structured, routine, manual tasks can be handled by robots, which leaves knowledge workers more time for value-added tasks. Robotic process automation (RPA) has emerged as a software-based solution for automating rule-based business processes that involve routine tasks, structured data, and deterministic outcomes. The flowchart is an important part of RPA technology: a flowchart is drawn according to the user's requirements, and execution code is then generated from the flowchart to carry out the specified operations. Flowcharts are very helpful for understanding exactly how a process is carried out and for deciding how it should be improved. However, drawing flowcharts manually takes a long time and consumes considerable human resources. How to use the existing flowcharts in an RPA implementation library to automatically generate a flowchart for the current user requirement, thereby saving manpower and material resources and greatly improving the efficiency of RPA implementation, is a current research direction.

Summary of the Invention

In view of this, the present invention proposes a flowchart generation method based on semantic similarity and subgraph matching, comprising the following steps:

S1. Obtain a user requirement document and RPA project asset library documents;

S2. Calculate the semantic similarity between the user requirement document and the RPA project asset library documents, and rank by semantic similarity from high to low to obtain the top-k subgraphs to be matched;

S3. Construct a query knowledge graph from the user requirement document, and set a start node;

S4. Search and match the subgraphs to be matched against the query knowledge graph to obtain the final best-matching graph, and return the corresponding flowchart.

The present invention also proposes a flowchart generation device based on semantic similarity and subgraph matching, comprising:

a processor;

a memory on which is stored a computer program executable on the processor;

wherein, when the computer program is executed by the processor, the flowchart generation method based on semantic similarity and subgraph matching is implemented.

The beneficial effects of the technical solution provided by the invention are as follows:

The technical solution proposed by the present invention uses the user's requirement document to screen, by semantic similarity, the projects in the project library that are similar to the current requirement. At the same time, it builds the user's requirement document into a small requirement graph and matches it against the RPA project asset knowledge graphs by fuzzy subgraph matching, finding projects whose process structure is similar to that of the current user requirement. Semantics are used first for preliminary screening, and subgraph matching is then used for search and traversal. On the one hand, this reduces the search and traversal time of subgraph matching; on the other hand, the dual constraints of semantics and structure make the generated flowchart match the current user requirement more accurately. Flowcharts are generated automatically from user requirements, so users can use them directly or fine-tune them, which simplifies the drawing of flowcharts, greatly reduces manual intervention, and improves the efficiency of RPA.

Brief Description of the Drawings

Fig. 1 is a flowchart of a flowchart generation method based on semantic similarity and subgraph matching in an embodiment of the present invention;

Fig. 2 is the user requirement knowledge graph for Agricultural Bank of China online banking statement download, constructed in an embodiment of the present invention;

Fig. 3 is the knowledge graph of the RPA asset library project "Company B online banking statement download", constructed in an embodiment of the present invention.

Detailed Description

To make the purpose, technical solution, and advantages of the present invention clearer, the embodiments of the present invention are further described below with reference to the accompanying drawings.

Referring to Fig. 1, which is a flowchart of a flowchart generation method based on semantic similarity and subgraph matching in an embodiment of the present invention, the method comprises the following steps:

S1. Obtain a user requirement document and RPA project asset library documents.

In a further embodiment, the user's requirement is to download Agricultural Bank of China online banking statements.

S2. Calculate the semantic similarity between the user requirement document and the RPA project asset library documents, and rank by semantic similarity from high to low to obtain the top-k subgraphs to be matched.

Specifically:

S21. Perform word segmentation on the user requirement document Q and the RPA project asset library user documents $D_s$, and remove stop words.

In this embodiment, the word segmentation component jieba is used to segment the user requirement document Q and the RPA project asset library user documents $D_s$.
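Step S21 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the `segment` helper, its `cut` parameter, and the tiny stop-word list are hypothetical. The default splitter is plain whitespace splitting so the sketch runs without extra packages; for Chinese text, jieba's `jieba.lcut` could be passed as `cut` instead.

```python
from typing import Callable, Iterable, List, Set

# Hypothetical stop-word list for illustration only; a real system would load a full list.
STOP_WORDS: Set[str] = {"the", "a", "of", "and", "to"}


def segment(text: str,
            cut: Callable[[str], Iterable[str]] = str.split,
            stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Split `text` into tokens with `cut`, then drop empty tokens and stop words."""
    tokens = (tok.strip().lower() for tok in cut(text))
    return [tok for tok in tokens if tok and tok not in stop_words]


print(segment("Download the statement of the bank account"))
# ['download', 'statement', 'bank', 'account']
```

Passing the segmenter as a parameter keeps the stop-word filtering independent of the tokenizer that is actually used.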

S22. Describe each document by the frequency with which words occur in its text, representing the user requirement document Q and each RPA project asset library user document $D_s$ as a one-dimensional vector.

In this embodiment, the normalized bag-of-words (nBOW) model is used to represent Q and $D_s$ respectively.
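A normalized bag-of-words representation can be sketched in a few lines. The function name `nbow` and the sample tokens are illustrative assumptions; the only thing taken from the text is that each word's count is divided by the total number of word occurrences in the document.

```python
from collections import Counter
from typing import Dict, List


def nbow(tokens: List[str]) -> Dict[str, float]:
    """Map each distinct word to its count divided by the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


q_vector = nbow(["bank", "statement", "download", "bank"])
print(q_vector)
# {'bank': 0.5, 'statement': 0.25, 'download': 0.25}
```

The resulting weights sum to 1, which is what later allows them to serve as the amounts of "mass" to be moved in the WMD computation.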

S23. Based on the one-dimensional vectors obtained in S22, learn an embedding vector for each word of the user requirement document Q and for each word of the RPA project asset library user documents $D_s$; in this vector space, semantically similar words lie close to each other.

In this embodiment, the word2vec method is used to learn the embedding vector of each word of the user requirement document Q and of each word of the RPA project asset library user documents $D_s$.

S24. Use the WMD (Word Mover's Distance) algorithm to calculate the text similarity between the user requirement document Q and the RPA project asset library user documents $D_s$. The WMD algorithm models the distance between two documents as a combination of the semantic distances between their words: it takes the Euclidean distance between the word vectors of any pair of words drawn from the two documents and forms a weighted sum of these distances. WMD builds on word2vec and measures text similarity by calculating the distances between the words of the texts.

S241. Normalize the number of occurrences of the words in document Q and document $D_s$. The word frequency of the i-th word in document Q, written here as $d_i$, and the word frequency of the j-th word in document $D_s$, written here as $d'_j$, are:

$$d_i = \frac{c_i}{\sum_{k=1}^{m} c_k}, \qquad d'_j = \frac{c'_j}{\sum_{k=1}^{n} c'_k}$$

where m and n are the numbers of words in document Q and document $D_s$ respectively, and $c_i$ and $c'_j$ are the numbers of occurrences of the i-th word in document Q and of the j-th word in document $D_s$ respectively;

S242. Calculate the Euclidean distance $C_{i,j}$ between word i from document Q and word j from document $D_s$:

$$C_{i,j} = \lVert x_i - x_j \rVert_2$$

where $x_i$ and $x_j$ are the embedding vectors learned for words i and j;
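The pairwise distances of step S242 form an m-by-n cost matrix. The sketch below assumes the embeddings are already available as a dictionary from word to vector; the two-dimensional vectors are made up for illustration and are not word2vec output.

```python
import math
from typing import Dict, List, Sequence


def cost_matrix(words_q: List[str],
                words_d: List[str],
                emb: Dict[str, Sequence[float]]) -> List[List[float]]:
    """Entry [i][j] is the Euclidean distance between word i of Q and word j of D_s."""
    return [[math.dist(emb[wi], emb[wj]) for wj in words_d] for wi in words_q]


# Toy embeddings, chosen so the distances are easy to check by hand.
emb = {"bank": (0.0, 0.0), "statement": (3.0, 4.0), "export": (0.0, 1.0)}
C = cost_matrix(["bank", "statement"], ["bank", "export"], emb)
print(C[0])  # [0.0, 1.0]
print(C[1][0])  # 5.0
```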

S243. Use a dynamic programming algorithm to solve for the WMD distance between document Q and each document $D_s$ in the RPA project asset library:

$$f = \min_{T \ge 0} \sum_{i=1}^{m} \sum_{j=1}^{n} T_{i,j}\, C_{i,j} \quad \text{subject to} \quad \sum_{j=1}^{n} T_{i,j} = d_i \ (i = 1, \dots, m), \qquad \sum_{i=1}^{m} T_{i,j} = d'_j \ (j = 1, \dots, n)$$

where $T_{i,j}$ is the weight with which word i in document Q is mapped to word j in document $D_s$, and $C_{i,j}$ is the distance between word i in document Q and word j in document $D_s$. Note that the weights mapping the i-th word of document Q to all words of a document in $D_s$ sum to the word frequency $d_i$ of that word; likewise, the weights mapping all words of document Q to the j-th word of the document in $D_s$ sum to the word frequency $d'_j$ of that word. The smaller the value of f, the more similar the two documents.
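The minimization in S243 is a transportation problem, so it can be handed to a general linear-programming solver. The sketch below does that with `scipy.optimize.linprog`; this is a substitution chosen for brevity, not the dynamic programming procedure the text refers to, and the use of NumPy and SciPy is itself an assumption. The function name `wmd` and the sample inputs are illustrative.

```python
import numpy as np
from scipy.optimize import linprog


def wmd(d_q, d_d, C) -> float:
    """Minimal total cost of moving word mass d_q (length m) onto d_d (length n).

    C is the m x n matrix of word-to-word distances. The flow matrix T is
    flattened row by row, so T[i, j] is variable number i * n + j.
    """
    d_q = np.asarray(d_q, dtype=float)
    d_d = np.asarray(d_d, dtype=float)
    C = np.asarray(C, dtype=float)
    m, n = C.shape
    # Row sums: sum_j T[i, j] = d_q[i]
    a_rows = np.kron(np.eye(m), np.ones((1, n)))
    # Column sums: sum_i T[i, j] = d_d[j]
    a_cols = np.kron(np.ones((1, m)), np.eye(n))
    res = linprog(C.ravel(),
                  A_eq=np.vstack([a_rows, a_cols]),
                  b_eq=np.concatenate([d_q, d_d]),
                  bounds=(0, None),
                  method="highs")
    if not res.success:
        raise ValueError(res.message)
    return float(res.fun)


# Two words per document, each with frequency 0.5.
f = wmd([0.5, 0.5], [0.5, 0.5], [[0.0, 1.0], [5.0, 18 ** 0.5]])
print(round(f, 4))  # approximately 2.1213
```

In this example the cheapest plan keeps each word paired with its nearest counterpart, giving 0.5 * 0 + 0.5 * sqrt(18); the alternative pairing would cost 0.5 * 1 + 0.5 * 5 = 3.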

S244. Calculate the similarity between document Q and document $D_s$ (the similarity formula is not preserved in this text).

S245. Set a similarity threshold, and according to the similarity between document Q and each document $D_s$, obtain the k RPA project asset documents that fall below the threshold, together with the knowledge graphs and flowcharts corresponding to those k projects.
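The screening in S245 can be sketched as a filter followed by a sort. Because the similarity formula of S244 is not available here, this sketch applies the threshold directly to the WMD distance f (smaller means more similar); that choice, the function name `select_candidates`, and the project names and values are all assumptions for illustration.

```python
from typing import Dict, List, Tuple


def select_candidates(distances: Dict[str, float],
                      threshold: float,
                      k: int) -> List[Tuple[str, float]]:
    """Keep projects whose distance is below `threshold`, closest first, at most k."""
    kept = [(name, f) for name, f in distances.items() if f < threshold]
    kept.sort(key=lambda item: item[1])
    return kept[:k]


# Hypothetical WMD distances between the requirement document and three asset documents.
distances = {
    "company_b_statement_download": 0.8,
    "invoice_entry": 2.4,
    "payroll_export": 1.5,
}
print(select_candidates(distances, threshold=2.0, k=2))
# [('company_b_statement_download', 0.8), ('payroll_export', 1.5)]
```

Each surviving project name would then be used to look up its stored knowledge graph and flowchart for the subgraph matching stage.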

In a further embodiment, refer to Fig. 2 and Fig. 3. Fig. 2 is the user requirement knowledge graph for Agricultural Bank of China online banking statement download constructed in an embodiment of the present invention, and Fig. 3 is the knowledge graph of the RPA asset library project "Company B online banking statement download" constructed in an embodiment of the present invention. The user's requirement document covers Agricultural Bank of China online banking statements and related items, and can be matched to similar completed project documents in the RPA asset library, such as Company B's online banking statement download, which includes China Construction Bank and Industrial and Commercial Bank of China online banking statement downloads.

S3. Construct a query knowledge graph from the user requirement document, and set a start node. As shown in Fig. 2, the user's requirement is to download Agricultural Bank of China online banking statements, which comprises the sub-steps of Agricultural Bank of China online banking USB-key (U-shield) login, online banking statement export, and online banking statement data conversion.
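One simple way to hold such a query knowledge graph is as a set of (head, relation, tail) triples plus a designated start node. The class below is a hypothetical sketch: the `KnowledgeGraph` name, the `includes` relation label, and the node labels are illustrative, and the actual graph in Fig. 2 may contain further nodes and relations.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


class KnowledgeGraph:
    """Directed, edge-labelled graph with a designated start node."""

    def __init__(self, triples: Iterable[Triple], start: str) -> None:
        self.start = start
        self.out_edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
        for head, relation, tail in triples:
            self.out_edges[head].append((relation, tail))

    def tails(self, head: str, relation: str) -> List[str]:
        """Tail entities reached from `head` over edges labelled `relation`."""
        return [t for r, t in self.out_edges.get(head, []) if r == relation]


root = "ABC online banking statement download"
query = KnowledgeGraph(
    triples=[
        (root, "includes", "U-shield login"),
        (root, "includes", "statement export"),
        (root, "includes", "statement data conversion"),
    ],
    start=root,
)
print(query.tails(query.start, "includes"))
# ['U-shield login', 'statement export', 'statement data conversion']
```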

S4. Search and match the subgraphs to be matched against the query knowledge graph to obtain the final best-matching graph, and return the corresponding flowchart. Specifically:

S41. Use the TALE approximate large-graph matching tool to search the subgraphs to be matched for the best possible match to the query knowledge graph.

S42. If step S41 finds a best match, return the matched candidate subgraph and its corresponding flowchart; otherwise, execute step S43.

S43. Partition the graph along the edges of the current start node whose attribute is the "includes" relation to obtain a set of subgraphs, and repeat steps S41 and S42, with the start node updated to the tail entity of the current node, until a best match is obtained or the set of subgraphs is empty.
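The control flow of S41 to S43 can be sketched as a loop over start nodes. Everything concrete here is an assumption: `toy_match` is a deliberately naive stand-in for TALE (it only measures the share of query edges that also appear in a candidate), `min_score` is a hypothetical acceptance threshold, and the triples are made-up examples. Only the overall pattern of matching, then splitting along "includes" edges and retrying from the tail entities, comes from the text.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def reachable(triples: Set[Triple], start: str) -> Set[Triple]:
    """Triples of the subgraph reachable from `start` by following edge direction."""
    seen, stack, sub = {start}, [start], set()
    while stack:
        node = stack.pop()
        for h, r, t in triples:
            if h == node:
                sub.add((h, r, t))
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
    return sub


def toy_match(query: Set[Triple], candidate: Set[Triple]) -> float:
    """Naive stand-in for TALE: share of query edges also present in the candidate."""
    return len(query & candidate) / len(query) if query else 0.0


def search(query: Set[Triple], start: str, candidates: Dict[str, Set[Triple]],
           min_score: float = 0.6,
           match: Callable[[Set[Triple], Set[Triple]], float] = toy_match
           ) -> Optional[Tuple[str, str]]:
    """Return (candidate name, start node that matched), or None if nothing matches."""
    frontier: List[str] = [start]
    visited = {start}
    while frontier:  # stops once the set of subgraphs is exhausted
        node = frontier.pop(0)
        sub = reachable(query, node)  # S41: match the current (sub)graph
        scored = [(match(sub, g), name) for name, g in candidates.items()]
        score, name = max(scored, default=(0.0, None))
        if name is not None and score >= min_score:
            return name, node  # S42: best match found
        # S43: split along "includes" edges and continue from their tail entities
        for h, r, t in sorted(query):
            if h == node and r == "includes" and t not in visited:
                visited.add(t)
                frontier.append(t)
    return None


query = {
    ("download", "includes", "login"),
    ("download", "includes", "export"),
    ("export", "includes", "choose date range"),
    ("export", "includes", "save file"),
}
candidates = {
    "company_b": {
        ("export", "includes", "choose date range"),
        ("export", "includes", "save file"),
        ("export", "includes", "rename file"),
    },
    "invoice": {("entry", "includes", "scan invoice")},
}
print(search(query, "download", candidates))
# ('company_b', 'export')
```

In the example the whole query graph matches only half of its edges, so the search falls back to the subgraph rooted at "export", which the `company_b` candidate covers completely.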

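This embodiment also provides a flowchart generation device based on semantic similarity and subgraph matching, comprising:

a processor;

a memory on which is stored a computer program executable on the processor;

wherein, when the computer program is executed by the processor, the flowchart generation method based on semantic similarity and subgraph matching is implemented.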

In the technical solution proposed by the present invention, the WMD algorithm is used to measure the similarity between the user requirement document and the RPA project asset library documents, finding projects that are similar to the current user requirement so as to narrow the scope of the subsequent subgraph matching search and improve efficiency.

Graph partitioning is combined with subgraph matching to search iteratively for the best-matching subgraph. Fuzzy subgraph matching is used to search the RPA project asset knowledge graphs for the best match to the user requirement knowledge graph. Fuzzy subgraph matching tolerates some unmatched nodes and some missing edges; after the relevant flowchart has been recommended, it can be corrected manually. When no match is found, the edge-partitioning idea from graph partitioning is applied while taking the characteristics of the user requirement knowledge graph into account: the partition is restricted to edges with the "includes" relation, and the best-matching subgraph is then searched for again on the resulting subgraphs.

Semantic similarity and subgraph matching are used together to generate the flowchart. First, semantic similarity is used for a first round of search, which yields historical projects that are relatively similar to the user requirement and reduces the search space of the second-round subgraph matching. Second, fuzzy subgraph matching is used for a second round of search, which finds the historical project that is similar to the user requirement in both semantics and structure as the best-matching project, and finds the corresponding flowchart.

The above description of the disclosed embodiments enables any person skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A flowchart generation method based on semantic similarity and subgraph matching, characterized by comprising the following steps:

S1. Obtain a user requirement document and RPA project asset library documents;

S2. Calculate the semantic similarity between the user requirement document and the RPA project asset library documents, and rank by semantic similarity from high to low to obtain the top-k subgraphs to be matched;

S3. Construct a query knowledge graph from the user requirement document, and set a start node;

S4. Search and match the subgraphs to be matched against the query knowledge graph to obtain the final best-matching graph, and return the corresponding flowchart.

2. The flowchart generation method based on semantic similarity and subgraph matching according to claim 1, wherein step S2 is specifically:

S21. Perform word segmentation on the user requirement document Q and the RPA project asset library user documents $D_s$, and remove stop words;

S22. Describe each document by the frequency with which words occur in its text, representing the user requirement document Q and each RPA project asset library user document $D_s$ as a one-dimensional vector;

S23. Based on the one-dimensional vectors obtained in S22, learn an embedding vector for each word of the user requirement document Q and for each word of the RPA project asset library user documents $D_s$;

S24. Use the WMD algorithm to calculate the text similarity between the user requirement document Q and the RPA project asset library user documents $D_s$.

3. The flowchart generation method based on semantic similarity and subgraph matching according to claim 2, characterized in that, in step S21, the word segmentation component jieba is used to segment the user requirement document Q and the RPA project asset library user documents $D_s$.

4. The flowchart generation method based on semantic similarity and subgraph matching according to claim 2, characterized in that, in step S22, the normalized bag-of-words (nBOW) model is used to represent Q and $D_s$ respectively.

5. The flowchart generation method based on semantic similarity and subgraph matching according to claim 2, characterized in that, in step S23, the word2vec method is used to learn the embedding vector of each word of the user requirement document Q and of each word of the RPA project asset library user documents $D_s$.

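6. The flowchart generation method based on semantic similarity and subgraph matching according to claim 2, wherein step S24 is specifically:

S241. Normalize the number of occurrences of the words in document Q and document $D_s$. The word frequency of the i-th word in document Q, written here as $d_i$, and the word frequency of the j-th word in document $D_s$, written here as $d'_j$, are:

$$d_i = \frac{c_i}{\sum_{k=1}^{m} c_k}, \qquad d'_j = \frac{c'_j}{\sum_{k=1}^{n} c'_k}$$

where m and n are the numbers of words in document Q and document $D_s$ respectively, and $c_i$ and $c'_j$ are the numbers of occurrences of the i-th word in document Q and of the j-th word in document $D_s$ respectively;

S242. Calculate the Euclidean distance $C_{i,j}$ between word i from document Q and word j from document $D_s$:

$$C_{i,j} = \lVert x_i - x_j \rVert_2$$

where $x_i$ and $x_j$ are the embedding vectors learned for words i and j;

S243. Use a dynamic programming algorithm to solve for the WMD distance between document Q and document $D_s$:

$$f = \min_{T \ge 0} \sum_{i=1}^{m} \sum_{j=1}^{n} T_{i,j}\, C_{i,j} \quad \text{subject to} \quad \sum_{j=1}^{n} T_{i,j} = d_i \ (i = 1, \dots, m), \qquad \sum_{i=1}^{m} T_{i,j} = d'_j \ (j = 1, \dots, n)$$

where f is the WMD distance between document Q and document $D_s$, and $T_{i,j}$ is the weight with which word i in document Q is mapped to word j in document $D_s$;

S244. Calculate the similarity between document Q and document $D_s$ (the similarity formula is not preserved in this text);

S245. Set a similarity threshold, and according to the similarity between document Q and each document $D_s$, obtain the k RPA project asset documents that fall below the threshold, together with the knowledge graphs and flowcharts corresponding to those k projects.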

7. The flowchart generation method based on semantic similarity and subgraph matching according to claim 1, wherein step S4 is specifically:

S41. Use the TALE approximate large-graph matching tool to search the subgraphs to be matched for the best possible match to the query knowledge graph;

S42. If step S41 finds a best match, return the matched candidate subgraph and its corresponding flowchart; otherwise, execute step S43;

S43. Partition the graph along the edges of the current start node whose attribute is the "includes" relation to obtain a set of subgraphs, and repeat steps S41 and S42, with the start node updated to the tail entity of the current node, until a best match is obtained or the set of subgraphs is empty.

8. A flowchart generation device based on semantic similarity and subgraph matching, characterized in that the device comprises:

a processor;

a memory on which is stored a computer program executable on the processor;

wherein, when the computer program is executed by the processor, the flowchart generation method based on semantic similarity and subgraph matching according to any one of claims 1 to 7 is implemented.