CN118551760B - Purchasing file compliance checking system based on difference algorithm under AI large model - Google Patents
- Publication number
- CN118551760B (application CN202411012102.5A)
- Authority
- CN
- China
- Prior art keywords
- words
- procurement
- illegal
- key words
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Medical Informatics (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Machine Translation (AREA)
Abstract
Description
Technical Field
The present invention belongs to the technical field of document auditing and relates to a procurement document compliance checking system based on a difference algorithm under an AI large model.
Background Art
Compliance is an essential element of an enterprise's procurement process. Procurement documents must comply with national laws and regulations, government policy requirements, and the enterprise's internal procurement procedures and rules. However, because procurement documents are lengthy, span many fields, and are subject to legal, regulatory, financial, tax, confidentiality and other requirements, manual compliance checking is often time-consuming and laborious, and is prone to omissions and errors.
To this end, a series of methods for checking procurement document compliance have been introduced. These methods infer whether a document is compliant by identifying the violation words in it, which reduces the cost of manual checking. However, they rely heavily on the identified violation words and pay little attention to the words collocated with them, so the part-of-speech judgment of a violation word is not accurate enough and checking efficiency suffers. A word usually constitutes a violation only in a particular context: the same word may be a violation in one type of procurement document but not in another. Therefore, to improve the accuracy of procurement compliance checking, the key words collocated with a violation word must also be identified in order to confirm whether the violation word really is a violation.
Summary of the Invention
To solve the above problems in the prior art, the present invention provides a procurement document compliance checking system based on a difference algorithm under an AI large model.
The purpose of the present invention can be achieved through the following technical solutions:
The present application provides a procurement document compliance checking system based on a difference algorithm under an AI large model, comprising a file upload module, a word segmentation module, a violation word analysis module and a report generation module, which are communicatively connected, wherein:
The file upload module is configured to upload and recognize the procurement document to be checked;
The word segmentation module is configured to identify and extract the key words in the procurement document to be checked;
The violation word analysis module is configured to input the key words into a preset violation word analysis unit and output the key words determined to be violation words; the violation word analysis unit is constructed by the following steps:
S1. Collect a large number of historical procurement document data samples that contain violation words;
S2. Identify the sentence or paragraph in which each violation word is located, and extract the key words in that sentence or paragraph other than the violation word;
S3. With the key words as explanatory variables and the violation words as response variables, construct a random forest model to quantify the relationship between each violation word and its corresponding key words;
Inputting the key words into the preset violation word analysis unit and outputting the key words determined to be violation words comprises the following steps:
Determine the procurement document type according to the procurement information of the procurement document to be checked;
Retrieve the violation words associated with that procurement document type;
Input the key words of the procurement document to be checked into the preset violation word analysis unit, and output the predicted violation word for each sentence or paragraph;
Compare the predicted violation words with the violation words associated with the procurement document type; when a predicted violation word is the same as an associated violation word, the sentence or paragraph containing the predicted violation word is determined to be in violation;
The report generation module is configured to generate a compliance check report for the procurement document according to the key words output as violation words.
Furthermore, in the word segmentation module, identifying and extracting the key words in the procurement document to be checked comprises the following steps:
T1. Data preprocessing: preprocess the procurement document to be checked, the preprocessing including removing redundant punctuation marks, stop words and noise characters;
T2. Entity recognition: use entity recognition technology to identify the procurement information of the procurement document, the procurement information including supplier information, purchased items and purchase price;
T3. Relation extraction: identify the key words that have a dependency relationship with the procurement information.
Furthermore, in step S3, constructing the random forest model comprises the following steps:
S31. Encode the key words according to their order in the sentence or paragraph, converting the key words into a word sequence vector;
S32. Use the word sequence vectors as features and the violation words as labels, and divide the procurement document data samples into a training set and a test set;
S33. Random sampling: randomly draw a certain number of data samples from the training set with replacement, constructing multiple training sets that contain different data samples;
S34. Decision tree construction: build a decision tree model for each randomly sampled training set;
S35. Decision tree ensemble: combine the constructed decision tree models into a random forest model, and use a voting mechanism to make the final prediction;
S36. Model evaluation: use evaluation metrics to assess the predictive performance of the random forest model on the test set.
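Steps S33 and S35 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the patented implementation: the function names, the toy training rows and the labels are all hypothetical, and each "tree" is represented simply as a callable that maps a feature vector to a label.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """S33: draw len(rows) samples from rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """S35: return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

def forest_predict(trees, features):
    """Ask every tree for a prediction and combine the answers by voting."""
    return majority_vote([tree(features) for tree in trees])

# Hypothetical training rows: (word sequence vector, violation word label).
rng = random.Random(42)
training_set = [([1, 0], "exclusive"), ([0, 1], "designated"), ([1, 1], "exclusive")]

# One resampled training set per tree; a real system would fit a CART tree on each.
resampled_sets = [bootstrap_sample(training_set, rng) for _ in range(3)]
```

Because each resampled set omits some rows and repeats others, the trees fitted on them differ, which is what makes the vote in S35 more stable than any single tree.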
Furthermore, in step S31, encoding the key words according to their order in the sentence or paragraph and converting them into a word sequence vector is performed using one-hot encoding, a bag-of-words model or a word embedding model.
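As a rough sketch of two of the encodings named in step S31, the following shows an order-preserving index sequence and a bag-of-words count vector. The vocabulary construction, the padding convention (index 0) and the example key words are assumptions made for illustration; the source does not specify them.

```python
def build_vocab(sentences):
    """Assign each distinct key word an integer id; 0 is reserved for padding/unknown."""
    vocab = {}
    for words in sentences:
        for word in words:
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode_sequence(words, vocab, length):
    """Order-preserving encoding: ids in sentence order, truncated or zero-padded to length."""
    ids = [vocab.get(word, 0) for word in words][:length]
    return ids + [0] * (length - len(ids))

def bag_of_words(words, vocab):
    """Order-free encoding: one count per vocabulary entry."""
    counts = [0] * len(vocab)
    for word in words:
        if word in vocab:
            counts[vocab[word] - 1] += 1
    return counts

# Hypothetical key words extracted from two sentences.
vocab = build_vocab([["supplier", "exclusive", "brand"], ["brand", "price"]])
sequence_vector = encode_sequence(["brand", "price"], vocab, length=4)
count_vector = bag_of_words(["brand", "brand", "price"], vocab)
```

The sequence encoding keeps word order, which matches the wording of S31; bag-of-words discards order and would be the simpler baseline.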
Furthermore, in step S32, when the procurement document data samples are divided into a training set and a test set, 70% of the data samples are used as the training set and 30% as the test set.
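The 70/30 split can be sketched in a few lines. The shuffling step and the fixed seed are assumptions added here so that the split is random but reproducible; the source only states the proportions.

```python
import random

def train_test_split(samples, train_ratio=0.7, seed=0):
    """Shuffle a copy of the samples and cut it at train_ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Ten placeholder samples give 7 for training and 3 for testing.
train_set, test_set = train_test_split(range(10))
```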
Furthermore, in step S34, the decision tree model is configured as a CART model.
Furthermore, in step S36, the evaluation metrics include accuracy, precision, recall or F1 score.
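For reference, the four metrics in step S36 can be computed as below. Since the labels here are violation words (a multi-class setting), this sketch computes precision, recall and F1 for one chosen label at a time; that per-label treatment and the example labels are assumptions, not something the source prescribes.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one label treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical test-set labels and predictions.
y_true = ["A", "A", "B", "B", "B"]
y_pred = ["A", "B", "B", "B", "A"]
```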
Furthermore, in the report generation module, the compliance check report includes the procurement information, a problem description of each sentence or paragraph generated from the violation word it contains, and modification suggestions.
Furthermore, the violation word analysis module also uses semantic recognition technology to compare the sentences or paragraphs containing the same key word in procurement documents to be checked that use different templates; when the meanings of the sentences or paragraphs containing the same key word are inconsistent, the system reports an error.
Furthermore, the report generation module is also trained on the user's historical records of corrections to violation words, optimizing the violation word check results so as to mark the key violating sentences and passages.
Furthermore, the word segmentation module also uses a difference algorithm to compare the sentences or paragraphs containing the same key word in the procurement document to be checked and in the stored procurement documents, and determines and marks the differences in those sentences or paragraphs.
Beneficial effects of the present invention:
1) By identifying violation words together with the other key words in the sentence or paragraph where each violation word is located, and establishing the relationship between the key words and the violation words, the invention solves the problem in the prior art that words collocated with violation words are not identified, which makes the part-of-speech judgment of violation words insufficiently accurate;
2) For different procurement document types, by identifying the procurement information of the procurement document and extracting the key words that have a dependency relationship with it, the invention defines the key words of violation words for each procurement document type and improves the part-of-speech judgment of violation words;
3) A random forest model is used when establishing the relationship between key words and violation words, which improves the accuracy with which that relationship is quantified;
4) The key words are encoded according to their order in the sentence or paragraph and converted into word sequence vectors, which makes the key word features more prominent and improves the accuracy of model prediction.
Brief Description of the Drawings
To facilitate understanding by those skilled in the art, the present invention is further described below with reference to the accompanying drawings.
FIG. 1 is a flow chart of the procurement document compliance checking system based on a difference algorithm under an AI large model according to the present invention.
Detailed Description
To further explain the technical means adopted by the present invention to achieve its intended purpose and the resulting effects, the specific implementations, structures, features and effects of the present invention are described in detail below with reference to the accompanying drawings and preferred embodiments.
Referring to FIG. 1, the present application provides a procurement document compliance checking system based on a difference algorithm under an AI large model, comprising a file upload module, a word segmentation module, a violation word analysis module and a report generation module, which are communicatively connected, wherein:
The file upload module is configured to upload and recognize the procurement document to be checked;
In this embodiment, the system adopts a B/S (browser/server) architecture and supports multiple users uploading procurement documents online at the same time. The interface is simple and intuitive and the operation flow is clear, so users can get started quickly. Through the graphical interface, users can upload, view, review and export files. Procurement documents are uploaded through the software interface, and multiple file formats (such as Word and Excel) are supported. The system's file recognition function automatically extracts the procurement information in the file, such as supplier information, purchased items and prices, to provide data support for the subsequent audit work.
The word segmentation module is configured to identify and extract the key words in the procurement document to be checked;
Furthermore, in the word segmentation module, identifying and extracting the key words in the procurement document to be checked comprises the following steps:
T1. Data preprocessing: preprocess the procurement document to be checked, the preprocessing including removing redundant punctuation marks, stop words and noise characters;
In this embodiment, the procurement document can be preprocessed to remove redundant punctuation marks, stop words and noise characters by the following steps:
1) Remove punctuation: use regular expressions or string processing functions to remove punctuation marks such as commas, periods and exclamation marks from the text. This retains the key words and content while discarding symbols that carry no information.
2) Remove stop words: stop words are words that appear frequently in text but usually carry little meaning, such as "的", "和" and "是". A stop word list (such as the default stop word list provided by the NLTK library) or a custom stop word list can be used to remove these words, reducing noise so that the procurement information can be extracted.
3) Remove noise characters: use regular expressions or string processing functions to remove noise characters from the text, such as HTML tags and special characters like # and @. This cleans the text and improves the results of subsequent processing.
Note that the punctuation marks, stop words and noise characters to be removed should be chosen according to the specific requirements and scenario, so that important information is not deleted by mistake. The preprocessed text should also be verified appropriately to ensure that the integrity and readability of the procurement document are not affected.
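The three preprocessing steps above can be sketched with Python's `re` module. The stop word set below is a small hypothetical English list used only for illustration, and the punctuation rule deliberately keeps separators that sit between two digits (as in `12.50`), which is one way to avoid deleting price information by mistake.

```python
import re

# Hypothetical stop word list; a real deployment would load a domain-specific one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to"}

def preprocess(text):
    """Remove noise characters, punctuation and stop words; return the remaining tokens."""
    text = re.sub(r"<[^>]+>", " ", text)   # noise: HTML tags
    text = re.sub(r"[#@]", " ", text)      # noise: special characters
    # Punctuation is removed unless it has a digit on both sides (e.g. 12.50).
    text = re.sub(r"(?<!\d)[^\w\s]|[^\w\s](?!\d)", " ", text)
    return [token for token in text.lower().split() if token not in STOP_WORDS]

tokens = preprocess("<p>The supplier shall deliver 100 units, at 12.50 yuan each! #urgent</p>")
```

For Chinese text the whitespace split would have to be replaced by a word segmenter, which is outside the scope of this sketch.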
T2. Entity recognition: use entity recognition technology to identify the procurement information of the procurement document, the procurement information including supplier information, purchased items and purchase price;
T3. Relation extraction: identify the key words that have a dependency relationship with the procurement information.
In this embodiment, when entity recognition technology is used to identify the procurement information of a procurement document, attention can be focused on important content such as supplier information, purchased items and purchase price. Entity recognition automatically labels this procurement information in the document so that it is easy to identify and extract. It is also important to understand the dependent key words associated with the procurement information, because they help explain how items of procurement information relate to and affect one another. For example, supplier information may be associated with delivery capability and quality assurance for the purchased items, and the purchase price may be associated with the quantity and specifications of the purchased items. To identify the procurement information and its dependent key words, entity relation extraction technology can be used. This technology annotates procurement documents in order to model and label the procurement information and the dependency relationships between its items. A trained entity relation extraction model can then identify and extract these relationships automatically. The overall process includes data preparation, entity recognition, entity relation annotation, training of the entity relation extraction model, and extraction of entity relations. The model can be optimized and customized for specific domains and tasks to improve accuracy and adaptability.
Note that the performance of entity relation extraction is affected by many factors, including data quality, annotation accuracy and the diversity of the training data. In practical applications, the model therefore needs to be carefully tuned and optimized to ensure that entity relation extraction is accurate and reliable.
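To make the output of step T2 concrete, here is a rule-based stand-in for entity recognition. It is not a trained entity recognition model: the field labels (`Supplier:`, `Item:`), the currency prefixes and the sample text are all hypothetical, and real procurement documents would need a learned model or far richer rules.

```python
import re

# Hypothetical patterns for a document that labels its fields explicitly.
PATTERNS = {
    "supplier": re.compile(r"Supplier:\s*([^;\n]+)"),
    "item": re.compile(r"Item:\s*([^;\n]+)"),
    "price": re.compile(r"(?:RMB|CNY|¥)\s?(\d[\d,]*(?:\.\d+)?)"),
}

def extract_procurement_info(text):
    """Return the first match of each procurement field found in the text."""
    info = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            info[field] = match.group(1).strip()
    return info

sample = "Supplier: Example Trading Co.; Item: laser printer; Price: RMB 1,200.50"
procurement_info = extract_procurement_info(sample)
```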
Furthermore, the word segmentation module also uses a difference algorithm to compare the sentences or paragraphs containing the same key word in the procurement document to be checked and in the stored procurement documents, and determines and marks the differences in those sentences or paragraphs.
In this embodiment, a difference algorithm is a computer algorithm for comparing two or more texts. It determines their similarities and differences by comparing characters, words, sentences or other text units. Difference algorithms are used in many applications, including version control, document comparison and text merging. In text comparison, a difference algorithm can identify the parts that were added, deleted or modified between two texts. One of the most common difference algorithms is the longest common subsequence (LCS) algorithm, which finds the longest subsequence shared by two texts; once that subsequence is found, the algorithm can determine which parts are the same and which are different. Another common family is line-based difference algorithms, such as line diff or the revision difference algorithm (diff). These compare text line by line and indicate which lines were added, deleted or modified. Difference algorithms can also provide more sophisticated comparisons, such as comparison based on text structure and comparison at the semantic level, which can identify deeper differences between texts such as sentence reordering and semantic rewriting. In general, difference algorithms help identify and understand the differences between texts, supporting tasks such as text processing, version control and document comparison. In procurement document auditing, a difference algorithm can be used to detect information errors, inconsistencies and plagiarism, improving the quality and accuracy of procurement documents:
First, the difference algorithm helps find information errors and inconsistencies in documents. Across multiple documents, the sentences or paragraphs describing the same key word may differ, for example through inconsistent figures, inappropriate wording or logical contradictions. The difference algorithm detects these differences so that procurement personnel can quickly identify and correct errors and keep procurement documents accurate and consistent.
Second, the difference algorithm can be used to detect plagiarism or unoriginal content. Although plagiarism may be uncommon in procurement documents, the difference algorithm can still find text that is repeated or modified across documents, helping to identify possible intellectual property issues. By comparing the differences in the sentences or paragraphs where key words are located, procurement personnel can avoid using unauthorized or insufficiently cited content and ensure the legitimacy and credibility of procurement documents.
Third, the difference algorithm helps find gaps and omissions in documents. Across multiple documents, information about the same key word may be omitted or missing. Difference checking reveals the differences between documents, identifies the missing information, and shows procurement personnel what needs to be supplemented. This helps ensure that procurement documents contain complete information and reduces the risks caused by omissions.
Fourth, the difference algorithm can provide accurate summary information. By comparing the differences in the sentences or paragraphs where key words are located, consolidated overview and summary reports can be generated. This information helps procurement personnel understand the similarities and differences between multiple documents and provides a more complete and accurate basis for decision-making and evaluation.
Finally, the difference algorithm can assist procurement decision-making and evaluation. By comparing differences, procurement personnel can assess how proposals or providers differ and understand the particular strengths and weaknesses shown in each document. This helps them make better-informed decisions and evaluate whether a provider meets the procurement requirements and standards.
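As an illustration of the sentence-level comparison described above, the following sketch uses `difflib.SequenceMatcher` from the Python standard library to list the word-level changes between a stored sentence and an incoming one. Note that `SequenceMatcher` uses a matching-block heuristic related to, but not identical to, a strict LCS computation; the function name and the example sentences are hypothetical.

```python
import difflib

def sentence_diff(stored, incoming):
    """Return (operation, old words, new words) for every changed span."""
    old_words, new_words = stored.split(), incoming.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # tags are 'replace', 'delete' or 'insert'
            changes.append((tag, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2])))
    return changes

# Two sentences that share the key word "payment" but differ in their terms.
changes = sentence_diff(
    "payment is due within 30 days of delivery",
    "payment is due within 60 days of acceptance",
)
```

Each reported span is a candidate discrepancy for a reviewer to confirm; a figure changed from 30 to 60 is exactly the kind of inconsistency the first point above describes.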
The violation word analysis module is configured to input the key words into a preset violation word analysis unit and output the key words determined to be violation words; the violation word analysis unit is constructed by the following steps:
S1. Collect a large number of historical procurement document data samples that contain violation words;
S2. Identify the sentence or paragraph in which each violation word is located, and extract the key words in that sentence or paragraph other than the violation word;
S3. With the key words as explanatory variables and the violation words as response variables, construct a random forest model to quantify the relationship between each violation word and its corresponding key words;
In this embodiment, random forest is an ensemble learning method composed of multiple decision trees. Each decision tree is trained independently, and the ensemble prediction is obtained by voting or averaging.
The main features and steps of random forest are as follows:
1) Feature randomness: random forest introduces feature randomness when building each decision tree. At each node split, only a randomly selected subset of the features is considered. This reduces the correlation between decision trees, increases the diversity of the model and lowers the risk of overfitting.
2) Sample randomness: the training samples for each decision tree are selected at random by bootstrap sampling (sampling with replacement). Each decision tree is therefore trained on a different sample set, which increases the diversity of the model.
3) Decision tree construction: each decision tree selects features and splits nodes according to a purity measure such as the Gini index or information gain. The depth of the decision tree can be controlled by adjusting parameters.
4) Ensemble prediction: for classification problems, random forest determines the final prediction by voting. For regression problems, it obtains the final prediction by averaging the outputs of the decision trees.
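Two of the ingredients listed above, the Gini purity measure and the random choice of candidate features at a split, can be sketched as follows. The function names are hypothetical, and taking the square root of the feature count as the subset size is a common heuristic assumed here, not something the source specifies.

```python
import math
import random
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions (0 means pure)."""
    if not labels:
        return 0.0
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted Gini impurity of a candidate split into two child nodes."""
    total = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / total

def random_feature_subset(n_features, rng):
    """Feature randomness: pick about sqrt(n_features) candidate features for a split."""
    k = max(1, math.isqrt(n_features))
    return sorted(rng.sample(range(n_features), k))

candidates = random_feature_subset(9, random.Random(0))
```

A tree builder would evaluate `split_impurity` only for splits on the features in `candidates` and keep the split with the lowest value.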
The random forest model has the following advantages:
1) Robustness: random forest can handle high-dimensional data with a large number of features, and is robust to outliers and noise.
2) Interpretability: random forest provides feature importance scores, which help explain how much each feature contributes to the model's predictions.
3) Resistance to overfitting: by introducing feature randomness and sample randomness, random forest reduces model variance, improves generalization and lowers the risk of overfitting.
Inputting the key words into the preset violation word analysis unit and outputting the key words determined to be violation words comprises the following steps:
Determine the procurement document type according to the procurement information of the procurement document to be checked;
Retrieve the violation words associated with that procurement document type;
Input the key words of the procurement document to be checked into the preset violation word analysis unit, and output the predicted violation word for each sentence or paragraph;
Compare the predicted violation words with the violation words associated with the procurement document type; when a predicted violation word is the same as an associated violation word, the sentence or paragraph containing the predicted violation word is determined to be in violation;
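The final comparison step can be sketched as a set lookup. The document types, the violation word lists and the segment identifiers below are invented for illustration; in the described system the predictions would come from the violation word analysis unit.

```python
# Hypothetical violation word lists, keyed by procurement document type.
VIOLATION_WORDS_BY_TYPE = {
    "goods": {"designated brand", "sole supplier"},
    "services": {"local company only"},
}

def flag_segments(document_type, predictions):
    """Return the segments whose predicted word is a violation for this document type.

    predictions is a list of (segment_id, predicted_violation_word) pairs.
    """
    related = VIOLATION_WORDS_BY_TYPE.get(document_type, set())
    return [segment for segment, word in predictions if word in related]

predictions = [("sentence-1", "designated brand"), ("sentence-2", "local company only")]
flagged_for_goods = flag_segments("goods", predictions)
flagged_for_services = flag_segments("services", predictions)
```

The same set of predictions yields different flagged segments for the two document types, which mirrors the point that a word may be a violation in one type of procurement document but not in another.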
Existing procurement document verification systems generally require the procurement document to be structured within the system, after which specified keywords are identified at specified positions for verification. This approach is difficult to generalize, because procurement document templates differ across industries and customers. This embodiment, by contrast, can also identify the sentences or paragraphs in which the keywords appear and use semantic recognition technology to compare whether the related passages express the same content; passages that address the same content but whose meanings are not fully consistent are reported as errors. With this processing, users can audit procurement documents built on different templates without structuring them in the system.
Furthermore, the illegal word analysis module also uses semantic recognition technology to compare the sentences or paragraphs that contain the same keyword in procurement documents of different templates. When the meanings of the sentences or paragraphs containing the same keyword are inconsistent, the system reports an error.
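The consistency check described above can be sketched as follows. The patent does not specify the semantic recognition technique, so this example substitutes a plain token-overlap (Jaccard) score as a crude stand-in; the function names and the threshold value are assumptions made for illustration.

```python
import re


def tokens(text):
    """Lowercase word tokens; a real system would use a semantic model instead."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Token-set overlap in [0, 1], used here as a rough proxy for similarity."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def inconsistent_passages(keyword, passages, threshold=0.6):
    """Pairs of passages that mention keyword but score below threshold."""
    hits = [p for p in passages if keyword.lower() in p.lower()]
    return [
        (a, b)
        for i, a in enumerate(hits)
        for b in hits[i + 1:]
        if jaccard(a, b) < threshold
    ]


passages = [
    "The bid bond is 20,000 yuan.",
    "The bid bond is 50,000 yuan.",
    "Delivery is due within 30 days.",
]
conflicts = inconsistent_passages("bid bond", passages, threshold=0.9)
```

The two bid bond passages differ only in the amount, so they fall below the strict threshold and are reported as a pair. A lexical score of this kind cannot tell a genuine contradiction from two unrelated mentions of the same keyword, which is why a semantic model would be needed in practice.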
The report generation module is used to generate a compliance check report for the procurement document based on the key words determined to be illegal words.
Furthermore, in step S3, the random forest model is built through the following steps:
S31. Encode the key words according to their order in the sentence or paragraph, converting the key words into word sequence vectors;
In this embodiment, the sentence or paragraph containing an illegal word in each data sample contains multiple key words, which usually combine with the illegal word in a particular order. Encoding the key words according to their order in the sentence or paragraph converts them from text into numerical form, and the ordered encoding yields a data sequence that expresses the characteristics of the key words. This makes the features more distinct and improves the accuracy of the model. Different encoding methods can be used to convert the key words into word sequence vectors, such as one-hot encoding, a bag-of-words model, or a word embedding model (for example Word2Vec or GloVe); the choice depends on the characteristics of the data and the requirements of the task.
Converting the key words into sequence vectors extracts their order within the sentence or paragraph and turns it into numerical features that can be quantified and compared. This helps the model capture the contextual relationships and semantic information of the key words and learn specific patterns from them. Encoding the order of the key words helps the model identify particular collocations and phrase structures, so that it better captures the semantics of the sentence or paragraph containing the illegal word and makes more accurate predictions or classifications.
Furthermore, in step S31, encoding the key words according to their order in the sentence or paragraph and converting them into word sequence vectors is specifically carried out using one-hot encoding, a bag-of-words model, or a word embedding model.
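Two of the encodings named in step S31 can be sketched as follows. The vocabulary construction and the function names are assumptions made for this example; the patent does not prescribe a particular implementation, and a word embedding model would require a trained model that is not shown here.

```python
def build_vocab(keyword_lists):
    """Map each distinct key word to an integer index, in order of first appearance."""
    vocab = {}
    for words in keyword_lists:
        for w in words:
            vocab.setdefault(w, len(vocab))
    return vocab


def one_hot_sequence(words, vocab):
    """One one-hot row per key word, with rows kept in sentence order."""
    rows = []
    for w in words:
        row = [0] * len(vocab)
        row[vocab[w]] = 1
        rows.append(row)
    return rows


def bag_of_words(words, vocab):
    """Count vector over the vocabulary; word order is not preserved."""
    counts = [0] * len(vocab)
    for w in words:
        counts[vocab[w]] += 1
    return counts


samples = [["supplier", "must", "brand"], ["brand", "must", "local"]]
vocab = build_vocab(samples)
sequence = one_hot_sequence(samples[1], vocab)
counts = bag_of_words(samples[1], vocab)
```

Note the design difference: the one-hot sequence keeps the key words in sentence order, as step S31 requires, whereas the bag-of-words vector records only how often each word occurs.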
S32. Use the word sequence vectors as features and the illegal words as labels, and divide the procurement document data samples into a training set and a test set, with 70% of the data samples used for training and 30% for testing;
S33. Random sampling: randomly draw a certain number of data samples from the training set, with replacement, to construct multiple training sets containing different data samples;
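Steps S32 and S33 can be sketched with the standard library as follows. The function names, the fixed seeds, and the choice to draw as many samples per bootstrap set as the training set contains are assumptions for illustration; the patent only states that "a certain number" of samples is drawn.

```python
import random


def train_test_split(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of samples and split it into training and test parts."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def bootstrap_sets(train, n_sets, seed=0):
    """Draw n_sets training sets from train, sampling with replacement."""
    rng = random.Random(seed)
    return [[rng.choice(train) for _ in train] for _ in range(n_sets)]


samples = list(range(10))  # placeholder for (feature vector, label) pairs
train, test = train_test_split(samples)
bags = bootstrap_sets(train, n_sets=3)
```

Because sampling is done with replacement, each bootstrap set may repeat some training samples and omit others, which is what gives the individual trees different views of the data.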
S34. Building the decision trees: construct a decision tree model for each randomly sampled training set;
Furthermore, in step S34, the decision tree model is configured as a CART model.
Classification and Regression Trees (CART) is a decision tree algorithm used for classification and regression analysis. CART recursively partitions the data set into smaller subsets, selecting at each node the feature and split point that minimize the Gini impurity (for classification) or the mean squared error (for regression), and finally produces a tree used for prediction. The CART model is easy to understand and highly interpretable, and it can handle nonlinear relationships and high-dimensional data. By repeatedly partitioning the data set, CART can capture complex patterns in the data and has some robustness to missing values and outliers. In a random forest, each decision tree is built with the CART algorithm. Specifically, each tree is an independent CART decision tree that is modeled on randomly selected features. The random forest makes its final decision by combining the predictions of multiple CART decision trees, which improves the accuracy and generalization ability of the overall model. Each tree is built on a different random subset of the training data, and only a random subset of the features is considered when selecting the best feature at each node; this randomness reduces the risk of overfitting. The random forest therefore draws on the strengths of the CART decision tree and combines the predictions of multiple trees to improve the stability and predictive ability of the model, and it has been widely and successfully applied in practice.
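The node-splitting rule that CART applies for classification can be sketched as follows. This is a simplified illustration under stated assumptions: it searches every feature exhaustively, whereas a tree inside a random forest would consider only a random subset of features at each node, and the function names and toy data are invented for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must send samples to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


rows = [[0, 1], [1, 1], [0, 0], [1, 0]]
labels = ["ok", "illegal", "ok", "illegal"]
split = best_split(rows, labels)
```

In this toy data the first feature separates the two classes perfectly, so the chosen split has a weighted Gini impurity of zero, while splitting on the second feature would leave both children evenly mixed.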
S35. Ensembling the decision trees: combine the constructed decision tree models into a random forest model and use a voting mechanism to make the final prediction;
It should be noted that each decision tree in a random forest is built on a different training data set and feature subset and therefore carries a degree of randomness. The final prediction can be determined by voting or by averaging. For classification problems, majority voting selects the class with the most votes as the final prediction; for regression problems, the predictions of the trees are averaged to produce the final output. Combining multiple decision trees into a random forest model effectively reduces the risk of overfitting and improves the robustness and accuracy of the model.
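The two aggregation rules described above can be sketched as follows; the function names and the example tree outputs are illustrative assumptions.

```python
from collections import Counter
from statistics import mean


def majority_vote(predictions):
    """Class label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Mean of the per-tree outputs, used for regression."""
    return mean(predictions)


tree_votes = ["illegal", "ok", "illegal"]
final_label = majority_vote(tree_votes)
final_value = average([0.2, 0.4, 0.6])
```

With three trees voting, two votes for "illegal" decide the classification; using an odd number of trees avoids ties in the two-class case.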
S36. Model evaluation: use evaluation metrics to assess the predictive performance of the random forest model on the test set.
Furthermore, in step S36, the evaluation metrics include accuracy, precision, recall, or the F1 score.
Commonly used metrics for evaluating the predictive performance of a random forest model are accuracy, precision, recall, and the F1 score, described below:
1) Accuracy: the proportion of samples the model predicts correctly out of all samples. Accuracy = number of correctly predicted samples / total number of samples. Accuracy gives an overall measure but can be misleading when the classes are imbalanced.
2) Precision: the proportion of samples predicted as positive that are actually positive. Precision = true positives / (true positives + false positives), where a true positive is a sample that is predicted positive and is actually positive. Precision measures how reliable the model's positive predictions are.
3) Recall: the proportion of actually positive samples that the model predicts as positive. Recall = true positives / (true positives + false negatives). Recall measures the model's ability to find the positive samples.
4) F1 score: the harmonic mean of precision and recall, which takes both into account. F1 = 2 * (precision * recall) / (precision + recall). A higher F1 score indicates a better balance between precision and recall.
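The four metrics above can be computed directly from the true and predicted labels, as in the following sketch. The function name, the choice of "illegal" as the positive class, and the convention of returning 0.0 when a denominator is zero are assumptions for this example.

```python
def binary_metrics(y_true, y_pred, positive="illegal"):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


y_true = ["illegal", "illegal", "ok", "ok", "illegal"]
y_pred = ["illegal", "ok", "ok", "illegal", "illegal"]
metrics = binary_metrics(y_true, y_pred)
```

In this example there are two true positives, one false positive and one false negative, so precision, recall and F1 all equal 2/3 while accuracy is 3/5.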
Furthermore, in the report generation module, the compliance check report includes the procurement information, problem descriptions of the sentences or paragraphs generated from the illegal words, and modification suggestions.
In this embodiment, the content and structure of the compliance check report are roughly as follows:
1) Introduction:
The introduction briefly presents the purpose and background of the compliance check and states the content and scope of the report. It also gives basic information about the check, such as its time, scope, and participants.
2) Procurement information:
This part lists the procurement information, including the products or services procured, supplier information, procurement time, and procurement amount. It should provide enough detail for the user to understand the context and background of the check.
3) Problem descriptions of the sentences or paragraphs containing illegal words:
This part describes, one by one, the problems in the sentences or paragraphs that contain illegal words. Each problem is described briefly, stating which illegal words are used, where they appear in the sentence or paragraph, and how they affect compliance. Specific examples can be given, with screenshots or quotations of the original content where necessary.
4) Modification suggestions:
This part provides a modification suggestion for each problem. For each use of an illegal word, it gives a compliant alternative word or expression. It also provides guidance on modifying the sentence or paragraph structure, reorganizing the content, or changing the wording. Suggestions should be specific and clear and, as far as possible, explain why the change is made.
5) Summary:
The summary recaps the results of the compliance check, briefly outlining the problems found and the recommended solutions. It emphasizes the importance of compliance and encourages taking measures to resolve the problems and comply with the regulations.
When generating the compliance check report, the problems and suggestions should be conveyed clearly and concisely, with sufficient supporting material to back the report's conclusions. Any organization-specific or industry-specific requirements or guidelines should also be followed to ensure the consistency and readability of the report.
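The five-part report structure above can be sketched as a small data model with a plain-text renderer. The class names, field names, and wording of the rendered lines are hypothetical and chosen only to mirror the sections listed in this embodiment.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    passage: str        # sentence or paragraph containing the illegal word
    illegal_word: str   # the word that triggered the finding
    suggestion: str     # compliant alternative wording


@dataclass
class ComplianceReport:
    procurement_info: dict
    findings: list = field(default_factory=list)

    def render(self):
        """Render the report as plain text, one section per report part."""
        lines = ["1) Introduction",
                 "Compliance check of the procurement document.",
                 "2) Procurement information"]
        lines += [f"{key}: {value}" for key, value in self.procurement_info.items()]
        lines.append("3) Problem descriptions")
        for i, f in enumerate(self.findings, 1):
            position = f.passage.find(f.illegal_word)
            lines.append(f'{i}. "{f.illegal_word}" at character {position} in: {f.passage}')
        lines.append("4) Modification suggestions")
        for i, f in enumerate(self.findings, 1):
            lines.append(f"{i}. {f.suggestion}")
        lines.append("5) Summary")
        lines.append(f"{len(self.findings)} issue(s) found.")
        return "\n".join(lines)


report = ComplianceReport(
    procurement_info={"item": "laptops", "amount": "100,000 yuan"},
    findings=[Finding("Bidders must offer the designated brand.",
                      "designated brand",
                      'Replace with "equivalent or better products".')],
)
text = report.render()
```

Keeping the findings as structured records, separate from the rendering, makes it straightforward to emit the same report in another layout if an organization has its own template.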
Furthermore, the report generation module is also trained on the user's historical correction records for illegal words and optimizes the illegal word check results so as to mark the key illegal sentences and paragraphs.
During routine audits, normal words, sentences, or paragraphs in a procurement document may be wrongly judged as violations, and the user then has to rely on experience to pick out the correct audit rules from the generated illegal entries. For this reason, this application trains on the user's historical correction records for illegal words in order to record the user's correction habits, refine the audit rules for procurement documents, and optimize the illegal word check results; sentences and paragraphs that were wrongly judged as violations are ultimately removed, and the key illegal sentences and paragraphs are marked. The training on the correction records can use machine learning algorithms such as support vector machines or random forests.
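The feedback loop described above can be sketched as follows. The patent names support vector machines and random forests for this training step; as a deliberately simpler stand-in, this example only tracks how often users dismissed each flagged word and suppresses words that are dismissed most of the time. The class name, method names, and the 0.5 threshold are assumptions.

```python
from collections import defaultdict


class CorrectionHistory:
    """Tracks how often users dismissed a flagged word as a false alarm."""

    def __init__(self):
        self._counts = defaultdict(lambda: [0, 0])  # word -> [dismissed, total]

    def record(self, word, dismissed):
        """Store one user decision about a flagged word."""
        counts = self._counts[word]
        counts[0] += int(dismissed)
        counts[1] += 1

    def dismissal_rate(self, word):
        """Fraction of past flags for word that the user dismissed."""
        dismissed, total = self._counts.get(word, (0, 0))
        return dismissed / total if total else 0.0

    def filter_flags(self, flagged, max_rate=0.5):
        """Keep (passage, word) pairs whose word is not usually dismissed."""
        return [(p, w) for p, w in flagged if self.dismissal_rate(w) <= max_rate]


history = CorrectionHistory()
for dismissed in (True, True, True, False):
    history.record("brand", dismissed)
history.record("sole supplier", False)

flagged = [("p1", "brand"), ("p2", "sole supplier"), ("p3", "new word")]
kept = history.filter_flags(flagged)
```

Words with no correction history are kept, so newly encountered illegal words are still shown to the user until enough feedback has accumulated.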
Beneficial effects of the present invention:
1) By identifying the illegal words together with the other key words in the sentences or paragraphs where they appear, the invention establishes the connection between key words and illegal words. This solves the problem in the prior art that words collocated with illegal words are not identified, which leads to inaccurate part-of-speech judgment of the illegal words;
2) For different procurement document types, the invention identifies the procurement information of the procurement document and extracts the key words that have a dependency relationship with it, thereby defining the key words of illegal words for each procurement document type and improving the part-of-speech judgment of illegal words;
3) A random forest model is used in establishing the connection between key words and illegal words, which improves the accuracy with which the relationship between the two is quantified;
4) The key words are encoded according to their order in the sentence or paragraph and converted into word sequence vectors, which highlights the key word features and improves the accuracy of model prediction.
The above is only a preferred embodiment of the present invention and does not limit the present invention in any form. Although the present invention has been disclosed above by way of a preferred embodiment, that embodiment is not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make minor changes or modifications that result in equivalent embodiments. Any simple modification, equivalent change, or refinement made to the above embodiment in accordance with the technical essence of the present invention, without departing from the content of its technical solution, still falls within the scope of the technical solution of the present invention.
Claims (9)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411012102.5A CN118551760B (en) | 2024-07-26 | 2024-07-26 | Purchasing file compliance checking system based on difference algorithm under AI large model |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411012102.5A CN118551760B (en) | 2024-07-26 | 2024-07-26 | Purchasing file compliance checking system based on difference algorithm under AI large model |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN118551760A (en) | 2024-08-27 |
| CN118551760B (en) | 2024-11-05 |
Family
ID=92453146
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411012102.5A Active CN118551760B (en) | 2024-07-26 | 2024-07-26 | Purchasing file compliance checking system based on difference algorithm under AI large model |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118551760B (en) |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118966477A (en) * | 2024-10-17 | 2024-11-15 | 深圳市易仓科技有限公司 | A supply chain path optimization method and system for cross-border e-commerce |
| CN119886103A (en) * | 2024-12-27 | 2025-04-25 | 北京市大数据中心 | Semantic analysis and keyword-driven document front-to-back consistency comparison method |
| CN120278685A (en) * | 2025-06-12 | 2025-07-08 | 博思数采科技股份有限公司 | Intelligent examination method for purchasing file qualification fairness based on multi-technology fusion |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112686036A (en) * | 2020-08-18 | 2021-04-20 | 平安国际智慧城市科技股份有限公司 | Risk text recognition method and device, computer equipment and storage medium |
| CN116720515A (en) * | 2023-06-05 | 2023-09-08 | 上海识装信息科技有限公司 | Sensitive word review methods, storage media and electronic devices based on large language models |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110851590A (en) * | 2019-09-11 | 2020-02-28 | 上海爱数信息技术股份有限公司 | Method for classifying texts through sensitive word detection and illegal content recognition |
| CN110765761A (en) * | 2019-09-16 | 2020-02-07 | 平安科技(深圳)有限公司 | Contract sensitive word checking method and device based on artificial intelligence and storage medium |
| CN113221541A (en) * | 2021-07-09 | 2021-08-06 | 清华大学 | Data extraction method and device |
| CN117520565A (en) * | 2023-12-13 | 2024-02-06 | 信雅达科技股份有限公司 | Risk identification method based on knowledge graph |
2024-07-26: CN application CN202411012102.5A, patent CN118551760B, status Active
Also Published As
| Publication number | Publication date |
|---|---|
| CN118551760A (en) | 2024-08-27 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN110020660B (en) | Integrity assessment of unstructured processes using Artificial Intelligence (AI) techniques | |
| JP7664262B2 (en) | Cross-document intelligent authoring and processing assistant | |
| CN118551760B (en) | Purchasing file compliance checking system based on difference algorithm under AI large model | |
| US11403465B2 (en) | Systems and methods for report processing | |
| US20230394235A1 (en) | Domain-specific document validation | |
| CN117707922A (en) | Method and device for generating test case, terminal equipment and readable storage medium | |
| CN114444477B (en) | Administrative law enforcement case quality supervision method and system | |
| US20220366346A1 (en) | Method and apparatus for document evaluation | |
| CN118228715B (en) | A method, device and medium for automatically checking the content of a work report | |
| CN120124612A (en) | A method for automatic generation of audit reports based on natural language processing | |
| CN119047450A (en) | Contract template data processing method and device and computer equipment | |
| CN119476938A (en) | Automated contract text review method and system based on unified clause modeling | |
| CN111881294B (en) | Corpus labeling system, corpus labeling method and storage medium | |
| CN119940322B (en) | A method and system for generating rational drug use reports combined with artificial intelligence | |
| Nasfi et al. | Improving data cleaning by learning from unstructured textual data | |
| CN119783660A (en) | Text error recognition method and related device based on large model | |
| CN119830883A (en) | Contract template collaborative editing system and method | |
| CN118982329A (en) | BIM model information review method, device and computer program | |
| CN119180266A (en) | Historical data-based audit opinion generation method, device and equipment | |
| WO2025019581A1 (en) | Data digitization via custom integrated machine learning ensembles | |
| CN118966181A (en) | A personalized privacy policy generation method, system device and medium | |
| Oumoussa et al. | Automated Microservices Identification through Business Process Analysis: A Semantic-driven Clustering Approach | |
| Li et al. | An Accounting Classification System Using Constituency Analysis and Semantic Web Technologies | |
| CN119692322B (en) | Abnormal copywriting rewriting method and device, electronic device and storage medium | |
| CN118350353B (en) | Method and system for controlling on-line document editing structuring segmentation cutting and item |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |