
CN116719902A - Essay off-topic detection and scoring system and method based on latent semantic indexing - Google Patents

Essay off-topic detection and scoring system and method based on latent semantic indexing

Info

Publication number
CN116719902A
CN116719902A (application CN202310674533.7A)
Authority
CN
China
Prior art keywords
unit
scoring
similarity
article
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310674533.7A
Other languages
Chinese (zh)
Inventor
何经武
曾凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Youlixin Technology Co ltd
Original Assignee
Jiangsu Youlixin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Youlixin Technology Co ltd filed Critical Jiangsu Youlixin Technology Co ltd
Priority to CN202310674533.7A
Publication of CN116719902A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of natural language processing, and in particular to an essay off-topic detection and scoring system and method based on latent semantic indexing. A data collection and preprocessing module collects essays and their corresponding topics and preprocesses the collected articles; a term matrix module builds a term-document matrix and normalizes the result; a latent semantic indexing module performs singular value decomposition on the matrix according to the latent semantic index and calculates the similarity between the article and the sample essay; an off-topic detection module uses the similarity results of the TF-IDF* algorithm and the latent semantic indexing module to comprehensively detect whether the article is off-topic; a scoring module compares the detection result against preset scoring criteria and gives the essay's score; and an output module outputs the off-topic detection result and the article's score to the user.

Description

基于潜在语义索引的作文跑题检测评分系统及方法Essay off-topic detection and scoring system and method based on latent semantic indexing

技术领域Technical field

本发明涉及自然语言处理技术领域,具体为基于潜在语义索引的作文跑题检测评分系统及方法。The present invention relates to the technical field of natural language processing, specifically to an essay off-topic detection and scoring system and method based on latent semantic indexing.

背景技术Background technique

作文是教学中重要的内容之一,也是考核学生语言表达能力和思维水平的有效方式。然而,作文评阅是一项耗时费力的工作,需要教师对每篇作文进行细致的阅读、评分和批改。此外,由于教师的主观性、经验性和个体差异,作文评分往往存在一定的不公正性和不一致性。因此,如何利用计算机技术实现作文的自动评阅,提高评阅效率和质量,是当前教育领域面临的一个重要问题。Composition is one of the important contents in teaching, and it is also an effective way to assess students' language expression ability and thinking level. However, essay review is a time-consuming and labor-intensive task that requires teachers to read, grade and mark each essay carefully. In addition, due to teachers' subjectivity, experience and individual differences, there is often a certain degree of unfairness and inconsistency in essay scoring. Therefore, how to use computer technology to automatically review essays and improve review efficiency and quality is an important issue facing the current education field.

目前,已有一些基于自然语言处理技术的作文自动评阅系统被提出和应用。这些系统通常采用监督学习的方法,利用少量人工评分的样本学习一个预测模型,根据不同的特征来描述和评价作文的质量。例如,传统的方法利用自然语言处理浅层分析的结果构建特征,如文章的长度、段落数、词汇丰富性等。近年来,基于深度学习的端到端学习方法也被应用于作文评分,作文被抽象地表示为分布式向量。At present, some automatic essay review systems based on natural language processing technology have been proposed and applied. These systems usually use supervised learning methods, using a small number of manually rated samples to learn a prediction model to describe and evaluate the quality of compositions based on different features. For example, traditional methods use the results of shallow analysis of natural language processing to construct features, such as the length of the article, the number of paragraphs, and the richness of vocabulary. In recent years, end-to-end learning methods based on deep learning have also been applied to essay scoring, and essays are abstractly represented as distributed vectors.

然而,这些方法存在一些不足之处。首先,它们往往忽略了作文与给定题目之间的关系,无法有效地检测作文是否跑题或偏题。其次,它们往往缺乏对作文内容和结构的深入理解,无法准确地评价作文在主题、论点、论据、篇章结构等方面的优劣。第三,它们往往缺乏可解释性,无法给出具体的评分依据和改进建议。However, these methods have some shortcomings. First, they often ignore the relationship between the essay and the given topic, and cannot effectively detect whether the essay is off-topic or strays from the topic. Second, they often lack an in-depth understanding of the essay's content and structure, and cannot accurately evaluate its strengths and weaknesses in terms of theme, thesis, supporting evidence, and discourse structure. Third, they often lack interpretability and cannot provide a concrete scoring basis or suggestions for improvement.

发明内容Summary of the invention

本发明的目的在于提供基于潜在语义索引的作文跑题检测评分系统及方法,以解决上述背景技术中提出的问题。The purpose of the present invention is to provide an essay off-topic detection and scoring system and method based on latent semantic indexing to solve the problems raised in the above background technology.

为了解决上述技术问题,本发明提供如下技术方案:基于潜在语义索引的作文跑题检测评分系统,所述系统包括数据收集和预处理模块、术语矩阵模块、潜在语义索引模块、跑题检测模块、评分模块和输出模块,In order to solve the above technical problems, the present invention provides the following technical solution: an essay off-topic detection and scoring system based on latent semantic indexing, the system comprising a data collection and preprocessing module, a term matrix module, a latent semantic indexing module, an off-topic detection module, a scoring module, and an output module,

所述数据收集和预处理模块用于收集作文和其对应的题目,并对采集的文章进行预处理;The data collection and preprocessing module is used to collect compositions and their corresponding topics, and preprocess the collected articles;

所述术语矩阵模块用于建立术语-文档矩阵,对结果进行归一化;The term matrix module is used to establish a term-document matrix and normalize the results;

所述潜在语义索引模块用于根据潜在语义索引来对矩阵进行奇异值分解,并计算文章与范文之间的相似度;The latent semantic index module is used to perform singular value decomposition on the matrix based on the latent semantic index and calculate the similarity between the article and the sample article;

所述跑题检测模块用于使用TF-IDF*算法和潜在语义索引模块的相似度结果,综合对文章是否跑题进行检测;The off-topic detection module is used to use the similarity results of the TF-IDF* algorithm and the latent semantic index module to comprehensively detect whether the article is off-topic;

所述评分模块用于根据检测结果和预设的评分标准进行对比,给出作文的分数;The scoring module is used to compare the test results with the preset scoring standards and give the score of the composition;

所述输出模块用于向用户输出文章跑题检测结果和文章的评分结果。The output module is used to output the article off-topic detection results and the article scoring results to the user.

本发明使用潜在语义索引来降低术语-文档矩阵的维数,提高了文本处理的效率和效果,可识别出传统向量空间模型中不明显的术语和文档之间的关系,提高了识别度,通过识别和利用文档集合中的底层结构来提高检索的准确性,可以处理同义词和多义词,使用TF-IDF*算法提高文章跑题检测的准确性,可以更好的判断文章是否跑题,优化了关键词的提取。The present invention uses latent semantic indexing to reduce the dimensionality of the term-document matrix, improving the efficiency and effect of text processing. It can identify relationships between terms and documents that are not apparent in the traditional vector space model, improving recognition; by identifying and exploiting the underlying structure of the document collection it improves retrieval accuracy and can handle synonyms and polysemous words; and it uses the TF-IDF* algorithm to improve the accuracy of off-topic detection, better judging whether an article is off-topic and optimizing keyword extraction.

进一步的,所述数据收集和预处理模块包括数据采集单元和预处理单元,所述数据采集单元由数据抓取单元和数据存储单元组成,所述数据抓取单元用于对作文和作文的题目进行抓取收集,所述数据存储单元用于将数据抓取单元获取的数据进行存储,所述预处理单元由数据清洗单元和英文转换单元组成,所述数据清洗单元用于将文章的停用词和标点符号进行删除,所述英文转换单元用于将文章中的所有英文单词转换为小写英文。Further, the data collection and preprocessing module includes a data acquisition unit and a preprocessing unit. The data acquisition unit consists of a data capture unit and a data storage unit: the data capture unit crawls and collects essays and their topics, and the data storage unit stores the data obtained by the data capture unit. The preprocessing unit consists of a data cleaning unit and an English conversion unit: the data cleaning unit removes stop words and punctuation marks from the article, and the English conversion unit converts all English words in the article to lowercase.

本发明通过对文章和其对应的题目进行收集,可以更好的匹配文章和题目的关联性,将数据存储于数据存储单元,可以让预处理更加的便捷,通过对文章进行预处理,可以减少冗余信息,提高文本的处理效率,删除标点符号可以使文本更加的规范化,易于处理,提高结果的准确性,将英文转化为小写,消除了大小写的干扰,提高数据的一致性,使文本格式更加统一规整,使数据处理更加的高效方便。By collecting articles together with their corresponding topics, the present invention can better match the relevance between articles and topics, and storing the data in the data storage unit makes preprocessing more convenient. Preprocessing the articles reduces redundant information and improves text processing efficiency; removing punctuation makes the text more standardized, easier to process, and improves the accuracy of the results; and converting English to lowercase eliminates case interference, improves data consistency, and makes the text format more uniform, so that data processing is more efficient and convenient.

进一步的,所述术语矩阵模块包括矩阵创立单元和归一化单元,所述矩阵创立单元用于创立术语-文档矩阵,所述归一化单元用于将矩阵的表示方法进行归一化处理,以便于进行相似度计算。Further, the term matrix module includes a matrix creation unit and a normalization unit: the matrix creation unit creates the term-document matrix, and the normalization unit normalizes the matrix representation to facilitate similarity calculation.

本发明通过创立术语-文档矩阵,将矩阵进行归一化处理,方便相似度计算,将文档表示为向量,提高计算效率。The present invention creates a term-document matrix, normalizes the matrix, facilitates similarity calculation, expresses documents as vectors, and improves calculation efficiency.

进一步的,所述潜在语义索引模块包括奇异值分解单元和相似度检测单元,所述奇异值分解单元用于对矩阵进行奇异值分解,降低数据的维度,所述相似度检测单元用于使用余弦相似度来计算文章和范文之间的相似度。Further, the latent semantic indexing module includes a singular value decomposition unit and a similarity detection unit: the singular value decomposition unit performs singular value decomposition on the matrix to reduce the dimensionality of the data, and the similarity detection unit uses cosine similarity to calculate the similarity between the article and the sample essay.

本发明对术语-文档矩阵进行奇异值分解,降低了数据维度,提高文本处理的效率和效果,通过识别和利用集合中的底层结构来提高检索准确性,使用余弦相似度来计算文章和题目的贴合度,不需要大量矩阵计算,提高了计算速度,具有更高的精确度,并且适用于高维度向量。The present invention performs singular value decomposition on the term-document matrix, reducing the data dimensionality and improving the efficiency and effect of text processing; retrieval accuracy is improved by identifying and exploiting the underlying structure of the collection; and cosine similarity is used to compute the fit between the article and the topic, which requires no large amount of matrix computation, improves computation speed, offers higher accuracy, and is suitable for high-dimensional vectors.

进一步的,所述跑题检测模块包括TF-IDF*算法单元和检测单元,所述TF-IDF*算法单元用于对文章是否跑题进行计算,所述检测单元用于综合考虑TF-IDF*算法和潜在语义索引的相似度计算结果,对文章是否跑题进行检测。Further, the off-topic detection module includes a TF-IDF* algorithm unit and a detection unit: the TF-IDF* algorithm unit computes whether the article is off-topic, and the detection unit jointly considers the TF-IDF* result and the latent-semantic-indexing similarity result to detect whether the article is off-topic.

本发明使用了TF-IDF*算法,通过对原始TF-IDF算法进行改进,综合考虑TF-IDF*算法和潜在语义索引的相似度结果,对文章是否跑题进行检测,提高了文本相似性计算的准确性,优化了关键词的抽取,可以得到更加准确的检测结果。The present invention uses the TF-IDF* algorithm. By improving the original TF-IDF algorithm and jointly considering the TF-IDF* result and the latent-semantic-indexing similarity result when detecting whether an article is off-topic, it improves the accuracy of text similarity calculation and optimizes keyword extraction, yielding more accurate detection results.

进一步的,所述评分模块包括相似度对比单元和评分单元,所述相似度对比单元用于将文章相似度和系统标准进行比较,划分出不同的等级,所述评分单元用于根据相似度对比单元的等级结果和评分标准进行评分。Further, the scoring module includes a similarity comparison unit and a scoring unit: the similarity comparison unit compares the article's similarity against system standards and divides the results into different grades, and the scoring unit assigns a score based on the grade produced by the similarity comparison unit and the scoring criteria.

本发明通过将文章相似度和系统标准进行比较,划分出不同等级,并将等级与评分标准进行对比,得出文章的评分,不需要复杂的计算,评分方法更加简单明了,可以快速得到评估结果。The present invention compares the article's similarity with system standards to divide it into grades, then compares the grade against the scoring criteria to obtain the article's score. No complicated calculation is required; the scoring method is simple and clear, and the evaluation result can be obtained quickly.

进一步的,所述输出模块包括可视化单元和报告生成器,所述可视化单元用于将文章的相似度和评分结果转化为图表形式,所述报告生成器用于将文章相似度和评分结果生成报告。Further, the output module includes a visualization unit and a report generator, the visualization unit is used to convert the article similarity and scoring results into chart form, and the report generator is used to generate a report based on the article similarity and scoring results.

本发明通过可视化将文章的相似度和评分结果转化为图表形式,更加的直观清楚,将文章相似度和评分结果生成报告,提高了数据可视化效果,便于监测结果,可以方便老师更好的为学生提供建议。The present invention converts the article's similarity and scoring results into chart form through visualization, which is more intuitive and clear, and generates a report from the similarity and scoring results. This improves the data visualization effect, makes results easier to monitor, and helps teachers give better suggestions to students.

基于潜在语义索引的作文跑题检测方法,所述方法包括以下步骤:An essay off-topic detection method based on latent semantic indexing, which method includes the following steps:

步骤S100:数据采集单元对文章和文章对应的题目进行收集;Step S100: The data collection unit collects articles and topics corresponding to the articles;

步骤S200:对文章进行预处理,删除停用词和标点符号,并将英文转换为小写;Step S200: Preprocess the article, delete stop words and punctuation marks, and convert English to lowercase;
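The preprocessing in step S200 can be sketched as follows; the stop-word list here is a tiny illustrative stand-in for the full list a real system would load:

```python
import re
import string

# Hypothetical stop-word list; a deployed system would load a complete
# list for the target language (the data cleaning unit's role).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def preprocess(text: str) -> list[str]:
    """Step S200 sketch: lowercase English, strip punctuation, drop stop words."""
    # Lowercase first so stop-word matching is case-insensitive.
    text = text.lower()
    # Delete punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = re.findall(r"\S+", text)
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The cat, AND the Dog!"))  # ['cat', 'dog']
```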

步骤S300:创立术语-文档矩阵,每一行对应术语,每一列对应文档;Step S300: Create a term-document matrix, with each row corresponding to a term and each column corresponding to a document;

Let A_{m×n} be the term-document matrix, whose rows correspond to terms and whose columns correspond to documents, and let d_i be the i-th column vector of A, i = 1, 2, …, n. To compare the similarity between document i and document j, simply compute the inner product d_i · d_j (entry (i, j) of A^T A). To normalize the result, divide by the vector norms; this normalized result is called the cosine similarity: cos(θ) = d_i · d_j / (‖d_i‖ ‖d_j‖), where −1 ≤ cos(θ) ≤ 1 and θ is the angle between d_i and d_j. The closer the cosine similarity is to 1, the higher the similarity between documents i and j,
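The term-document matrix of step S300 and the cosine similarity just described can be sketched in plain Python; the toy corpus and vocabulary are illustrative only:

```python
# Toy corpus standing in for the essays; vocabulary and counts are illustrative.
docs = [["topic", "essay", "topic"], ["essay", "score"], ["score", "score"]]
vocab = sorted({t for d in docs for t in d})

# Term-document matrix A (step S300): rows = terms, columns = documents.
A = [[d.count(t) for d in docs] for t in vocab]

def column(j):
    # Document j's term vector d_j is the j-th column of A.
    return [row[j] for row in A]

def cosine(i, j):
    """cos(theta) = d_i . d_j / (||d_i|| ||d_j||)."""
    di, dj = column(i), column(j)
    dot = sum(a * b for a, b in zip(di, dj))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(di) * norm(dj))

print(round(cosine(0, 0), 3))  # a document compared with itself gives 1.0
```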

步骤S400:潜在语义索引模块对矩阵进行奇异值分解,并进行余弦相似度计算;Step S400: The latent semantic index module performs singular value decomposition on the matrix and performs cosine similarity calculation;

所述奇异值分解方法:The singular value decomposition method:

Through singular value decomposition (SVD), A can be decomposed as A_{m×n} = U_{m×m} · Σ_{m×n} · V^T_{n×n}, with U = [u_1, u_2, …, u_m], Σ = diag{σ_1, σ_2, …, σ_p}, V = [v_1, v_2, …, v_n], where U is the orthogonal matrix of A A^T, satisfying A A^T = U Λ_1 U^T with Λ_1 diagonal; V is the orthogonal matrix of A^T A, satisfying A^T A = V Λ_2 V^T with Λ_2 diagonal; and Σ is a diagonal matrix containing the square roots of the eigenvalues of A A^T (equivalently A^T A), arranged in descending order on its diagonal,

A的奇异值分解也可以写成更紧凑的形式如下:The singular value decomposition of A can also be written in a more compact form as follows:

A_{m×n} = U_{m×p} · Σ_{p×p} · V^T_{p×n}, U = [u_1, u_2, …, u_p], Σ = diag{σ_1, σ_2, …, σ_p}, V = [v_1, v_2, …, v_p], p = min(m, n),

Next, select the k largest singular values σ_1, σ_2, …, σ_k from Σ, together with their corresponding vectors u_1, u_2, …, u_k from U and v_1, v_2, …, v_k from V, to obtain the rank-k approximation of A with minimum error in the Frobenius norm. Write this approximation as A*_{m×n} = U_{m×k} · Σ*_{k×k} · V^T_{k×n}, with U = [u_1, u_2, …, u_k], Σ* = diag{σ_1, σ_2, …, σ_k}, V = [v_1, v_2, …, v_k], k ≤ min(m, n) = p,

Instead of using the results contained in A^T A, i.e. d_i · d_j, use the approximate results contained in A*^T A*, i.e. d*_i · d*_j, to compare document i and document j, where d_i is the i-th column vector of A and d*_i is the i-th column vector of A*, i = 1, …, n. In the reduced space this comparison becomes (Σ*_{k×k} w_i) · (Σ*_{k×k} w_j), or in matrix notation (Σ*_{k×k} w_i)^T (Σ*_{k×k} w_j), where w_i is the i-th column vector of V^T, since

A*^T A* = (U_{m×k} · Σ*_{k×k} · V^T_{k×n})^T · (U_{m×k} · Σ*_{k×k} · V^T_{k×n}) = (Σ*_{k×k} V^T)^T (Σ*_{k×k} V^T), with Σ*_{k×k} = diag{σ_1, σ_2, …, σ_k},
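The singular value decomposition and rank-k truncation above can be sketched with NumPy; the matrix values and the choice k = 2 are illustrative only:

```python
import numpy as np

# Illustrative term-document matrix (rows = terms, columns = documents).
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 2.0],
              [1.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# Compact SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k largest singular values to form the rank-k approximation A*.
k = 2
A_star = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# In the reduced space, document i is represented by Sigma* @ w_i,
# where w_i is the i-th column of V^T.
doc_vecs = np.diag(s[:k]) @ Vt[:k, :]
print(doc_vecs.shape)  # (2, 3): one k-dimensional vector per document
```

Dot products between columns of `doc_vecs` reproduce the entries of A*^T A* described above.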

所述相似度计算方法:The similarity calculation method:

q* = U^T_{m×k} q. To compare the similarity between a new document (the query q) and document i, compute the cosine similarity between q* and Σ*_{k×k} w_i as follows: cos(θ) = q* · (Σ*_{k×k} w_i) / (‖q*‖ ‖Σ*_{k×k} w_i‖), where −1 ≤ cos(θ) ≤ 1 and θ is the angle between the vectors q* and Σ*_{k×k} w_i,
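Folding a new essay into the reduced space, per the q* formula above, might look like the following sketch (illustrative matrix; the transpose in U_k^T q is needed for the dimensions to match):

```python
import numpy as np

# LSI space built from an illustrative term-document matrix.
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk = U[:, :k], np.diag(s[:k])

def fold_in(q):
    """Project a query term vector into the LSI space: q* = U_k^T q."""
    return Uk.T @ q

def query_doc_similarity(q, i):
    """Cosine similarity between q* and the reduced document vector Sigma* w_i."""
    q_star = fold_in(q)
    d_i = Sk @ Vt[:k, i]
    return float(q_star @ d_i / (np.linalg.norm(q_star) * np.linalg.norm(d_i)))

sim = query_doc_similarity(A[:, 0], 0)  # query identical to document 0
print(round(sim, 3))  # 1.0
```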

步骤S500:跑题检测模块使用TF-IDF*算法和潜在语义索引模块的相似度结果,综合对文章是否跑题进行检测;Step S500: The off-topic detection module uses the TF-IDF* algorithm and the similarity results of the latent semantic index module to comprehensively detect whether the article is off-topic;

TF-IDF*算法由传统的TF-IDF算法改进而来,The TF-IDF* algorithm is improved from the traditional TF-IDF algorithm.

传统的TF-IDF算法公式为:TF-IDF=TF·IDF,The traditional TF-IDF algorithm formula is: TF-IDF=TF·IDF,

TF_{i,j} = n_{i,j} / Σ_k n_{k,j}, where n_{i,j} is the number of times term t_i appears in document d_j, and TF_{i,j} is the resulting frequency of term t_i in document d_j,

IDF_i = log(|D| / |{j : t_i ∈ d_j}|), where |D| is the total number of documents and |{j : t_i ∈ d_j}| is the number of documents containing term t_i,
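The classical TF and IDF quantities above can be sketched in Python; the corpus here is a toy stand-in:

```python
import math

# Toy corpus; real document collections are far larger.
docs = [["essay", "topic", "topic"], ["essay", "score"], ["music", "score"]]

def tf(term, doc):
    # TF_{i,j}: frequency of term t_i in document d_j.
    return doc.count(term) / len(doc)

def idf(term):
    # IDF_i = log(|D| / |{j : t_i in d_j}|).
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

# "topic" is frequent in doc 0 and rare in the corpus, so it scores high there.
print(round(tf_idf("topic", docs[0]), 3))
```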

改进后的TF-IDF*算法公式为: The improved TF-IDF* algorithm formula is:

其中TF表示词条在文本中出现的频率,N表示语料库中的文档总数,IDF表示包含词条的文档数,L表示文档的平均长度,I表示词条的长度,where TF is the frequency of the term in the text, N is the total number of documents in the corpus, IDF here denotes the number of documents containing the term, L is the average document length, and I is the length of the term,

改进:Improve:

(1)对TF取对数,以平滑文章高频词的权重;(1) Take the logarithm of TF to smooth the weight of high-frequency words in the article;

(2)对IDF加上一个常数0.5,以减少文章低频词的权重;(2) Add a constant 0.5 to the IDF to reduce the weight of low-frequency words in the article;

公式原理:Formula principle:

(1)一个词在文档中出现的次数越多,说明它越能反映文档的主题,所以TF是正相关的;(1) The more times a word appears in a document, the more it reflects the topic of the document, so TF is positively correlated;

(2)一个词在语料库中出现的文档越少,说明它越能区分不同的文档,所以IDF是负相关的;(2) The fewer documents a word appears in the corpus, the better it can distinguish between different documents, so IDF is negatively correlated;

(3)一个词条的长度越长,说明它越具有特征性,所以词条长度是正相关的;(3) The longer the length of an entry, the more characteristic it is, so the length of the entry is positively correlated;

(4)一个文档的长度越长,说明它包含的信息越多,所以文档长度是负相关的;(4) The longer a document is, the more information it contains, so document length is negatively correlated;

TF-IDF*的计算结果绝对值越大,表示重要程度越高,更能够反映主题,即更贴合主题,绝对值越小,表示重要程度越低,不能够反映主题,即跑题,The larger the absolute value of the TF-IDF* result, the more important the term and the better it reflects the theme, i.e. the closer the essay fits the topic; the smaller the absolute value, the less important the term and the less it reflects the theme, i.e. the essay is off-topic,

跑题检测方法:Off-topic detection method:

P = 0.4·|cos θ| + 0.6·|TF-IDF*|,

其中P表示文章的跑题检测指标,cosθ表示余弦相似度,TF-IDF*表示词频逆文档频率,当P≥a时,文章没有跑题,当P<a时,文章判断为跑题,a为系统预设的阈值,where P is the article's off-topic detection indicator, cos θ is the cosine similarity, and TF-IDF* is the improved term-frequency-inverse-document-frequency value. When P ≥ a the article is on topic; when P < a the article is judged off-topic, where a is a preset system threshold,
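The combined indicator P and its threshold test can be sketched as follows. The 0.4/0.6 weights and the P < a rule come from the description; the TF-IDF* input is a stand-in scalar, since the exact TF-IDF* formula is not reproduced in the text:

```python
def off_topic_indicator(cos_sim, tf_idf_star):
    """P = 0.4*|cos(theta)| + 0.6*|TF-IDF*| per the description above."""
    return 0.4 * abs(cos_sim) + 0.6 * abs(tf_idf_star)

def is_off_topic(cos_sim, tf_idf_star, threshold):
    """Off-topic when P < a, where a is the preset system threshold."""
    return off_topic_indicator(cos_sim, tf_idf_star) < threshold

print(is_off_topic(0.9, 0.8, threshold=0.5))  # high similarity: False (on topic)
```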

步骤S600:评分模块根据相似度对应的标准进行评分;Step S600: The scoring module scores according to the standard corresponding to the similarity;
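A tiered scoring rule of the kind step S600 describes might look like this; the grade bands and scores are hypothetical, not taken from the description:

```python
# Hypothetical grade bands: (minimum similarity, score). A real system
# would use the preset scoring criteria mentioned in the description.
GRADE_BANDS = [(0.8, 90), (0.6, 75), (0.4, 60)]

def score_essay(similarity):
    """Map a similarity value to a score via the first matching band."""
    for lower_bound, score in GRADE_BANDS:
        if similarity >= lower_bound:
            return score
    return 40  # below the lowest band

print(score_essay(0.7))  # falls in the 0.6 band: 75
```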

步骤S700:将文章的相似度和评分结果转化为图表进行显示,并生成报告。Step S700: Convert the similarity and scoring results of the article into charts for display, and generate a report.

与现有技术相比,本发明所达到的有益效果是:Compared with the prior art, the beneficial effects achieved by the present invention are:

(1)本发明使用潜在语义索引来降低术语-文档矩阵的维数,提高了文本处理的效率和效果,通过识别出传统向量空间模型中不明显的术语和文档之间的关系,提高识别度,识别和利用文档集合中的底层结构来提高检索的准确性;(1) The present invention uses latent semantic indexing to reduce the dimensionality of the term-document matrix, improving the efficiency and effect of text processing; by identifying relationships between terms and documents that are not apparent in the traditional vector space model it improves recognition, and it identifies and exploits the underlying structure of the document collection to improve retrieval accuracy;

(2)本发明对文章进行预处理,减少了冗余信息,提高了文本的处理效率,删除标点符号使文本更加的规范化,易于处理,提高结果的准确性,使文本格式统一化,使数据处理更加高效方便;(2) The present invention preprocesses the article, reducing redundant information and improving text processing efficiency; removing punctuation makes the text more standardized and easier to process, improves the accuracy of the results, and unifies the text format, making data processing more efficient and convenient;

(3)本发明使用奇异值分解,降低了数据维度,提高文本处理效率,使用余弦相似度来计算文章和题目的贴合度,提高了计算速度,具有更高的精确度;(3) The present invention uses singular value decomposition to reduce the data dimension and improve text processing efficiency. It uses cosine similarity to calculate the fit between the article and the title, which improves the calculation speed and has higher accuracy;

(4)本发明使用了TF-IDF*算法,通过对原始TF-IDF算法进行改进,综合考虑TF-IDF*算法和潜在语义索引的相似度结果,对文章是否跑题进行检测,提高了文本相似性计算的准确性,优化了关键词的抽取,可以得到更加准确的检测结果;(4) The present invention uses the TF-IDF* algorithm; by improving the original TF-IDF algorithm and jointly considering the TF-IDF* result and the latent-semantic-indexing similarity result when detecting whether an article is off-topic, it improves the accuracy of text similarity calculation and optimizes keyword extraction, yielding more accurate detection results;

(5)本发明通过可视化和报告对相似度和评分结果进行显示,提高了数据可视化效果,便于监测结果,方便老师更好为学生提供建议。(5) The present invention displays the similarity and scoring results through visualization and reports, improving the data visualization effect, making results easier to monitor, and helping teachers give better suggestions to students.

附图说明Description of the drawings

附图用来提供对本发明的进一步理解,并且构成说明书的一部分,与本发明的实施例一起用于解释本发明,并不构成对本发明的限制。在附图中:The drawings are used to provide a further understanding of the present invention and constitute a part of the specification. They are used to explain the present invention together with the embodiments of the present invention and do not constitute a limitation of the present invention. In the attached picture:

图1是本发明基于潜在语义索引的作文跑题检测评分系统的结构示意图;Figure 1 is a schematic structural diagram of the essay off-topic detection and scoring system based on latent semantic indexing according to the present invention;

图2是本发明基于潜在语义索引的作文跑题检测评分方法的流程示意图。Figure 2 is a schematic flow chart of the essay off-topic detection and scoring method based on latent semantic indexing according to the present invention.

具体实施方式Specific embodiments

下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, rather than all the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts fall within the scope of protection of the present invention.

请参阅图1,本发明提供技术方案:基于潜在语义索引的作文跑题检测评分系统,所述系统包括数据收集和预处理模块、术语矩阵模块、潜在语义索引模块、跑题检测模块、评分模块和输出模块,Please refer to Figure 1. The present invention provides a technical solution: an essay off-topic detection and scoring system based on latent semantic indexing. The system includes a data collection and preprocessing module, a term matrix module, a latent semantic index module, an off-topic detection module, a scoring module and an output module,

所述数据收集和预处理模块用于收集作文和其对应的题目,并对采集的文章进行预处理;The data collection and preprocessing module is used to collect compositions and their corresponding topics, and preprocess the collected articles;

所述术语矩阵模块用于建立术语-文档矩阵,对结果进行归一化;The term matrix module is used to establish a term-document matrix and normalize the results;

所述潜在语义索引模块用于根据潜在语义索引来对矩阵进行奇异值分解,并计算文章与范文之间的相似度;The latent semantic index module is used to perform singular value decomposition on the matrix based on the latent semantic index and calculate the similarity between the article and the sample article;

所述跑题检测模块用于使用TF-IDF*算法和潜在语义索引模块的相似度结果,综合对文章是否跑题进行检测;The off-topic detection module is used to use the similarity results of the TF-IDF* algorithm and the latent semantic index module to comprehensively detect whether the article is off-topic;

所述评分模块用于根据检测结果和预设的评分标准进行对比,给出作文的分数;The scoring module is used to compare the test results with the preset scoring standards and give the score of the composition;

所述输出模块用于向用户输出文章跑题检测结果和文章的评分结果。The output module is used to output the article off-topic detection results and the article scoring results to the user.

本发明使用潜在语义索引来降低术语-文档矩阵的维数,提高了文本处理的效率和效果,可识别出传统向量空间模型中不明显的术语和文档之间的关系,提高了识别度,通过识别和利用文档集合中的底层结构来提高检索的准确性,可以处理同义词和多义词,使用TF-IDF*算法提高文章跑题检测的准确性,可以更好的判断文章是否跑题,优化了关键词的提取。The present invention uses latent semantic indexing to reduce the dimensionality of the term-document matrix, improving the efficiency and effect of text processing. It can identify relationships between terms and documents that are not apparent in the traditional vector space model, improving recognition; by identifying and exploiting the underlying structure of the document collection it improves retrieval accuracy and can handle synonyms and polysemous words; and it uses the TF-IDF* algorithm to improve the accuracy of off-topic detection, better judging whether an article is off-topic and optimizing keyword extraction.

所述数据收集和预处理模块包括数据采集单元和预处理单元,所述数据采集单元由数据抓取单元和数据存储单元组成,所述数据抓取单元用于对作文和作文的题目进行抓取收集,所述数据存储单元用于将数据抓取单元获取的数据进行存储,所述预处理单元由数据清洗单元和英文转换单元组成,所述数据清洗单元用于将文章的停用词和标点符号进行删除,所述英文转换单元用于将文章中的所有英文单词转换为小写英文。The data collection and preprocessing module includes a data acquisition unit and a preprocessing unit. The data acquisition unit consists of a data capture unit and a data storage unit: the data capture unit crawls and collects essays and their topics, and the data storage unit stores the data obtained by the data capture unit. The preprocessing unit consists of a data cleaning unit and an English conversion unit: the data cleaning unit removes stop words and punctuation marks from the article, and the English conversion unit converts all English words in the article to lowercase.

By collecting each essay together with its corresponding prompt, the invention can better match the essay to the prompt, and storing the data in the storage unit streamlines preprocessing. Preprocessing removes redundant information and improves processing efficiency; deleting punctuation normalizes the text and improves the accuracy of the results; converting English to lowercase eliminates case variation, improves data consistency, and yields a more uniform text format, making data processing more efficient.
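As a minimal sketch of the preprocessing step described above (the stop-word list here is illustrative only; the actual list used by the system is not specified in the text):

```python
import string

# Illustrative stop-word list; the text does not specify the real one.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}

def preprocess(text: str) -> list[str]:
    """Lowercase the text, strip punctuation, and drop stop words."""
    text = text.lower()  # convert all English words to lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOP_WORDS]

print(preprocess("The Cat, quickly, JUMPED over a fence!"))
# → ['cat', 'quickly', 'jumped', 'over', 'fence']
```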

The term matrix module comprises a matrix construction unit, which builds the term-document matrix, and a normalization unit, which normalizes the matrix representation to facilitate similarity computation.

By constructing a term-document matrix and normalizing it, the invention represents documents as vectors, which simplifies similarity calculation and improves computational efficiency.
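A small sketch of matrix construction and column normalization; raw term counts are assumed as entries, since the text does not specify the weighting scheme:

```python
import numpy as np

def term_document_matrix(docs):
    """Build the term-document matrix: rows are terms, columns are documents."""
    vocab = sorted({t for doc in docs for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for t in doc:
            A[index[t], j] += 1  # raw occurrence count (an assumption)
    return vocab, A

def normalize_columns(A):
    """Scale each document (column) vector to unit length for cosine comparison."""
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0  # leave empty documents untouched
    return A / norms

docs = [["cat", "sat", "mat"], ["cat", "cat", "dog"], ["dog", "ran"]]
vocab, A = term_document_matrix(docs)
A_norm = normalize_columns(A)
```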

The latent semantic indexing module comprises a singular value decomposition unit, which performs singular value decomposition on the matrix to reduce the dimensionality of the data, and a similarity detection unit, which uses cosine similarity to compute the similarity between the essay and a reference essay.

The invention performs singular value decomposition on the term-document matrix, reducing the dimensionality of the data and improving the efficiency and quality of text processing, and improves retrieval accuracy by identifying and exploiting the latent structure of the collection. It uses cosine similarity to measure how well the essay fits the prompt; this does not require heavy matrix computation, is fast and accurate, and works well for high-dimensional vectors.

The off-topic detection module comprises a TF-IDF* algorithm unit, which computes the TF-IDF* score for the essay, and a detection unit, which jointly considers the TF-IDF* score and the latent-semantic-indexing similarity result to decide whether the essay is off-topic.

The invention uses the TF-IDF* algorithm, an improvement on the original TF-IDF algorithm. By jointly considering the TF-IDF* score and the latent semantic indexing similarity result, the system detects off-topic essays more accurately, improves the accuracy of text similarity computation, and optimizes keyword extraction.

The scoring module comprises a similarity comparison unit, which compares the essay's similarity against system-defined thresholds and assigns it a grade, and a scoring unit, which produces a score from the grade and the scoring criteria.

By comparing the essay's similarity against system thresholds to assign a grade, and mapping the grade to a score via the scoring criteria, the invention obtains an essay score without complex computation; the scoring method is simple, transparent, and fast.
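A sketch of the grade-then-score mapping. The band cutoffs and rubric values here are hypothetical, since the text leaves the actual standards to system configuration:

```python
# Hypothetical similarity bands and rubric; the real standards are preset in the system.
GRADE_BANDS = [(0.8, "A"), (0.6, "B"), (0.4, "C")]
RUBRIC = {"A": 90, "B": 75, "C": 60, "D": 40}

def grade(similarity: float) -> str:
    """Map a similarity value to the first band it clears."""
    for cutoff, g in GRADE_BANDS:
        if similarity >= cutoff:
            return g
    return "D"

def score(similarity: float) -> int:
    """Look up the score for the essay's grade."""
    return RUBRIC[grade(similarity)]

print(grade(0.72), score(0.72))  # → B 75
```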

The output module comprises a visualization unit, which renders the essay's similarity and scoring results as charts, and a report generator, which compiles the similarity and scoring results into a report.

By visualizing the essay's similarity and scoring results as charts, the invention makes them more intuitive and clear; generating reports from these results improves data presentation and makes results easier to monitor, helping teachers give students better feedback.

Referring to Figure 2, the essay off-topic detection method based on latent semantic indexing comprises the following steps:

Step S100: the data acquisition unit collects essays and their corresponding prompts;

Step S200: preprocess each essay by removing stop words and punctuation and converting English text to lowercase;

Step S300: build the term-document matrix, with one row per term and one column per document;

Let A_{m×n} be the term-document matrix, whose rows correspond to terms and whose columns correspond to documents. Let d_i be the i-th column vector of A, i = 1, 2, …, n. To compare documents i and j, simply compute the inner product d_i · d_j (in matrix notation, d_i^T d_j). To normalize the result, divide by the vector norms; this normalized result is the cosine similarity:

cos(θ) = d_i · d_j / (‖d_i‖ ‖d_j‖), where −1 ≤ cos(θ) ≤ 1,

and θ is the angle between the vectors d_i and d_j. The closer the cosine similarity is to 1, the more similar documents i and j are.
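The cosine similarity above can be computed directly, for example with NumPy:

```python
import numpy as np

def cosine_similarity(d_i, d_j):
    """cos(theta) = (d_i . d_j) / (||d_i|| ||d_j||), in [-1, 1]."""
    return float(np.dot(d_i, d_j) / (np.linalg.norm(d_i) * np.linalg.norm(d_j)))

d1 = np.array([1.0, 2.0, 0.0])
d2 = np.array([2.0, 4.0, 0.0])
print(cosine_similarity(d1, d2))  # parallel document vectors → 1.0
```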

Step S400: the latent semantic indexing module performs singular value decomposition on the matrix and computes cosine similarities;

The singular value decomposition method:

By singular value decomposition (SVD), A can be factored as

A_{m×n} = U_{m×m} Σ_{m×n} V_{n×n}^T,

U = [u_1, u_2, …, u_m], Σ = diag{σ_1, σ_2, …, σ_p}, V = [v_1, v_2, …, v_n],

where U is the orthogonal matrix of eigenvectors of AA^T, satisfying AA^T = U Λ_1 U^T with Λ_1 diagonal; V is the orthogonal matrix of eigenvectors of A^T A, satisfying A^T A = V Λ_2 V^T with Λ_2 diagonal; and Σ is a diagonal matrix whose diagonal entries are the square roots of the eigenvalues of AA^T (equivalently, of A^T A), arranged in descending order.

The SVD of A can also be written in the more compact form

A_{m×n} = U_{m×p} Σ_{p×p} V_{p×n}^T, U = [u_1, u_2, …, u_p], Σ = diag{σ_1, σ_2, …, σ_p}, V = [v_1, v_2, …, v_p], p = min(m, n).

Next, select the k largest singular values σ_1, σ_2, …, σ_k from Σ, together with their corresponding vectors u_1, …, u_k from U and v_1, …, v_k from V, to obtain the rank-k approximation of A with minimum error in the Frobenius norm. Write this approximation as

A*_{m×n} = U_{m×k} Σ*_{k×k} V_{k×n}^T, U = [u_1, u_2, …, u_k], Σ* = diag{σ_1, σ_2, …, σ_k}, V = [v_1, v_2, …, v_k], k ≤ min(m, n) = p.

To compare documents i and j, instead of using the exact inner products d_i^T d_j contained in A^T A, use the approximate inner products d*_i^T d*_j contained in A*^T A*, where d_i is the i-th column vector of A and d*_i is the i-th column vector of A*, i = 1, …, n. These approximate inner products equal (Σ*_{k×k} w_i) · (Σ*_{k×k} w_j), or in matrix notation (Σ*_{k×k} w_i)^T (Σ*_{k×k} w_j), where w_i is the i-th column vector of V_{k×n}^T, because

A*^T A* = (U_{m×k} Σ*_{k×k} V_{k×n}^T)^T (U_{m×k} Σ*_{k×k} V_{k×n}^T) = V_{n×k} Σ*_{k×k}^T Σ*_{k×k} V_{k×n}^T,

using U_{m×k}^T U_{m×k} = I, with Σ*_{k×k} = diag{σ_1, σ_2, …, σ_k}.
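The truncated decomposition described above can be sketched with NumPy's SVD (by the Eckart-Young theorem, this truncation is the best rank-k approximation in the Frobenius norm):

```python
import numpy as np

def rank_k_approx(A, k):
    """Keep the k largest singular values/vectors: A* = U_k Sigma*_k V_k^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])
A_star = rank_k_approx(A, 1)
# The residual ||A - A*||_F equals the discarded singular value(s).
```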

The similarity calculation method:

To compare a new query document q against document i, fold the query into the reduced space as q* = U_{m×k}^T q, and compute the cosine similarity between q* and Σ*_{k×k} w_i:

cos(θ) = q* · (Σ*_{k×k} w_i) / (‖q*‖ ‖Σ*_{k×k} w_i‖), where −1 ≤ cos(θ) ≤ 1,

and θ is the angle between the vector q* and the vector Σ*_{k×k} w_i.
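A sketch of this reduced-space comparison: the query is folded in as q* = U_k^T q and compared against each document's reduced representation Σ*_k w_i:

```python
import numpy as np

def lsi_query_similarities(A, q, k):
    """Cosine similarity of query q against every document, in the rank-k latent space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    q_star = U[:, :k].T @ q                 # q* = U_k^T q
    docs_reduced = np.diag(s[:k]) @ Vt[:k]  # column i is Sigma*_k w_i
    sims = []
    for i in range(docs_reduced.shape[1]):
        d = docs_reduced[:, i]
        sims.append(float(q_star @ d / (np.linalg.norm(q_star) * np.linalg.norm(d))))
    return sims

A = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
q = np.array([1.0, 1.0, 0.0])  # the query equals document 0's term vector
print(lsi_query_similarities(A, q, 2))  # first similarity is 1.0
```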

Step S500: the off-topic detection module combines the TF-IDF* score with the similarity result from the latent semantic indexing module to decide whether the essay is off-topic;

The TF-IDF* algorithm is an improvement on the traditional TF-IDF algorithm.

The traditional TF-IDF algorithm formula is TF-IDF = TF · IDF, with

TF_{i,j} = n_{i,j} / Σ_k n_{k,j},

where n_{i,j} is the number of occurrences of term t_i in document d_j and TF_{i,j} is the frequency with which term t_i occurs in document d_j, and

IDF_i = log(|D| / |{j : t_i ∈ d_j}|),

where |D| is the total number of documents and |{j : t_i ∈ d_j}| is the number of documents containing term t_i.
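The traditional TF and IDF definitions above can be sketched as:

```python
import math
from collections import Counter

def tf(term, doc):
    """TF_{i,j}: occurrences of the term in the document over the document length."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """IDF_i = log(|D| / |{j : t_i in d_j}|); returns 0 for unseen terms."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

docs = [["cat", "sat"], ["cat", "cat", "dog"], ["dog", "ran"]]
weight = tf("cat", docs[1]) * idf("cat", docs)  # TF-IDF = TF * IDF
```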

The improved TF-IDF* algorithm formula is:

where TF is the frequency of the term in the text, N is the total number of documents in the corpus, IDF is the number of documents containing the term, L is the average document length, and I is the length of the term.

The improvements are:

(1) take the logarithm of TF, to smooth the weights of high-frequency words in the essay;

(2) add a constant 0.5 to IDF, to reduce the weights of low-frequency words in the essay.

Rationale of the formula:

(1) the more often a word occurs in a document, the better it reflects the document's topic, so TF is positively correlated;

(2) the fewer corpus documents a word occurs in, the better it distinguishes between documents, so IDF is negatively correlated;

(3) the longer a term, the more distinctive it is, so term length is positively correlated;

(4) the longer a document, the more information it contains, so document length is negatively correlated.

The larger the absolute value of the TF-IDF* score, the more important the term and the better it reflects the topic, i.e. the essay fits the topic; the smaller the absolute value, the less important the term and the less it reflects the topic, i.e. the essay is off-topic.

Off-topic detection method:

P = |0.4 cos θ| + 0.6 |TF-IDF*|,

where P is the essay's off-topic detection index, cos θ is the cosine similarity, and TF-IDF* is the improved term-frequency-inverse-document-frequency score. When P ≥ a, the essay is on topic; when P < a, the essay is judged off-topic; a is a system-preset threshold.
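The decision rule is a direct weighted combination; a sketch, taking cos θ and the TF-IDF* score as already computed (the values here are those of the worked example in the text):

```python
def off_topic_index(cos_theta: float, tfidf_star: float) -> float:
    """P = |0.4 * cos(theta)| + 0.6 * |TF-IDF*|."""
    return abs(0.4 * cos_theta) + 0.6 * abs(tfidf_star)

def is_off_topic(cos_theta: float, tfidf_star: float, a: float = 0.4) -> bool:
    """Off-topic when P < a; a is the system-preset threshold."""
    return off_topic_index(cos_theta, tfidf_star) < a

# Worked example: cos(theta) = 0.4, TF-IDF* = -0.58, threshold a = 0.4
P = off_topic_index(0.4, -0.58)   # 0.16 + 0.348 = 0.508
print(is_off_topic(0.4, -0.58))   # → False: P >= a, so the essay is on topic
```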

Example:

Build the term-document matrix for the essay, perform singular value decomposition, and compute the cosine similarity, obtaining cos θ = 0.4. The frequency TF of the term in the essay is 0.2, the total number of documents in the corpus is N = 3, the number of documents containing the term is IDF = 2, the average document length is L = 6, and the term length is I = 2.

Substituting these values into the formula gives a TF-IDF* score of −0.58, so the essay's off-topic index is P = |0.4 × 0.4| + 0.6 × |−0.58| ≈ 0.51. The system's preset threshold is a = 0.4; since P > a, the essay is judged not to be off-topic.

Step S600: the scoring module assigns a score according to the criteria corresponding to the similarity;

Step S700: convert the essay's similarity and scoring results into charts for display, and generate a report.

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between them. Moreover, the terms "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device.

Finally, it should be noted that the above are only preferred embodiments of the present invention and are not intended to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments or replace some of their technical features with equivalents. Any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. An essay off-topic detection and scoring system based on latent semantic indexing, characterized by comprising a data collection and preprocessing module, a term matrix module, a latent semantic indexing module, an off-topic detection module, a scoring module, and an output module, wherein
the data collection and preprocessing module is used for collecting essays and their corresponding prompts and preprocessing the collected essays;
the term matrix module is used for building a term-document matrix and normalizing the result;
the latent semantic indexing module is used for performing singular value decomposition on the matrix according to the latent semantic indexes and computing the similarity between the essay and a reference essay;
the off-topic detection module is used for detecting whether the essay is off-topic by combining the similarity results of the TF-IDF* algorithm and the latent semantic indexing module;
the scoring module is used for comparing the detection result against preset scoring criteria and giving the essay's score;
and the output module is used for outputting the off-topic detection result and the essay's score to the user.
2. The essay off-topic detection and scoring system based on latent semantic indexing of claim 1, wherein: the data collection and preprocessing module comprises a data acquisition unit and a preprocessing unit; the data acquisition unit consists of a data crawling unit and a data storage unit, the data crawling unit being used for crawling and collecting essays and essay prompts, and the data storage unit being used for storing the data acquired by the data crawling unit; the preprocessing unit consists of a data cleaning unit and a case conversion unit, the data cleaning unit being used for deleting stop words and punctuation marks from the essays, and the case conversion unit being used for converting all English words in the essays to lowercase.
3. The essay off-topic detection and scoring system based on latent semantic indexing of claim 1, wherein: the term matrix module comprises a matrix construction unit and a normalization unit, the matrix construction unit being used for creating a term-document matrix, and the normalization unit being used for normalizing the matrix representation so that similarity computation can be performed.
4. The essay off-topic detection and scoring system based on latent semantic indexing of claim 1, wherein: the latent semantic indexing module comprises a singular value decomposition unit and a similarity detection unit, the singular value decomposition unit being used for performing singular value decomposition on the matrix to reduce the dimensionality of the data, and the similarity detection unit being used for computing the similarity between the essay and a reference essay using cosine similarity.
5. The essay off-topic detection and scoring system based on latent semantic indexing of claim 1, wherein: the off-topic detection module comprises a TF-IDF* algorithm unit and a detection unit, the TF-IDF* algorithm unit being used for computing whether the essay is off-topic, and the detection unit being used for detecting whether the essay is off-topic by jointly considering the similarity computation results of the TF-IDF* algorithm and the latent semantic indexes.
6. The essay off-topic detection and scoring system based on latent semantic indexing of claim 1, wherein: the scoring module comprises a similarity comparison unit and a scoring unit, the similarity comparison unit being used for comparing the essay's similarity with the system standard and dividing it into different grades, and the scoring unit being used for scoring according to the grade result of the similarity comparison unit and the scoring criteria.
7. The essay off-topic detection and scoring system based on latent semantic indexing of claim 1, wherein: the output module comprises a visualization unit and a report generator, the visualization unit being used for converting the essay's similarity and scoring results into chart form, and the report generator being used for generating a report from the essay's similarity and scoring results.
8. An essay off-topic detection method based on latent semantic indexing, applied to the essay off-topic detection and scoring system based on latent semantic indexing of any one of claims 1-7, characterized in that the method comprises the following steps:
step S100: the data acquisition unit collects essays and the prompts corresponding to the essays;
step S200: preprocessing each essay, deleting stop words and punctuation marks, and converting English to lowercase;
step S300: creating a term-document matrix, wherein each row corresponds to a term and each column corresponds to a document;
step S400: the latent semantic indexing module performs singular value decomposition on the matrix and carries out cosine similarity computation;
step S500: the off-topic detection module uses the similarity results of the TF-IDF* algorithm and the latent semantic indexing module to comprehensively detect whether the essay is off-topic;
step S600: the scoring module scores according to the criteria corresponding to the similarity;
step S700: converting the essay's similarity and scoring results into charts for display, and generating a report.
9. The essay off-topic detection method based on latent semantic indexing of claim 8, wherein: the TF-IDF* algorithm in step S500 is improved from the traditional TF-IDF algorithm,
the improved TF-IDF* algorithm formula being:
where TF represents the frequency of occurrence of the term in the text, N represents the total number of documents in the corpus, IDF represents the number of documents containing the term, L represents the average length of the documents, and I represents the length of the term.
10. The essay off-topic detection method based on latent semantic indexing of claim 8, wherein: the off-topic detection method in step S500 is:
P = |0.4 cos θ| + 0.6 |TF-IDF*|,
where P represents the essay's off-topic detection index, cos θ represents the cosine similarity, and TF-IDF* represents the term-frequency-inverse-document-frequency score; when P ≥ a the essay is not off-topic, when P < a the essay is judged off-topic, and a is a system-preset threshold.
CN202310674533.7A 2023-06-08 2023-06-08 Essay off-topic detection and scoring system and method based on latent semantic indexing Pending CN116719902A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310674533.7A CN116719902A (en) 2023-06-08 2023-06-08 Essay off-topic detection and scoring system and method based on latent semantic indexing

Publications (1)

Publication Number Publication Date
CN116719902A true CN116719902A (en) 2023-09-08


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119180272A (en) * 2024-09-06 2024-12-24 北京市计算中心有限公司 Government text quality evaluation method based on semantic analysis and big data index

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101251841A (en) * 2007-05-17 2008-08-27 华东师范大学 Establishment and Retrieval Method of Feature Matrix of Web Documents Based on Semantics
US20090190839A1 (en) * 2008-01-29 2009-07-30 Higgins Derrick C System and method for handling the confounding effect of document length on vector-based similarity scores
CN106126613A (en) * 2016-06-22 2016-11-16 苏州大学 One composition of digressing from the subject determines method and device
CN106570196A (en) * 2016-11-18 2017-04-19 广州视源电子科技股份有限公司 Video program searching method and device
CN107577799A (en) * 2017-09-21 2018-01-12 合肥集知网知识产权运营有限公司 A kind of big data patent retrieval method based on potential applications retrieval model
CN110287291A (en) * 2019-07-03 2019-09-27 桂林电子科技大学 An Unsupervised Method for Sentence Off-topic Analysis of English Short Texts


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XU HAO et al.: "Applied research on an English essay scoring algorithm based on cosine text-similarity computation", Education and Teaching Forum, no. 06, 7 February 2018 (2018-02-07) *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination