CN111325001A - Paper identification, identification model training method, device, equipment and storage medium

Info

Publication number: CN111325001A
Application number: CN201811528227.8A
Authority: CN
Inventors: 王怡然; 陈巍
Original assignee: Pku Founder Information Industry Group Co ltd; Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Current assignee: New Founder Holdings Development Co ltd; Beijing Founder Electronics Co Ltd
Priority date: 2018-12-13
Filing date: 2018-12-13
Publication date: 2020-06-23
Anticipated expiration: 2038-12-13
Also published as: CN111325001B

Abstract

The present disclosure provides a paper identification method, an identification model training method, and a corresponding device, equipment and storage medium. The method includes obtaining a paper to be identified and determining the paragraph identifiers corresponding to that paper according to a preset recognition model, where the preset recognition model is trained in advance on a paper training set. By training a model on a paper training set whose paragraphs carry preset identifiers, the solution obtains a model that can identify the identifier corresponding to each paragraph of a paper, which solves the prior-art problem of relying on manual editing of the paper's format.

Description

Paper identification, identification model training method, device, equipment and storage medium

Technical field

The present disclosure relates to text processing technology, and in particular to a paper identification method, an identification model training method, and a corresponding device, equipment and storage medium.

Background

At present, many papers are published online or offline so that more users can read them. Before a paper is published, it must be edited and proofread, formatted, and then published.

In the prior art, to keep the format of papers uniform and make it easy to identify the content corresponding to each paragraph, authors are required to write their papers in a prescribed format. However, this approach cannot guarantee that every author follows the format, so the format of a paper still has to be edited before publication.

Therefore, publishing in the prior art cannot be fully automated and depends on manual editing of the paper's format.

Summary of the invention

The present disclosure provides a paper identification method, an identification model training method, and a corresponding device, equipment and storage medium, to solve the prior-art problem that the format of a paper must be edited manually before publication.

A first aspect of the present disclosure provides a paper identification method, including:

obtaining a paper to be identified;

determining, according to a preset recognition model, the paragraph identifiers corresponding to the paper to be identified;

wherein the preset recognition model is trained in advance on a paper training set.

A second aspect of the present disclosure provides a training method for a preset recognition model, including:

obtaining a paper training set;

training a model on the paper training set to obtain the preset recognition model;

wherein the paper training set includes a plurality of training papers, and the paragraphs of the training papers are preset with identifiers.

A third aspect of the present disclosure provides a paper identification device, including:

an acquisition module, configured to obtain a paper to be identified;

a determination module, configured to determine, according to a preset recognition model, the paragraph identifiers corresponding to the paper to be identified.

A fourth aspect of the present disclosure provides a training device for a preset recognition model, including:

an acquisition module, configured to obtain a paper training set;

a training module, configured to train a model on the paper training set to obtain the preset recognition model.

A fifth aspect of the present disclosure provides paper identification equipment, including:

a memory;

a processor; and

a computer program;

wherein the computer program is stored in the memory and configured to be executed by the processor to implement the paper identification method described in the first aspect above.

A sixth aspect of the present disclosure provides training equipment for a preset recognition model, including:

a memory;

a processor; and

a computer program;

wherein the computer program is stored in the memory and configured to be executed by the processor to implement the training method for a preset recognition model described in the second aspect above.

A seventh aspect of the present disclosure provides a computer-readable storage medium storing a computer program, the computer program being executed by a processor to implement the paper identification method described in the first aspect above.

An eighth aspect of the present disclosure provides a computer-readable storage medium storing a computer program, the computer program being executed by a processor to implement the training method for a preset recognition model described in the second aspect above.

The technical effects of the paper identification method, identification model training method, device, equipment and storage medium provided by the present disclosure are as follows:

The solution includes obtaining a paper to be identified and determining the paragraph identifiers corresponding to that paper according to a preset recognition model, where the preset recognition model is trained in advance on a paper training set. By training a model on a paper training set whose paragraphs carry preset identifiers, the solution obtains a model that can identify the identifier corresponding to each paragraph of a paper, which solves the prior-art problem of relying on manual editing of the paper's format.

Description of drawings

FIG. 1 is a flowchart of a paper identification method according to an exemplary embodiment of the present invention;

FIG. 2 is a flowchart of a paper identification method according to another exemplary embodiment of the present invention;

FIG. 3 is a flowchart of a training method for a preset recognition model according to an exemplary embodiment of the present invention;

FIG. 4 is a flowchart of a training method for a preset recognition model according to another exemplary embodiment of the present invention;

FIG. 5 is a structural diagram of a paper identification device according to an exemplary embodiment of the present invention;

FIG. 6 is a structural diagram of a paper identification device according to another exemplary embodiment of the present invention;

FIG. 7 is a structural diagram of a training device for a preset recognition model according to an exemplary embodiment of the present invention;

FIG. 8 is a structural diagram of a training device for a preset recognition model according to another exemplary embodiment of the present invention;

FIG. 9 is a structural diagram of paper identification equipment according to an exemplary embodiment of the present invention;

FIG. 10 is a structural diagram of training equipment for a preset recognition model according to an exemplary embodiment of the present invention.

Detailed description

When writing a paper, an author can follow a preset format, for example writing "Title: XXX" first, then "Author: XXX", then "Author unit: XXX", and so on. If the author includes the identifiers "Title:", "Author:" and "Author unit:", these identifiers can be recognized directly in the paper, which determines the content corresponding to each of these paragraphs and provides a basis for automated editing, proofreading and publishing. However, some authors may not follow the preset format strictly. For example, an author may write the title text directly without the "Title:" identifier, in which case the title cannot be accurately identified in the paper.

Based on this, embodiments of the present invention provide a paper identification method and a training method for a preset recognition model. The preset recognition model is trained on a paper training set and can identify the identifier corresponding to each paragraph of a paper from the content of that paragraph. Even if the author has not written identifiers into the paper, the identifier corresponding to each paragraph can still be determined by the model, which solves the prior-art problem of relying on manual editing of the paper's format.

FIG. 1 is a flowchart of a paper identification method according to an exemplary embodiment of the present invention.

As shown in FIG. 1, the paper identification method provided by this embodiment includes:

Step 101: obtain a paper to be identified.

The method provided by this embodiment may be executed by an electronic device with computing capability, such as a computer. The electronic device is used to determine the identifier of each paragraph in a paper.

Specifically, a preset recognition model is stored in the electronic device. The recognition model may have been trained by another device, or by the device that executes the method provided by this embodiment; this embodiment places no limitation on this.

Further, a database for storing papers may be provided, either in the electronic device that executes the method provided by this embodiment or in another device. For example, the database may reside on a server connected to that electronic device. Users upload papers through a client; the server receives the uploaded papers and stores them in the database. The electronic device can access the database, or the database can actively push papers to the electronic device, so that the electronic device obtains the papers to be identified. When obtaining papers to be identified, the electronic device may follow a first-in, first-out principle so that the earliest uploaded paper is processed first.
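The first-in, first-out retrieval described above can be sketched as follows. This is a minimal illustration only: the `PaperQueue` class and its method names are hypothetical, and an in-memory queue stands in for the database, which the source does not specify.

```python
from collections import deque


class PaperQueue:
    """Hypothetical in-memory stand-in for the paper database (FIFO order)."""

    def __init__(self):
        self._pending = deque()

    def upload(self, paper_id, text):
        # New uploads are appended to the back of the queue.
        self._pending.append((paper_id, text))

    def next_paper(self):
        # The earliest upload is taken from the front; None when empty.
        return self._pending.popleft() if self._pending else None


queue = PaperQueue()
queue.upload("paper-1", "first upload")
queue.upload("paper-2", "second upload")
first = queue.next_paper()  # the earliest uploaded paper
```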

In practice, when a user uploads a paper through a user terminal, the paper may also be received directly by the electronic device that executes the method provided by this embodiment, so that the device obtains the paper to be identified and identifies the uploaded paper immediately.

Step 102: determine, according to the preset recognition model, the paragraph identifiers corresponding to the paper to be identified.

The preset recognition model is trained in advance on a paper training set.

Specifically, the preset recognition model is stored in the electronic device that executes the method provided by this embodiment. The paper to be identified can be input into the preset recognition model, which then outputs the identification result.

Further, before the preset recognition model is trained, a large number of papers can be collected as the paper training set. These papers may or may not contain identifiers written by their authors. The identifier corresponding to each paragraph of these papers can be determined in advance, and the paper content together with the corresponding paragraph identifiers can be input into the model to train the weight values inside the model.

In practice, the trained recognition model can be applied directly to output the identifiers of the paper to be identified. Specifically, it can output the identifier corresponding to each paragraph, for example: the first paragraph is the title, the second paragraph is the author, the third paragraph is the author unit, and so on.

If the author wrote explicit identifiers into the paper, the model can determine the identifier of a paragraph from those existing identifiers. If a paragraph has no identifier, the identifier can be computed from the weights in the model.
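The two cases above (an explicit identifier is present, or the identifier must be computed by the model) can be sketched as a simple dispatch. This is an assumption-laden illustration: the prefix table `KNOWN_PREFIXES`, the function `label_paragraph` and the `dummy_model` callable are all hypothetical, and the source does not say whether the check happens inside or outside the model.

```python
# Hypothetical mapping from explicit identifiers to paragraph labels.
KNOWN_PREFIXES = {
    "Title": "title",
    "Author": "author",
    "Author unit": "author_unit",
}


def label_paragraph(paragraph, model_predict):
    """Use an explicit identifier if present, otherwise ask the model."""
    # Treat the full-width colon the same as the ASCII colon.
    head, sep, _ = paragraph.replace("：", ":").partition(":")
    if sep and head.strip() in KNOWN_PREFIXES:
        return KNOWN_PREFIXES[head.strip()]
    return model_predict(paragraph)


def dummy_model(paragraph):
    # Placeholder for the trained recognition model (assumption).
    return "body"


paragraphs = ["Title: A Study", "Jane Doe", "Author unit: X University"]
labels = [label_paragraph(p, dummy_model) for p in paragraphs]
```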

The method provided by this embodiment is used to identify a paper to be identified. It is executed by a device configured with the method, and the device is typically implemented in hardware and/or software.

The paper identification method provided by this embodiment includes obtaining a paper to be identified and determining the paragraph identifiers corresponding to that paper according to a preset recognition model, where the preset recognition model is trained in advance on a paper training set. The method can identify the identifier corresponding to each paragraph of a paper based on the preset recognition model, which solves the prior-art problem of relying on manual editing of the paper's format.

FIG. 2 is a flowchart of a paper identification method according to another exemplary embodiment of the present invention.

As shown in FIG. 2, the paper identification method provided by this embodiment includes:

Step 201: obtain a paper to be identified.

The principle and implementation of step 201 are similar to those of step 101 and are not repeated here.

Step 202: perform word segmentation on the paragraphs of the paper to be identified to obtain the words contained in each paragraph.

In the method provided by this embodiment, the paper to be identified need not be input into the preset recognition model directly. Instead, the paper is first segmented into words, and the preset recognition model then determines the identifier corresponding to each paragraph based on the resulting words.

A word segmentation algorithm can be set in advance and used to segment each paragraph of the paper to be identified. Chinese word segmentation refers to splitting a sequence of Chinese characters into individual words; a word segmentation algorithm recombines a continuous character sequence into a word sequence according to a given specification. A segmentation dictionary can also be provided so that the paragraphs of the paper are segmented according to that dictionary.

Specifically, a paper may contain multiple paragraphs, and each paragraph can be segmented to obtain its words. During segmentation, modal particles and other words without substantive meaning, such as "啊" and "的", can be removed, so that a paragraph is identified only from its meaningful words. This reduces the amount of computation in paragraph recognition.
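Dictionary-based segmentation followed by stop-word removal can be sketched as below. The source does not name a particular segmentation algorithm; forward maximum matching is used here purely as an assumed example, and `DICTIONARY`, `STOP_WORDS` and `segment` are hypothetical names.

```python
def segment(text, dictionary, stop_words, max_len=4):
    """Forward maximum matching, then removal of stop words."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    # Drop particles and other words without substantive meaning.
    return [w for w in words if w not in stop_words]


DICTIONARY = {"论文", "标识", "方法"}  # toy segmentation dictionary (assumption)
STOP_WORDS = {"的", "啊"}              # toy stop-word list (assumption)

tokens = segment("论文的标识方法", DICTIONARY, STOP_WORDS)
```

In this toy run the particle "的" is segmented as its own word and then filtered out, leaving only the dictionary words.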

Step 203: determine the identifier corresponding to each paragraph according to the preset recognition model and the segmented words.

Further, in the method provided by this embodiment, the preset recognition model can determine the identifier of a paragraph from the words in that paragraph. For example, the words of a paragraph can be treated as one word combination and input into the preset recognition model, which determines the paragraph identifier corresponding to that word combination, that is, the identifier of the paragraph to which the word combination belongs.

In another implementation, preset thesauri can be provided. Each preset thesaurus contains multiple words, and the words in one preset thesaurus are considered to belong to the same category. For example, a preset thesaurus may consist of synonyms or near-synonyms.

The preset thesaurus to which each word belongs can be determined, and for each paragraph the frequency with which its words fall into each preset thesaurus can be computed. A paragraph yields multiple words, and each word has a corresponding preset thesaurus, so these frequencies can be counted. For example, if a paragraph contains five words belonging to thesauri A, A, B, C and C respectively, the thesaurus frequencies for that paragraph are A: 0.4, B: 0.2, C: 0.4.

Specifically, the feature vector of each paragraph is determined from these frequencies. The frequencies can be used directly as the feature vector, for example (0.4, 0.2, 0.4). In general, a single paragraph will not contain words from every preset thesaurus. To express the meaning of the paragraph more accurately through the feature vector, the frequency of any thesaurus that does not appear can be set to 0. For example, if the preset thesauri also include thesaurus D but none of the paragraph's words fall into it, the feature vector can be (0.4, 0.2, 0.4, 0).
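The frequency computation in the example above can be sketched as follows. The function name `feature_vector` and the word-to-thesaurus mapping are hypothetical; the sketch also assumes that a word found in no thesaurus is simply ignored, a case the source does not address.

```python
from collections import Counter


def feature_vector(tokens, word_to_thesaurus, thesaurus_order):
    """Share of a paragraph's words falling into each preset thesaurus."""
    counts = Counter(
        word_to_thesaurus[t] for t in tokens if t in word_to_thesaurus
    )
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(thesaurus_order)
    # A thesaurus with no matching word gets frequency 0.
    return [counts[name] / total for name in thesaurus_order]


# Five words belonging to thesauri A, A, B, C, C, as in the example above.
WORD_TO_THESAURUS = {"w1": "A", "w2": "A", "w3": "B", "w4": "C", "w5": "C"}
vector = feature_vector(
    ["w1", "w2", "w3", "w4", "w5"], WORD_TO_THESAURUS, ["A", "B", "C", "D"]
)
```

With thesaurus D included in the ordering, the result has four components and the last one is 0, matching the vector (0.4, 0.2, 0.4, 0) given in the text.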

Further, representing each paragraph by its feature vector instead of its full text reduces the amount of data the recognition model has to process.

In practice, the identifier of a paragraph can be determined from its feature vector using the preset recognition model. The feature vector is input into the preset recognition model, which evaluates it with its internal weight values and outputs the paragraph identifier corresponding to the feature vector, that is, the identifier of the paragraph the vector represents. For example, the output paragraph identifier may be "paper content".
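One way to picture how internal weights turn a feature vector into a paragraph identifier is a linear scorer that picks the highest-scoring label. The source does not state the model type, so the linear form, the `predict` function and the hand-written `WEIGHTS` table below are all assumptions for illustration.

```python
def predict(features, weights):
    """Return the label whose weight vector scores the features highest."""
    scores = {
        label: sum(w * x for w, x in zip(label_weights, features))
        for label, label_weights in weights.items()
    }
    return max(scores, key=scores.get)


# Hypothetical weights: one weight per thesaurus (A, B, C, D) for each label.
WEIGHTS = {
    "title": [1.0, 0.0, 0.0, 0.0],
    "author": [0.0, 1.0, 0.0, 0.0],
    "paper content": [0.0, 0.0, 1.0, 1.0],
}

label = predict([0.1, 0.2, 0.7, 0.0], WEIGHTS)
```

Here the third component dominates the feature vector, so the label weighted toward thesaurus C is selected.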

In the method provided by this embodiment, the paper to be identified is segmented into words, a feature vector is determined for each paragraph from the resulting words, and the feature vector is input into the preset recognition model, which determines the identifier corresponding to the paragraph. This reduces the computational load of the recognition model.

FIG. 3 is a flowchart of a training method for a preset recognition model according to an exemplary embodiment of the present invention.

As shown in FIG. 3, the training method for a preset recognition model provided by this embodiment includes:

Step 301: obtain a paper training set.

The method provided by this embodiment may be executed by an electronic device with computing capability, such as a computer. The electronic device is used to train the preset recognition model.

Specifically, the electronic device may be used only for training the preset recognition model, or it may also identify papers. In other words, the device used for training and the device used for identifying papers may be the same or different; this embodiment places no limitation on this.

Further, a database for storing the paper training set may be provided, either in the electronic device that executes the method provided by this embodiment or in another device. For example, the database may reside on a server connected to that electronic device. The papers can be obtained in advance, and the paragraph identifier corresponding to each paragraph of each paper can be determined.

Step 302: train a model on the paper training set to obtain the preset recognition model.

The paper training set includes a plurality of training papers, and the paragraphs of the training papers are preset with identifiers.

Specifically, the model can be trained on the prepared collection of papers to obtain the preset recognition model.

Further, the weight values inside the model can be initialized, and the papers in the collection are then input into the model. The model determines the identifier of each paragraph according to its current weights, the result is compared with the preset paragraph identifiers, and the weight values inside the model are adjusted according to the comparison.

In practice, the papers can be input into the model one by one, and the weights in the model are corrected once for each paper. After many such adjustments, an accurate model for identifying papers is obtained.
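The predict, compare and adjust cycle described above can be sketched with a multiclass perceptron. This is an assumed stand-in, since the source does not name a learning rule; `train`, `predict` and the toy samples are hypothetical, and the sketch updates weights per paragraph sample rather than once per paper.

```python
def predict(features, weights):
    """Return the label whose weight vector scores the features highest."""
    scores = {
        label: sum(w * x for w, x in zip(ws, features))
        for label, ws in weights.items()
    }
    return max(scores, key=scores.get)


def train(samples, labels, n_features, epochs=10):
    """Perceptron-style training: adjust weights whenever a prediction is wrong."""
    # Initialize all weight values (here to zero).
    weights = {label: [0.0] * n_features for label in labels}
    for _ in range(epochs):
        for features, expected in samples:
            predicted = predict(features, weights)
            if predicted != expected:
                # Move the expected label toward the sample and the
                # wrongly predicted label away from it.
                for i, x in enumerate(features):
                    weights[expected][i] += x
                    weights[predicted][i] -= x
    return weights


# Toy labeled paragraphs: (feature vector, preset identifier).
SAMPLES = [([1.0, 0.0], "title"), ([0.0, 1.0], "author")]
model = train(SAMPLES, ["title", "author"], n_features=2)
```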

A portion of the papers in the paper training set can also be used as a test set. After the model has been trained, the papers in the test set are input into the model, the paragraph identifier of each paragraph is determined, and the results are compared with the predetermined identifiers. If the accuracy is higher than a preset threshold, the model is considered accurate and can be used to identify papers provided by users.

Specifically, the papers in the test set may also be used for training the model, or may be used only for testing. The preset threshold can be set as required.
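The acceptance check against a preset threshold can be sketched as follows. The function names and the threshold values are hypothetical; the source leaves the threshold to be set as required.

```python
def accuracy(predicted, expected):
    """Fraction of paragraphs whose predicted identifier matches the preset one."""
    if not expected:
        raise ValueError("test set must not be empty")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def is_model_acceptable(predicted, expected, threshold):
    # The model is accepted only if accuracy is higher than the threshold.
    return accuracy(predicted, expected) > threshold


predicted_ids = ["title", "author", "body", "body"]
preset_ids = ["title", "author", "author_unit", "body"]
score = accuracy(predicted_ids, preset_ids)  # 3 of 4 paragraphs match
```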

The method provided by this embodiment is used to train a preset recognition model. It is executed by a device configured with the method, and the device is typically implemented in hardware and/or software.

The training method for a preset recognition model provided by this embodiment includes obtaining a paper training set and training a model on it to obtain the preset recognition model, where the paper training set includes a plurality of training papers whose paragraphs are preset with identifiers. By training a model on a paper training set whose paragraphs carry preset identifiers, the method obtains a model that can identify the identifier corresponding to each paragraph of a paper, which solves the prior-art problem of relying on manual editing of the paper's format.

FIG. 4 is a flowchart of a training method for a preset recognition model according to another exemplary embodiment of the present invention.

As shown in FIG. 4, the training method for a preset recognition model provided by this embodiment includes:

Step 401: obtain a paper training set.

The principle and implementation of step 401 are similar to those of step 301 and are not repeated here.

Step 402: perform word segmentation on the paragraphs of the training papers to obtain the training words contained in each paragraph.

In the method provided by this embodiment, the training papers need not be input into the model directly during training. Instead, the training papers are first segmented into words, and the model is then trained on the resulting training words.

A word segmentation algorithm can be set in advance and used to segment each paragraph of the training papers. Chinese word segmentation refers to splitting a sequence of Chinese characters into individual words; a word segmentation algorithm recombines a continuous character sequence into a word sequence according to a given specification. A segmentation dictionary can also be provided so that the paragraphs of the papers are segmented according to that dictionary.

Specifically, a training paper may contain multiple paragraphs, and each paragraph can be segmented to obtain its training words. During segmentation, modal particles and other words without substantive meaning, such as "啊" and "的", can be removed, so that a paragraph is identified only from its meaningful words. This reduces the amount of computation in paragraph recognition.

步骤403，根据训练分词对模型进行训练，得到预设识别模型。Step 403: Train the model according to the training word segmentation to obtain a preset recognition model.

进一步的，本实施例提供的方法中，可以基于段落中的训练分词对模型进行训练。例如，可以先初始化模型中的权重值，再将各个段落中的训练分词作为一个分词组合，输入模型，模型可以根据当前的权重值确定出分词组合对应的段落标识，也就是该分词组合对应的段落的标识，可以将确定的标识与该段落预先设置的标识进行比对，并根据比对结果调整模型中的权重值。通过多轮的调整，就能够得到准确的预设识别模型。Further, in the method provided in this embodiment, the model may be trained based on the training word segmentation in the paragraph. For example, you can initialize the weight value in the model first, then use the training word segmentation in each paragraph as a word segmentation combination, input the model, and the model can determine the paragraph identifier corresponding to the word segmentation combination according to the current weight value, that is, the corresponding word segmentation combination. For the identification of the paragraph, the determined identification can be compared with the preset identification of the paragraph, and the weight value in the model can be adjusted according to the comparison result. Through multiple rounds of adjustment, an accurate preset recognition model can be obtained.

The preset word bank to which each training word segment belongs can be determined, along with the frequency with which the training word segments of each paragraph belong to each preset word bank. Multiple training word segments can be determined for a paragraph, and each has a corresponding preset word bank, so these frequencies can be counted. For example, if a paragraph includes 5 training word segments that belong to word banks A, A, B, C, and C respectively, the word-bank frequencies for the paragraph are A: 0.4, B: 0.2, C: 0.4.

Specifically, the feature vector corresponding to each paragraph is determined from these frequencies. The frequencies can be used directly as the feature vector, such as (0.4, 0.2, 0.4). In general, the words of the preset word banks do not all appear in the same paragraph. Therefore, to express the meaning of the paragraph more accurately through the feature vector, the frequency of a word bank that does not appear can be set to 0. For example, if the preset word banks also include word bank D but none of the paragraph's training word segments falls into it, the feature vector can be (0.4, 0.2, 0.4, 0).
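The frequency calculation above can be reproduced with a short sketch. The word-to-bank mapping `WORD_TO_BANK`, the placeholder words `w1` to `w5`, and the function name `feature_vector` are assumptions for illustration; skipping segments that belong to no word bank is also an assumption, since the text treats every segment as having a word bank.

```python
from collections import Counter

# Fixed order of the preset word banks; it defines the vector layout.
WORD_BANKS = ["A", "B", "C", "D"]

# Hypothetical mapping from word segment to its preset word bank.
WORD_TO_BANK = {"w1": "A", "w2": "A", "w3": "B", "w4": "C", "w5": "C"}


def feature_vector(segments):
    """Return, for each word bank, the share of segments that fall into it.

    Word banks with no matching segment get 0.0. Segments that belong to
    no word bank are ignored (an assumption of this sketch).
    """
    banks = [WORD_TO_BANK[s] for s in segments if s in WORD_TO_BANK]
    if not banks:
        return [0.0] * len(WORD_BANKS)
    counts = Counter(banks)
    return [counts[name] / len(banks) for name in WORD_BANKS]


# Five segments falling into word banks A, A, B, C, C, as in the text.
print(feature_vector(["w1", "w2", "w3", "w4", "w5"]))  # [0.4, 0.2, 0.4, 0.0]
```

Keeping the word banks in a fixed order matters: every paragraph must map to a vector with the same layout for the model to compare them.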

Further, representing each paragraph of the paper by its feature vector, rather than by the full paragraph text, reduces the amount of data the model has to process.

In practice, the model can be trained on the training feature vector and preset identifier of each paragraph to obtain the preset recognition model. The feature vector is input to the model, which evaluates it with its current internal weight values to determine the corresponding paragraph identifier, that is, the identifier of the paragraph the feature vector represents. For example, the determined paragraph identifier may be “paper content”. The determined identifier is compared with the preset identifier, and the weight values in the model are adjusted according to the comparison result.
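The compare-and-adjust loop can be sketched as follows. The disclosure does not specify the model type, so a multiclass perceptron is used here only as a simple stand-in that has explicit weight values; the identifiers in `LABELS`, the vectors in `SAMPLES`, and the function names are all hypothetical.

```python
# Hypothetical paragraph identifiers and training data: each sample pairs
# a word-bank frequency vector with the identifier preset for the paragraph.
LABELS = ["title", "abstract", "body"]
SAMPLES = [
    ([0.8, 0.1, 0.1, 0.0], "title"),
    ([0.1, 0.7, 0.2, 0.0], "abstract"),
    ([0.0, 0.2, 0.4, 0.4], "body"),
]


def predict(weights, vector):
    """Score every identifier with the current weights; return the best."""
    def score(label):
        return sum(w * x for w, x in zip(weights[label], vector))
    return max(LABELS, key=score)


def train(samples, epochs=20, learning_rate=1.0):
    """Multiclass perceptron: compare the determined identifier with the
    preset one and adjust the weights only when they differ."""
    dim = len(samples[0][0])
    weights = {label: [0.0] * dim for label in LABELS}  # initialize weights
    for _ in range(epochs):                             # multiple rounds
        for vector, preset in samples:
            determined = predict(weights, vector)
            if determined != preset:
                for i, x in enumerate(vector):
                    weights[preset][i] += learning_rate * x
                    weights[determined][i] -= learning_rate * x
    return weights


weights = train(SAMPLES)
print([predict(weights, vector) for vector, _ in SAMPLES])
```

A real system would likely use a larger labeled set and hold out part of it for evaluation; the three samples here only show the mechanics of the weight update.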

The method provided by this embodiment segments the training papers, determines the feature vector of each paragraph from the resulting training word segments, and inputs the feature vectors to the model so that the model determines the identifier of each paragraph; the weight values in the model are then adjusted against the preset identifiers. This reduces the amount of computation in model training.

FIG. 5 is a structural diagram of a paper identification apparatus according to an exemplary embodiment of the present invention.

As shown in FIG. 5, the paper identification apparatus provided by this embodiment includes:

an acquisition module 51, configured to acquire a paper to be identified;

a determination module 52, configured to determine the paragraph identifiers corresponding to the paper to be identified according to a preset recognition model;

The paper identification apparatus provided by this embodiment includes an acquisition module, configured to acquire a paper to be identified, and a determination module, configured to determine the paragraph identifiers corresponding to the paper to be identified according to a preset recognition model, where the preset recognition model is trained in advance on a paper training set. The apparatus can identify the identifier of each paragraph in a paper based on the preset recognition model, which solves the prior-art problem of relying on manual editing of a paper's format.

The specific principle and implementation of the paper identification apparatus provided by this embodiment are similar to those of the embodiment shown in FIG. 1 and are not repeated here.

FIG. 6 is a structural diagram of a paper identification apparatus according to another exemplary embodiment of the present invention.

As shown in FIG. 6, on the basis of the embodiment shown in FIG. 5, in the paper identification apparatus provided by this embodiment the determination module 52 includes:

a word segmentation unit 521, configured to segment the paragraphs included in the paper to be identified to obtain the word segments included in each paragraph;

a determining unit 522, configured to determine the identifier corresponding to each paragraph according to the preset recognition model and the word segments.

The determining unit 522 is specifically configured to:

determine the preset word bank to which each word segment belongs, and determine the frequency with which the word segments included in each paragraph belong to each preset word bank;

determine the feature vector corresponding to each paragraph according to the frequencies;

determine, based on the preset recognition model, the identifier corresponding to each paragraph according to its feature vector.
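The three operations of the determining unit can be sketched as a small class. This is only an illustration under stated assumptions: the class name `DeterminingUnit`, the word-bank mapping, and `toy_model` (a stand-in that picks the identifier of the dominant word bank) are hypothetical, and a real preset recognition model would be the trained model rather than a fixed rule.

```python
from collections import Counter


class DeterminingUnit:
    """Maps segmented paragraphs to identifiers via word-bank frequencies."""

    def __init__(self, word_to_bank, word_banks, model):
        self.word_to_bank = word_to_bank  # word segment -> preset word bank
        self.word_banks = word_banks      # fixed order of word banks
        self.model = model                # callable: feature vector -> identifier

    def feature_vector(self, segments):
        # Count how often the paragraph's segments fall into each word bank.
        banks = [self.word_to_bank[s] for s in segments if s in self.word_to_bank]
        counts = Counter(banks)
        total = len(banks) or 1           # avoid division by zero
        return [counts[name] / total for name in self.word_banks]

    def identify(self, segmented_paragraphs):
        # One identifier per paragraph, in document order.
        return [self.model(self.feature_vector(s)) for s in segmented_paragraphs]


def toy_model(vector):
    """Stand-in for the preset recognition model (an assumption)."""
    identifiers = ["title", "abstract", "body"]
    best = max(range(len(vector)), key=vector.__getitem__)
    return identifiers[best]


unit = DeterminingUnit(
    word_to_bank={"摘要": "B", "方法": "C", "模型": "C"},
    word_banks=["A", "B", "C"],
    model=toy_model,
)
print(unit.identify([["摘要"], ["方法", "模型", "摘要"]]))  # ['abstract', 'body']
```

Passing the model in as a callable keeps the unit independent of how the preset recognition model was trained.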

The specific principle and implementation of the paper identification apparatus provided by this embodiment are similar to those of the embodiment shown in FIG. 2 and are not repeated here.

FIG. 7 is a structural diagram of a training apparatus for a preset recognition model according to an exemplary embodiment of the present invention.

As shown in FIG. 7, the training apparatus for a preset recognition model provided by this embodiment includes:

an acquisition module 71, configured to acquire a paper training set;

a training module 72, configured to train a model according to the paper training set to obtain a preset recognition model;

The training apparatus for a preset recognition model provided by this embodiment includes an acquisition module, configured to acquire a paper training set, and a training module, configured to train a model according to the paper training set to obtain a preset recognition model, where the paper training set includes multiple training papers whose paragraphs are preset with identifiers. By training a model on a paper training set whose paragraphs carry preset identifiers, the apparatus obtains a model for representing papers, so that the identifier of each paragraph in a paper can be identified based on the preset recognition model, which solves the prior-art problem of relying on manual editing of a paper's format.

The specific principle and implementation of the training apparatus for a preset recognition model provided by this embodiment are similar to those of the embodiment shown in FIG. 3 and are not repeated here.

FIG. 8 is a structural diagram of a training apparatus for a preset recognition model according to another exemplary embodiment of the present invention.

As shown in FIG. 8, on the basis of the embodiment shown in FIG. 7, in the training apparatus for a preset recognition model provided by this embodiment the training module 72 includes:

a word segmentation unit 721, configured to segment the paragraphs included in the training papers to obtain the training word segments included in each paragraph;

a training unit 722, configured to train the model according to the training word segments to obtain the preset recognition model.

The training unit 722 is specifically configured to:

determine the preset word bank to which each training word segment belongs, and determine the frequency with which the training word segments included in each paragraph belong to each preset word bank;

determine the training feature vector corresponding to each paragraph according to the frequencies;

train the model according to the training feature vector and preset identifier corresponding to each paragraph to obtain the preset recognition model.

FIG. 9 is a structural diagram of a paper identification device according to an exemplary embodiment of the present invention.

As shown in FIG. 9, the paper identification device provided by this embodiment includes:

a memory 91;

a processor 92; and

a computer program;

wherein the computer program is stored in the memory 91 and configured to be executed by the processor 92 to implement any of the paper identification methods shown in FIGS. 1-2.

As shown in FIG. 10, the training device for a preset recognition model provided by this embodiment includes:

a memory 1001;

a processor 1002; and

a computer program;

wherein the computer program is stored in the memory 1001 and configured to be executed by the processor 1002 to implement any of the training methods for a preset recognition model shown in FIGS. 3-4.

This embodiment also provides a computer-readable storage medium on which a computer program is stored.

The computer program is executed by a processor to implement any of the paper identification methods shown in FIGS. 1-2.

The computer program is executed by a processor to implement any of the training methods for a preset recognition model shown in FIGS. 3-4.

Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments can still be modified, and some or all of their technical features can be replaced with equivalents, without such modifications or replacements causing the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A paper identification method, comprising:

acquiring a paper to be identified;

determining paragraph identifiers corresponding to the paper to be identified according to a preset recognition model;

wherein the preset recognition model is obtained in advance by training on a paper training set.

2. The method according to claim 1, wherein determining the paragraph identifiers corresponding to the paper to be identified according to the preset recognition model comprises:

performing word segmentation on the paragraphs included in the paper to be identified to obtain the word segments included in each paragraph;

and determining the identifier corresponding to each paragraph according to the preset recognition model and the word segments.

3. The method according to claim 2, wherein determining the identifier corresponding to each paragraph according to the preset recognition model and the word segments comprises:

determining the preset word bank to which each word segment belongs, and determining the frequency with which the word segments included in each paragraph belong to each preset word bank;

determining a feature vector corresponding to each paragraph according to the frequencies;

and determining, based on the preset recognition model, the identifier corresponding to each paragraph according to the feature vector.

4. A training method for a preset recognition model, comprising:

acquiring a paper training set;

training a model according to the paper training set to obtain a preset recognition model;

wherein the paper training set comprises a plurality of training papers, and the paragraphs of the training papers are preset with identifiers.

5. The method according to claim 4, wherein training the model according to the paper training set to obtain the preset recognition model comprises:

performing word segmentation on the paragraphs included in the training papers to obtain the training word segments included in each paragraph;

and training the model according to the training word segments to obtain the preset recognition model.

6. The method according to claim 5, wherein training the model according to the training word segments to obtain the preset recognition model comprises:

determining the preset word bank to which each training word segment belongs, and determining the frequency with which the training word segments included in each paragraph belong to each preset word bank;

determining a training feature vector corresponding to each paragraph according to the frequencies;

and training the model according to the training feature vector and preset identifier corresponding to each paragraph to obtain the preset recognition model.

7. A paper identification apparatus, comprising:

an acquisition module, configured to acquire a paper to be identified;

a determination module, configured to determine paragraph identifiers corresponding to the paper to be identified according to a preset recognition model;

8. A training apparatus for a preset recognition model, comprising:

an acquisition module, configured to acquire a paper training set;

a training module, configured to train a model according to the paper training set to obtain a preset recognition model;

9. A paper identification device, comprising:

a memory;

a processor; and

a computer program;

wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any of claims 1-3.

10. A training device for a preset recognition model, comprising:

a memory;

a processor; and

a computer program;

wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any of claims 4-6.

11. A computer-readable storage medium having a computer program stored thereon,

wherein the computer program is executed by a processor to implement the method according to any of claims 1-3.

12. A computer-readable storage medium having a computer program stored thereon,

wherein the computer program is executed by a processor to implement the method according to any of claims 4-6.