JP2008015774A - Counterfeit document detection system and program

Info

Publication number: JP2008015774A
Application number: JP2006186004A
Authority: JP
Inventors: Takashi Yukawa (湯川 高志); Kazuhide Yamamoto (山本 和英); Yoshimi Fukumura (福村 好美)
Original assignee: Nagaoka University of Technology NUC
Current assignee: Nagaoka University of Technology NUC
Priority date: 2006-07-05
Filing date: 2006-07-05
Publication date: 2008-01-24

Abstract

[Problem] To provide a counterfeit document detection system and program capable of detecting similar parts between a plurality of documents.
[Solution] A counterfeit document detection system 1 comprises: an input interface unit 2 for inputting a plurality of document files 7; a document storage unit 3 that stores the input document files 7; an imitation inspection unit 4 that inspects an input set of document files for imitation parts and outputs the inspection result; an imitation inspection driving unit 5 that sequentially extracts sets of document files (document files 7) from the group of document files stored in the document storage unit 3, inputs each set to the imitation inspection unit 4, and holds the inspection result that the imitation inspection unit 4 outputs for that set; and a result display unit 6 that presents the imitation relationships among the stored document files 7 on the basis of the inspection results held by the imitation inspection driving unit 5.
[Selected drawing] FIG. 1

Description

The present invention relates to a counterfeit document detection system and program capable of detecting imitation parts between a plurality of documents.

In recent years, as electronic report writing (for example, with word processors) has become common, cases of plagiarizing the reports of others (for example, senior students) or Web pages on the Internet have been increasing. Checking for this places a heavy burden on teachers, and cases may be overlooked, so computer support is needed. In e-learning in particular, reports are submitted in electronic form and the behavior of the students cannot be observed directly, so checking for imitation electronically is especially valuable.
Patent Document 1: JP 2003-30238 A

However, there has so far been no system that detects, within a plurality of documents to be searched, imitation parts that are mutually similar between the documents.

As a related well-known technique, there is a technique for retrieving, from a set of documents, the documents that match a search key (for example, Patent Document 1). Even if that technique is used to retrieve related documents that may be similar, however, a teacher checking reports for imitation still has to compare the report under examination with each of the related documents one by one. Against this background, automation of a checking task that has relied on manual work has been strongly desired.

In view of the above problems, an object of the present invention is to provide a counterfeit document detection system and program capable of detecting similar parts between a plurality of documents.

The counterfeit document detection system according to claim 1 of the present invention comprises: an input interface unit for inputting a plurality of document files; a document storage unit that stores the input document files; an imitation inspection unit that inspects an input set of document files for imitation parts and outputs the inspection result; an imitation inspection driving unit that sequentially extracts sets of document files from the group of document files stored in the document storage unit, inputs each set to the imitation inspection unit, and holds the inspection result that the imitation inspection unit outputs for that set; and a result display unit that presents the imitation relationships among the stored document files on the basis of the inspection results held by the imitation inspection driving unit.

In this way, imitation among the plurality of document files input to the system can be inspected and imitation documents can be detected.

In the counterfeit document detection system according to claim 2 of the present invention, the document storage unit is capable of classifying the document files to be stored, and the imitation inspection driving unit, on the basis of the classification in the document storage unit and an instruction from the user, sequentially extracts sets of document files from the document storage unit, inputs each set to the imitation inspection unit, and holds the inspection result that the imitation inspection unit outputs for that set.

In this way, the user can freely specify the classification to be inspected, and imitation inspection can be performed on the document files belonging to that classification.

In the counterfeit document detection system according to claim 3 of the present invention, the result display unit displays, as a list in table form, the document file names of the sets of document files that have an imitation relationship together with the degree of imitation for each set, and displays the imitation parts for any set of document files chosen from that list.

In this way, the document files that have an imitation relationship can be confirmed at a glance from the list, and the user can confirm specifically which parts of a document file are imitation parts.

The counterfeit document detection system according to claim 4 of the present invention comprises: an input interface unit for inputting a document file; a document storage unit that stores the document file input via the document input interface unit; a document search unit that generates a group of search terms from the text written in an input document file, searches electronically accessible documents outside the system on the basis of the generated search terms, and outputs the resulting group of documents; a search result document storage unit that stores, as search result document files, the group of documents obtained by inputting the input document file to the document search unit, after assigning to each document an identifier by which it can be identified; an imitation inspection unit that inspects an input set of document files for imitation parts and outputs the inspection result; an imitation inspection driving unit that sequentially extracts the search result document files from the search result document storage unit, pairs each of them with the input document file stored in the document storage unit, inputs each resulting set of document files to the imitation inspection unit, and holds the inspection result that the imitation inspection unit outputs for that set; and a result display unit that presents the imitation relationships between the input document file and the search result document files retrieved by the document search unit, on the basis of the inspection results held by the imitation inspection driving unit.

In this way, imitation is inspected between the document file input to the system for inspection and a large, unspecified number of related documents retrieved from outside the system, so whether the input document file is an imitation document can be judged over a wider range of documents.

In the counterfeit document detection system according to claim 5 of the present invention, the result display unit displays, as a list in table form, the document file identifiers of the search result document files that have an imitation relationship with the input document file together with the degree of imitation for each search result document file, and displays, for any search result document file chosen from that list of identifiers, the parts that imitate or are imitated by the input document file.

The counterfeit document detection system according to claim 6 of the present invention comprises: an input interface unit for inputting a plurality of document files; a document storage unit that stores the input document files; a document search unit that generates a group of search terms from the text written in an input document file, searches externally available, electronically accessible documents on the basis of the generated search terms, and outputs the resulting group of documents; a search result document storage unit that, for each document file stored in the document storage unit, stores as search result document files the group of documents obtained by inputting that document file to the document search unit, after assigning to each document an identifier by which it can be identified; an imitation inspection unit that inspects an input set of document files for imitation parts and outputs the inspection result; an imitation inspection driving unit that sequentially extracts sets of document files from the group of document files stored in the document storage unit and the group of search result document files stored in the search result document storage unit, inputs each set to the imitation inspection unit, and holds the inspection result that the imitation inspection unit outputs for that set; and a result display unit that presents, on the basis of the inspection results held by the imitation inspection driving unit, the imitation relationships among the stored document files and between the input document files and the search result document files retrieved by the document search unit.

In this way, imitation is inspected among the plurality of document files input to the system and also between those document files and a large, unspecified number of related documents retrieved from outside the system, so whether an input document file is an imitation document can be judged over an even wider range of documents.

In the counterfeit document detection system according to claim 7 of the present invention, the document storage unit and the search result document storage unit are capable of classifying the document files and the search result document files to be stored, and the imitation inspection driving unit, on the basis of the classification in the document storage unit and the search result document storage unit and an instruction from the user, sequentially extracts sets of document files from the document storage unit and the search result document storage unit, inputs each set to the imitation inspection unit, and holds the inspection result that the imitation inspection unit outputs for that set.

In the counterfeit document detection system according to claim 8 of the present invention, the result display unit displays, as a list in table form, the document file names or document file identifiers of the sets of document files that have an imitation relationship together with the degree of imitation for each set, and displays the imitation parts for any set of document files chosen from that list.

In the counterfeit document detection system according to claim 9 of the present invention, the result display unit displays, for the collection of sets of document files that have an imitation relationship, a graph in which each document file in the collection is a node, the nodes of document files that have an imitation relationship are connected by an edge, and the visual features of each edge are determined on the basis of the degree of imitation between the document files.

In this way, the inspection results for the document files are presented as a graph, so the imitation relationships among the document files become visually clear and the user can understand the inspection results easily.

The program according to claim 10 of the present invention causes a computer to function as the counterfeit document detection system according to any one of claims 1 to 9.

In this way, a counterfeit document detection system can easily be built on a computer.

According to claim 1 of the present invention, a counterfeit document detection system capable of detecting similar parts between a plurality of documents can be provided.

According to claim 2 of the present invention, inspection efficiency can be improved because the targets of imitation inspection can be selected by classification.

According to claim 3 of the present invention, the imitation inspection results can be presented to the user in an easily understandable manner.

According to claim 4 of the present invention, a counterfeit document detection system can be provided that uses related documents outside the system to detect imitation over a wider range of documents.

According to claim 5 of the present invention, the imitation inspection results can be presented to the user in an easily understandable manner.

According to claim 6 of the present invention, a counterfeit document detection system can be provided that detects similar parts between a plurality of documents and also uses related documents outside the system to detect imitation over a wider range of documents.

According to claim 7 of the present invention, inspection efficiency can be improved because the targets of imitation inspection can be selected by classification.

According to claim 8 of the present invention, the imitation inspection results can be presented to the user in an easily understandable manner.

According to claim 9 of the present invention, the imitation inspection results can be presented to the user visually, in an even more easily understandable manner.

According to claim 10 of the present invention, a counterfeit document detection system can easily be built on a computer.

Preferred embodiments of the counterfeit document detection system and program according to the present invention are described below with reference to the accompanying drawings. In the embodiments, identical parts are given identical reference numerals, and description of common parts is omitted as far as possible to avoid repetition. Each embodiment describes only the counterfeit document detection system; the counterfeit document detection program of the present invention causes a computer on which it is installed to function in the same way as the components of the system of each embodiment, so its description is omitted.

FIG. 1 is a system configuration diagram of the counterfeit document detection system in the first embodiment. The counterfeit document detection system 1 comprises an input interface unit 2, a document storage unit 3, an imitation inspection unit 4, an imitation inspection driving unit 5, and a result display unit 6.

The document input interface unit 2 allows the user to input a plurality of electronic-format files 7 (hereinafter, document files 7) to the document storage unit 3 using operation input means such as a keyboard or a mouse; the operation screen that the user actually operates corresponds to this unit. The document files 7 may be documents of any kind from any field, for example reports, papers, novels, or newspaper articles.

The form of the document input interface unit 2 differs slightly depending on the system configuration of the counterfeit document detection system 1. Specifically, if the counterfeit document detection system 1 is built on a network such as the Internet or a LAN, the document input interface unit 2 corresponds to a client operated by the user and includes data communication means for uploading the document files 7 over the network to the document storage unit 3 housed in a server. On that client, the operation screen is provided to the user by, for example, a dedicated application or a Web page.

The document storage unit 3 receives the input document files 7 from the document input interface unit 2 and stores them; a database kept on a storage device such as a hard disk corresponds to this unit. The document storage unit 3 saves and manages the document files 7 within the system. In particular, when saving the document files 7, the document storage unit 3 is configured so that the files to be stored can be classified on the basis of an instruction from the user, for example by the year in which a report was submitted. The classification need not rely on an instruction from the user; it may also be performed automatically according to preset classification criteria.

The imitation inspection unit 4 inspects two input document files 7 against each other for imitation parts and outputs the imitation parts and the degree of imitation as its result. The imitation inspection algorithm is described later.

On the basis of the classification in the document storage unit 3 and an instruction from the user, the imitation inspection driving unit 5 sequentially extracts sets of two document files 7 from the group of document files stored in the document storage unit 3, inputs each set to the imitation inspection unit 4, and holds the inspection result that the imitation inspection unit 4 outputs for that set of document files 7.
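The pairing behaviour of the imitation inspection driving unit 5 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `store` layout (a dictionary keyed by classification), the function names `run_inspection` and `toy_inspect`, and the toy scoring (a count of shared character 3-grams) are all assumptions made for the example.

```python
from itertools import combinations


def toy_inspect(text_a, text_b, n=3):
    """Hypothetical stand-in for the imitation inspection unit:
    returns the number of distinct character n-grams the two texts share."""
    grams_a = {text_a[i:i + n] for i in range(len(text_a) - n + 1)}
    grams_b = {text_b[i:i + n] for i in range(len(text_b) - n + 1)}
    return len(grams_a & grams_b)


def run_inspection(store, category, inspect):
    """Take every pair of documents in the chosen classification,
    pass it to `inspect`, and hold the results keyed by file-name pair."""
    docs = store.get(category, {})
    results = {}
    for name_a, name_b in combinations(sorted(docs), 2):
        results[(name_a, name_b)] = inspect(docs[name_a], docs[name_b])
    return results


# Assumed storage layout: classification (e.g. submission year) -> {file name: text}
store = {
    "2005": {"r1.txt": "abcdef", "r2.txt": "xbcdey", "r3.txt": "zzzzzz"},
    "2006": {"r4.txt": "abcdef"},
}
results = run_inspection(store, "2005", toy_inspect)
print(results)
```

Using `itertools.combinations` visits each unordered pair exactly once, which matches the description of inspecting two files against each other; a real inspector would return the imitation parts as well as a degree of imitation.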

If the counterfeit document detection system 1 is built on a network, the document storage unit 3, the imitation inspection unit 4, and the imitation inspection driving unit 5 are provided on the server side; they execute the imitation inspection process in response to a request from a client and return only the result of that process to the requesting client.

The result display unit 6 presents to the user the imitation relationships among the stored document files 7 on the basis of the imitation inspection results held by the imitation inspection driving unit 5; the inspection result display screen that the user actually views corresponds to this unit. More specifically, the result display unit 6 receives the inspection results from the imitation inspection driving unit 5 and displays, as a list in table form, the document file names of the sets of document files 7 that have an imitation relationship together with the degree of imitation for each set. In response to a user operation, it also displays the imitation parts for any set of document files 7 chosen from that list. As another way of displaying the inspection results, for the collection of sets of document files 7 that have an imitation relationship, the unit displays a graph in which each document file 7 in the collection is a node, the nodes of document files 7 that have an imitation relationship are connected by an edge, and the visual features of each edge (for example its length, thickness, or color) are determined on the basis of the degree of imitation between the document files 7.
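The two display forms described above (a table of file pairs with their degree of imitation, and a graph whose edge appearance depends on that degree) could be prepared along the following lines. This is only a sketch under stated assumptions: the source does not define how the degree of imitation is scaled, so a value between 0 and 1 is assumed here, and the `threshold`, the edge-width mapping, and the function names are hypothetical.

```python
def result_table(results, threshold=0.0):
    """Rows (file A, file B, degree) for pairs above the threshold,
    sorted so the strongest imitation relationship comes first."""
    rows = [(a, b, degree) for (a, b), degree in results.items() if degree > threshold]
    return sorted(rows, key=lambda row: row[2], reverse=True)


def graph_edges(rows, max_width=5.0):
    """One edge per related pair; thickness grows linearly with the degree
    (assumed mapping: degree 0 gives width 1.0, degree 1 gives max_width)."""
    return [
        {"nodes": (a, b), "width": round(1.0 + (max_width - 1.0) * degree, 2)}
        for a, b, degree in rows
    ]


# Hypothetical inspection results: degree of imitation assumed to lie in [0, 1]
results = {("r1", "r2"): 0.8, ("r1", "r3"): 0.0, ("r2", "r3"): 0.25}
table = result_table(results)
edges = graph_edges(table)

for a, b, degree in table:
    print(f"{a}\t{b}\t{degree:.2f}")
print(edges)
```

Thickness is used here because it is one of the features the text names; length or color could be derived from the degree in the same way.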

As with the document input interface unit 2, if the counterfeit document detection system 1 is built on a network, the result display unit 6 corresponds to a client operated by the user and includes data communication means for downloading the inspection result data over the network from the imitation inspection driving unit 5 housed in the server. On that client, the inspection result display screen is provided to the user by, for example, a dedicated application or a Web page.

A specific example of the imitation inspection algorithm implemented in the imitation inspection unit 4 is now described. The imitation inspection unit 4 uses a similar-string analysis technique to detect similar character strings that exist between documents. Besides the n-gram analyses described below, other well-known similarity analysis techniques, such as the vector space method, are also applicable. Each of these techniques has its own strengths and weaknesses, so the user may be allowed to select one at inspection time.

First, n-gram analysis (character n-gram analysis) is described. This is the most basic form of n-gram analysis: matches are detected using n-grams whose unit is the character. A concrete example follows. Suppose there is a document A, "あいうえおかきくけこさしすあいうえおかき", and a document B, "うえおかきくけたちつ". The 5-grams of each document are as shown in FIG. 4, where the number in parentheses is the position of the first character of the n-gram. Finding the mutually matching pairs among them yields four pairs: A(03)-B(01), A(04)-B(02), A(05)-B(03), and A(16)-B(01).
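The worked example above can be reproduced with a few lines of code. The sketch below is illustrative only; the function names `char_ngrams` and `matching_pairs` are not from the source, and positions are 1-based to match the A(03)-B(01) notation.

```python
def char_ngrams(text, n=5):
    """Return (position, n-gram) pairs; positions are 1-based."""
    return [(i + 1, text[i:i + n]) for i in range(len(text) - n + 1)]


def matching_pairs(doc_a, doc_b, n=5):
    """Compare every n-gram of A with every n-gram of B (brute force)
    and return the (position in A, position in B) of each match."""
    grams_a = char_ngrams(doc_a, n)
    grams_b = char_ngrams(doc_b, n)
    return [(pa, pb) for pa, ga in grams_a for pb, gb in grams_b if ga == gb]


DOC_A = "あいうえおかきくけこさしすあいうえおかき"
DOC_B = "うえおかきくけたちつ"

pairs = matching_pairs(DOC_A, DOC_B)
print(pairs)  # [(3, 1), (4, 2), (5, 3), (16, 1)]
```

Document A has 20 characters and therefore 16 five-grams; document B has 10 characters and 6 five-grams, so the brute-force comparison examines 96 combinations to find the four matches.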

From this, the matrix of match detection results is as shown in FIG. 5. In the figure, document A is laid out horizontally and document B vertically, and the elements corresponding to the matching pairs are set to 1. Positions that are not 1 actually hold the value 0, but the zeros are omitted from the figure. Basically, the n-grams corresponding to positions that hold a 1 are detected as similar parts, that is, imitation parts; alternatively, only the sets of n-grams corresponding to places where at least a predetermined number of 1s line up consecutively along a diagonal may be detected as imitation parts. This match detection can be carried out by comparing every n-gram entry of document A with every n-gram entry of document B (a brute-force comparison).
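A match matrix of this kind, and the detection of diagonal runs of 1s, might be sketched as follows. The orientation (rows for document B, columns for document A) follows the description of FIG. 5; the function names and the `min_len` parameter, which stands in for the "predetermined number", are assumptions for illustration.

```python
def match_matrix(doc_a, doc_b, n=5):
    """Rows index the n-grams of B, columns index the n-grams of A;
    an element is 1 where the two n-grams are identical, else 0."""
    grams_a = [doc_a[i:i + n] for i in range(len(doc_a) - n + 1)]
    grams_b = [doc_b[j:j + n] for j in range(len(doc_b) - n + 1)]
    return [[1 if ga == gb else 0 for ga in grams_a] for gb in grams_b]


def diagonal_runs(matrix, min_len=2):
    """Return (A position, B position, length) for each maximal diagonal
    run of 1s that is at least min_len long; positions are 1-based."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    runs = []
    for r in range(rows):
        for c in range(cols):
            if not matrix[r][c]:
                continue
            if r > 0 and c > 0 and matrix[r - 1][c - 1]:
                continue  # this cell continues a run that started earlier
            length = 0
            while (r + length < rows and c + length < cols
                   and matrix[r + length][c + length]):
                length += 1
            if length >= min_len:
                runs.append((c + 1, r + 1, length))
    return runs


DOC_A = "あいうえおかきくけこさしすあいうえおかき"
DOC_B = "うえおかきくけたちつ"
matrix = match_matrix(DOC_A, DOC_B)
print(diagonal_runs(matrix, min_len=2))  # [(3, 1, 3)]
print(diagonal_runs(matrix, min_len=1))  # [(3, 1, 3), (16, 1, 1)]
```

With `min_len=2`, the isolated match A(16)-B(01) is dropped and only the run of three consecutive 5-grams starting at A(03)-B(01) remains, which corresponds to the shared seven-character string "うえおかきくけ".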

A more efficient processing method is possible, however. The individual n-gram entries are sorted in dictionary order, and each entry holds, as a list, the position information of all occurrences of that same entry. The result is the list shown in FIG. 6, which is called an inverted list. In this example the n-gram entries were already almost in dictionary order (because the original document happened to have such content), so the list appears little changed; for a real document, however, it would look quite different from the plain enumeration of n-grams described above.

Because the entries of an inverted list are sorted in dictionary order, finding the entries common to document A and document B does not require a brute-force comparison. Instead, one pointer is used for document A and one for document B; the entries at the two pointer positions are compared, and the pointer indicating the smaller entry is advanced, which is enough to detect the matching entries. If document A has N entries and document B has M entries, the brute-force method requires N×M comparisons. If the inverted lists have N′ and M′ entries respectively, the number of comparisons with this method is only N′+M′, so the number of comparisons (that is, the amount of processing) is greatly reduced, while the result obtained is of course the same.
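The inverted list and the two-pointer comparison can be sketched as below. This is an illustrative sketch rather than the implementation in the patent: the function names are invented, and "dictionary order" is approximated by Python's default string ordering (Unicode code point order), which is an assumption.

```python
from collections import defaultdict


def inverted_list(text, n=5):
    """Sorted list of (n-gram, [1-based positions]) entries,
    one entry per distinct n-gram."""
    positions = defaultdict(list)
    for i in range(len(text) - n + 1):
        positions[text[i:i + n]].append(i + 1)
    return sorted(positions.items())


def merge_matches(inv_a, inv_b):
    """Walk both sorted lists with one pointer each, advancing the pointer
    at the smaller entry; at most len(inv_a) + len(inv_b) steps are needed."""
    i = j = 0
    pairs = []
    while i < len(inv_a) and j < len(inv_b):
        gram_a, pos_a = inv_a[i]
        gram_b, pos_b = inv_b[j]
        if gram_a == gram_b:
            # every occurrence in A matches every occurrence in B
            pairs.extend((pa, pb) for pa in pos_a for pb in pos_b)
            i += 1
            j += 1
        elif gram_a < gram_b:
            i += 1
        else:
            j += 1
    return sorted(pairs)


DOC_A = "あいうえおかきくけこさしすあいうえおかき"
DOC_B = "うえおかきくけたちつ"
inv_a = inverted_list(DOC_A)
inv_b = inverted_list(DOC_B)
print(len(inv_a), len(inv_b))       # 13 6
print(merge_matches(inv_a, inv_b))  # [(3, 1), (4, 2), (5, 3), (16, 1)]
```

For the example documents the inverted lists have 13 and 6 entries, so the merge needs at most 19 steps instead of the 96 brute-force comparisons, and it yields the same four matching pairs.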

Next, word n-gram analysis is described as another form of n-gram analysis. In this technique, words (the content words obtained by morphological analysis) are used instead of characters. Suppose document A is "柿食えば鐘が鳴るなり法隆寺" and document B is "隣の客は良く柿食う客だ". The lists of content words obtained by morphological analysis (which vary with the morphological analyzer) are <柿> <食う> <鐘> <鳴る> <法隆寺> for document A and <隣> <客> <良い> <柿> <食う> <客> for document B. Taking word 2-grams of these gives the result shown in FIG. 7. Checking for matches in the figure yields A(01)-B(04). The matrix of match detection results takes the same form as for character n-grams, and the match detection algorithm is also the same.
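The word 2-gram example can be checked in the same way as the character case. In the sketch below, the content-word lists are taken directly from the text, so no morphological analyzer is invoked; the function name `word_ngrams` is an assumption for illustration.

```python
def word_ngrams(words, n=2):
    """Return (position, n-gram) pairs over a word list; positions are 1-based."""
    return [(i + 1, tuple(words[i:i + n])) for i in range(len(words) - n + 1)]


# Content words as listed in the description (the output of a morphological
# analyzer, which may differ from one analyzer to another)
WORDS_A = ["柿", "食う", "鐘", "鳴る", "法隆寺"]
WORDS_B = ["隣", "客", "良い", "柿", "食う", "客"]

word_pairs = [
    (pa, pb)
    for pa, ga in word_ngrams(WORDS_A)
    for pb, gb in word_ngrams(WORDS_B)
    if ga == gb
]
print(word_pairs)  # [(1, 4)]
```

The single match (1, 4) corresponds to A(01)-B(04), the shared word pair <柿> <食う>. Because the words are reduced to their base forms, "食えば" and "食う" match even though the surface strings differ, which a character n-gram would not capture.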

The operation of the counterfeit document detection system 1 configured as described above is explained below, together with how it is used. Here, the counterfeit document detection system 1 is assumed to be a client-server system operating over a network, and imitation inspection is performed on reports serving as the document files 7.

The user prepares the reports to be analyzed and uses a Web browser or the like on the client to access the operation screen page provided by the document input interface unit 2. Through this page, the user uploads the reports to the document storage unit 3 on the server and starts the analysis. At this point the user also selects the classification to be analyzed and compared, the analysis method, and so on.

On receiving the analysis request from the client, the imitation inspection driving unit 5 on the server sequentially takes out pairs of reports to be compared from the reports stored in the document storage unit 3 and inputs each pair to the imitation inspection unit 4. Following the imitation inspection algorithm described above, the imitation inspection unit 4 inspects the two input reports for mutually similar (imitated) portions and outputs the detected imitated portions and their degree of imitation (similarity) to the imitation inspection driving unit 5 as the inspection result. The inspection result is passed from the imitation inspection driving unit 5 to the result display unit 6, which embeds it in a Web page, and the analysis result display page is then displayed on the client.
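The role of the imitation inspection driving unit 5 (take out every pair, pass it to the inspection unit, keep the result) might be sketched as below. All names are hypothetical, and `shared_bigram_count` is only a stand-in inspector based on shared character 2-grams, not the inspection algorithm of the patent.

```python
from itertools import combinations


def shared_bigram_count(text_a, text_b):
    """Stand-in inspector: number of distinct character 2-grams in common."""
    def bigrams(text):
        return {text[i:i + 2] for i in range(len(text) - 1)}
    return len(bigrams(text_a) & bigrams(text_b))


def run_inspection(reports, inspect):
    """Apply `inspect` to every unordered pair of stored reports.

    `reports` maps a file name to its text; the returned dict maps
    (name_a, name_b) to whatever `inspect` returns for that pair.
    """
    results = {}
    for name_a, name_b in combinations(sorted(reports), 2):
        results[(name_a, name_b)] = inspect(reports[name_a], reports[name_b])
    return results


reports = {"r1": "abcd", "r2": "bcde", "r3": "xyz"}
results = run_inspection(reports, shared_bigram_count)
```

For K stored reports this visits K(K-1)/2 pairs, which is one reason the text later lets the user restrict inspection to a chosen classification.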

FIG. 8 shows the analysis result display page 10. In the lower part of the page, a list 11 of the report pairs in which imitation was detected is displayed. From left to right, the list 11 shows the file name of one report (file name 1), the file name of the other report (file name 2), the longest match count, the total match count, and a link 16 to the detailed display ("See details"). These items may be changed as appropriate to suit the analysis results. Clicking an analysis method link 12 at the top of the page switches the list 11 to the results for that analysis method, so that results can be listed per method. The list box 13 specifies which analysis results are listed in the list 11: the display type can be "longest match count", "total match count", or "originality score of file 1", and for the chosen display type the user can further specify "number of items", "numeric value", or "all items" to control which results appear in the list 11. The specification is confirmed by clicking the redisplay button 14, whereupon the list 11 is updated and redisplayed. The originality score of each report is calculated by the formula (1 − (number of matched grams / total number of grams)) × 100. The calculated originality score is shown, for example, in red after the file name in the list 11. This formula is only one example used in the present embodiment, and the originality score in the present invention is not limited to it.
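The originality formula given above, (1 − (number of matched grams / total number of grams)) × 100, can be written directly as a small function. The function and parameter names are hypothetical; rejecting a non-positive total is an assumption of this sketch, since the text does not say how an empty document is handled.

```python
def originality_score(matched_grams, total_grams):
    """Originality in percent: (1 - matched / total) * 100."""
    if total_grams <= 0:
        raise ValueError("total_grams must be positive")
    return (1 - matched_grams / total_grams) * 100


print(originality_score(25, 100))  # 75.0
```

A report in which no gram matches scores 100, and a report in which every gram matches scores 0.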

Clicking the link 16 displays the detailed display page shown in FIG. 9. On this page, the contents of the two document files 7 corresponding to the pair of file name 1 and file name 2 in the list 11 are shown side by side in left and right panes. Matching portions (imitated portions) are highlighted in color. Each matching portion is numbered, and the same number on the left and right marks corresponding portions. Clicking a number jumps, via an in-page link, to the corresponding portion. The longest matching portion is shown in bold.

Clicking the graph display button 15 labeled "Group and display as image" in FIG. 8 displays a graph display page such as that of FIG. 10, which illustrates the correlations in the list 11 displayed in FIG. 8. For the set of pairs of document files 7 having an imitation relationship, the correlation graph 20 takes each document file 7 in the set as a node, connects the nodes of document files 7 that have an imitation relationship with an edge, determines visual features of the edge such as its length, thickness, and color from the degree of imitation between those document files 7, and displays the result as a graph. The symbols and values shown in the correlation graph 20 may be drawn, for example, as follows: a dark circle (node) is a file in this folder; a light circle (node) is a file in a folder added to the comparison; a line (edge) is a link between files; red text is a file number; light blue text is the total match count. These may be changed as appropriate according to the parameters representing the correlation between the document files 7 and design considerations.
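As a rough sketch of how such a correlation graph might be assembled from pairwise results, the code below builds a node list and an edge list in plain Python. The function name, the dictionary layout, and the linear mapping from match count to edge width are all assumptions made for illustration; the text only requires that some visual feature of the edge reflect the degree of imitation.

```python
def build_correlation_graph(pair_results, max_width=5.0):
    """Turn {(file_a, file_b): match_count} into nodes and weighted edges.

    Pairs with no matches get no edge. Edge width scales linearly with
    the match count, so the strongest pair is drawn at `max_width`.
    """
    strongest = max(pair_results.values(), default=0)
    nodes = set()
    edges = []
    for (file_a, file_b), match_count in pair_results.items():
        if match_count <= 0:
            continue  # no imitation relationship, so no edge
        nodes.update((file_a, file_b))
        edges.append({
            "source": file_a,
            "target": file_b,
            "matches": match_count,
            "width": max_width * match_count / strongest,
        })
    return sorted(nodes), edges


nodes, edges = build_correlation_graph(
    {("a.txt", "b.txt"): 10, ("a.txt", "c.txt"): 5, ("b.txt", "c.txt"): 0}
)
```

The resulting lists can be handed to any drawing library; which one the described system uses is not stated in the text.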

As described above, the counterfeit document detection system 1 of the first embodiment comprises: an input interface unit 2 for inputting a plurality of document files 7; a document storage unit 3 that stores the input document files 7; an imitation inspection unit 4 that inspects an input pair of document files for imitated portions and outputs the inspection result; an imitation inspection driving unit 5 that sequentially takes out pairs of document files (document files 7) from the document files stored in the document storage unit 3, inputs them to the imitation inspection unit 4, and holds the inspection result that the imitation inspection unit 4 outputs for each pair; and a result display unit 6 that presents the imitation relationships among the stored document files 7 based on the inspection results held by the imitation inspection driving unit 5.

With this configuration, imitation among the plurality of document files 7 input to the system can be inspected and imitation documents can be detected. A counterfeit document detection system capable of detecting similar portions among a plurality of documents can therefore be provided.

In the counterfeit document detection system 1 of the first embodiment, the document storage unit 3 can classify the document files 7 to be stored, and the imitation inspection driving unit 5, based on the classification in the document storage unit 3 and an instruction from the user, sequentially takes out pairs of document files from the document storage unit 3, inputs them to the imitation inspection unit 4, and holds the inspection result that the imitation inspection unit 4 outputs for each pair.

With this configuration, the user can freely specify the classification to be inspected, and imitation inspection is performed on the document files 7 belonging to that classification. Making the inspection targets selectable by classification therefore improves inspection efficiency.

Furthermore, in the counterfeit document detection system 1 of the first embodiment, the result display unit 6 displays, as a list in tabular form, the document file names of the pairs of document files having an imitation relationship together with the degree of imitation for each pair, and displays the imitated portions for any pair of document files in the list.

With this configuration, the document files 7 having an imitation relationship can be seen at a glance in the list, and exactly which portions of a document file 7 are imitated can be checked. The imitation inspection results can therefore be presented to the user in an easily understood form.

In the counterfeit document detection system 1 of the first embodiment, the result display unit 6 also takes, for the set of pairs of document files having an imitation relationship, each document file in the set as a node, connects with an edge the nodes of document files between which an imitation relationship exists, determines the visual features of the edge from the degree of imitation between those document files, and displays the result as a graph.

With this configuration, the inspection results for the document files are rendered as a graph, so the imitation relationships among the document files become visually clear and the user can readily understand the inspection results. The imitation inspection results can therefore be presented to the user visually and more clearly.

The first embodiment can also be realized as a program that causes a computer to function as the counterfeit document detection system 1.

FIG. 2 is a system configuration diagram of the counterfeit document detection system in the second embodiment. The counterfeit document detection system 51 comprises an input interface unit 2, a document storage unit 3, a document search unit 57, a search result document storage unit 59, an imitation inspection unit 4, an imitation inspection driving unit 55, and a result display unit 56. The input interface unit 2, the document storage unit 3, and the imitation inspection unit 4 are configured substantially as in the first embodiment, except that in the second embodiment a single document file 7 is input to the document input interface unit 2. This is because the counterfeit document detection system 51 of the second embodiment performs imitation inspection between the document file 7 under inspection and an unspecified number of system external document files 58, which are search result document files retrieved using, for example, an Internet search engine. This does not, of course, limit the number of documents that can be input through the document input interface unit 2 to one. When a plurality of documents are input, they are stored in the document storage unit 3 and then taken out one at a time, and the operation described in this embodiment is repeated for each.

Based on the text of the input document file 7, the document search unit 57 searches for system external document files 58, that is, electronically accessible documents outside the system whose content is related to that document, and outputs the resulting group of documents. Examples of system external document files 58 include Web pages and electronic document files published on the Internet. Strictly speaking, such a Web page or the like is saved as a system external document file 58 after it has been given a suitable identifier in the search result document storage unit 59 described later. The document search unit 57 can be realized in various configurations by combining existing techniques. In one possible configuration, Web pages on the Internet are collected by following the link information written in them, the similarity between each collected page and the input document file 7 is calculated from the occurrence frequencies of the words that the two documents have in common, and the Web pages with high similarity are output as the result. In another, a group of words representing the subject of the document is extracted from the text of the input document file 7 on the basis of occurrence frequency, a full-text Web search service offered on the Internet is queried with those words as search terms, and the results returned by that search engine are output as the search results of the document search unit 57.
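The second configuration, choosing search terms by occurrence frequency, might look like the following sketch. It assumes the text has already been split into words (for Japanese, by a morphological analyzer); the function name, the `top_k` cutoff, and the stopword set are hypothetical and not taken from the patent. Sending the terms to a search service is left out, since the text does not name one.

```python
from collections import Counter


def extract_search_terms(words, top_k=3, stopwords=frozenset()):
    """Return the `top_k` most frequent words, ignoring any stopwords.

    Ties keep the order in which the words first appear.
    """
    counts = Counter(word for word in words if word not in stopwords)
    return [word for word, _ in counts.most_common(top_k)]


tokens = ["ngram", "search", "ngram", "index", "search", "ngram", "the", "the"]
terms = extract_search_terms(tokens, top_k=2, stopwords=frozenset({"the"}))
print(terms)  # ['ngram', 'search']
```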

The search result document storage unit 59 stores the system external document files 58, that is, the group of documents obtained as search results by the document search unit 57, after assigning to each document an identifier by which it can be distinguished. It corresponds, for example, to a database held on a storage device such as a hard disk. Through the search result document storage unit 59, the system external document files 58 are saved and managed within the system. In particular, the search result document storage unit 59 is configured so that, when saving the system external document files 58, it can classify them according to instructions from the user.

The imitation inspection driving unit 55 takes out the system external document files 58 one at a time from the search result document storage unit 59, pairs each with the document file 7 that was input and stored in the document storage unit 3, inputs the pair (the document file 7 and a system external document file 58) to the imitation inspection unit 4, and holds the inspection result that the imitation inspection unit 4 outputs for the pair.

The result display unit 56 presents to the user the imitation relationships between the stored document file 7 and the system external document files 58, based on the imitation inspection results held by the imitation inspection driving unit 55; it corresponds to the inspection result display screen and the like that the user actually views. More specifically, the result display unit 56 displays, as a list in tabular form, the document file identifiers of the system external document files 58 that have an imitation relationship with the input document file 7, together with the degree of imitation for each of them, and, in response to a user operation, displays the portions imitated between the input document file 7 and any system external document file 58 selected from the list of identifiers. As another display mode, for the set of pairs of a document file 7 and a system external document file 58 having an imitation relationship, each document file 7 and system external document file 58 in the set is taken as a node; where an imitation relationship exists between a document file 7 and a system external document file 58, their nodes are connected by an edge; visual features of the edge, such as its length, thickness, and color, are determined from the degree of imitation between them; and the result is displayed as a graph.

As described above, the counterfeit document detection system 51 of the second embodiment comprises: an input interface unit 2 for inputting a document file 7; a document storage unit 3 that stores the document file 7 input through the document input interface unit 2; a document search unit 57 that generates a group of search terms from the text of the input document file 7, searches electronically accessible documents outside the system using the generated search terms, and outputs the group of documents found; a search result document storage unit 59 that stores the group of documents obtained by inputting the input document file 7 to the document search unit 57, after assigning each document a distinguishing identifier, as system external document files 58 corresponding to search result document files; an imitation inspection unit 4 that inspects an input pair of document files for imitated portions and outputs the inspection result; an imitation inspection driving unit 55 that sequentially takes out the system external document files 58 from the search result document storage unit 59, pairs each with the document file 7 that was input and stored in the document storage unit 3, inputs the pair (document file 7, system external document file 58) to the imitation inspection unit 4, and holds the inspection result that the imitation inspection unit 4 outputs for the pair; and a result display unit 56 that presents the imitation relationships between the input document file 7 and the system external document files 58 found by the document search unit 57, based on the inspection results held by the imitation inspection driving unit 55.

With this configuration, imitation is inspected between the document file 7 under inspection that was input to the system and an unspecified number of related documents retrieved from outside the system, so whether the input document file 7 is an imitation document can be judged over a wider range of documents. A counterfeit document detection system capable of detecting imitation over a wider range of documents, using related documents outside the system, can therefore be provided.

In the counterfeit document detection system 51 of the second embodiment, the result display unit 56 displays, as a list in tabular form, the document file identifiers of the system external document files 58 that have an imitation relationship with the input document file 7, together with the degree of imitation for each of them, and displays the portions imitated between the input document file 7 and any system external document file 58 in the list of identifiers.

With this configuration, the document files having an imitation relationship can be seen at a glance in the list, and exactly which portions of a document file are imitated can be checked. The imitation inspection results can therefore be presented to the user in an easily understood form.

FIG. 3 is a system configuration diagram of the counterfeit document detection system in the third embodiment. The third embodiment provides a system that combines the first and second embodiments. That is, the counterfeit document detection system 61 comprises an input interface unit 2, a document storage unit 3, a document search unit 57, a search result document storage unit 59, an imitation inspection unit 4, an imitation inspection driving unit 65, and a result display unit 66. The input interface unit 2, the document storage unit 3, and the imitation inspection unit 4 are configured substantially as in the first embodiment, and the document search unit 57 and the search result document storage unit 59 substantially as in the second embodiment.

The imitation inspection driving unit 65 sequentially takes out pairs of document files (either a pair of document files 7, or a document file 7 paired with a system external document file 58) from the document files stored in the document storage unit 3 (document files 7) and the documents stored in the search result document storage unit 59 (system external document files 58), inputs each pair to the imitation inspection unit 4, and holds the inspection result that the imitation inspection unit 4 outputs for the pair.
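The two kinds of pairs handled by the imitation inspection driving unit 65 can be enumerated as in the sketch below. The function name and the identifiers are hypothetical. Following the wording of the text, pairs consisting of two system external documents are not generated.

```python
from itertools import combinations, product


def enumerate_pairs(internal_ids, external_ids):
    """Yield every pair to inspect in the combined (third) embodiment.

    First all unordered pairs of input documents, then each input
    document paired with each external (search result) document.
    """
    yield from combinations(internal_ids, 2)
    yield from product(internal_ids, external_ids)


pairs = list(enumerate_pairs(["d1", "d2"], ["w1"]))
print(pairs)  # [('d1', 'd2'), ('d1', 'w1'), ('d2', 'w1')]
```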

The result display unit 66 presents to the user the imitation relationships among the stored document files 7 and between the document files 7 and the system external document files 58, based on the imitation inspection results held by the imitation inspection driving unit 65; it corresponds to the inspection result display screen and the like that the user actually views. More specifically, the result display unit 66 displays, as a list in tabular form, the document file names or document file identifiers of the pairs of document files having an imitation relationship, together with the degree of imitation for each pair, and, in response to a user operation, displays the imitated portions for any pair of document files in the list. As another display mode, for the set of pairs of document files having an imitation relationship, each document file 7 and system external document file 58 in the set is taken as a node; where an imitation relationship exists within a pair, the nodes of the document file 7 and system external document file 58 concerned are connected by an edge; visual features of the edge, such as its length, thickness, and color, are determined from the degree of imitation within the pair; and the result is displayed as a graph.

As described above, the counterfeit document detection system 61 of the third embodiment comprises: an input interface unit 2 for inputting a plurality of document files 7; a document storage unit 3 that stores the input document files 7; a document search unit 57 that generates a group of search terms from the text of an input document file 7, searches external electronically accessible documents using the generated search terms, and outputs the group of documents found; a search result document storage unit 59 that, for each document file 7 stored in the document storage unit 3, stores the group of documents obtained by inputting that document file 7 to the document search unit 57, after assigning each document a distinguishing identifier, as system external document files 58 corresponding to search result document files; an imitation inspection unit 4 that inspects an input pair of document files for imitated portions and outputs the inspection result; an imitation inspection driving unit 65 that sequentially takes out pairs of document files (document files 7, system external document files 58) from the document files stored in the document storage unit 3 and the search result document files stored in the search result document storage unit 59, inputs each pair to the imitation inspection unit 4, and holds the inspection result that the imitation inspection unit 4 outputs for the pair; and a result display unit 66 that presents the imitation relationships among the stored document files 7, and between the input document files 7 and the system external document files 58 found by the document search unit 57, based on the inspection results held by the imitation inspection driving unit 65.

With this configuration, imitation is inspected among the plurality of document files input to the system and also between each document file 7 input to the system and an unspecified number of related documents retrieved from outside the system, so whether an input document file 7 is an imitation document can be judged over a still wider range of documents. A counterfeit document detection system can therefore be provided that detects similar portions among a plurality of documents and also detects imitation over a wider range of documents using related documents outside the system.

In the counterfeit document detection system 61 of the third embodiment, the document storage unit 3 and the search result document storage unit 59 can classify the document files 7 and the system external document files 58 to be stored, and the imitation inspection driving unit 65, based on the classification in the document storage unit 3 and the search result document storage unit 59 and an instruction from the user, sequentially takes out pairs of document files from the document storage unit 3 and the search result document storage unit 59, inputs them to the imitation inspection unit 4, and holds the inspection result that the imitation inspection unit 4 outputs for each pair.

With this configuration, the user can freely specify the classification to be inspected, and imitation inspection is performed on the document files belonging to that classification. Making the inspection targets selectable by classification therefore improves inspection efficiency.

Further, in the counterfeit document detection system 61 of the present embodiment, the result display unit 66 displays, as a list in tabular form, the document file names or document file identifiers of the sets of document files having an imitation relationship together with the imitation degree of each set, and displays the imitated portions for any set of document files selected from the list.
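The tabular list described here could look roughly like the following sketch. It covers only the list of pairs and degrees, not the display of the imitated portions. The function name, the column layout, the descending sort order, and the `threshold` parameter that decides when a pair counts as having an imitation relationship are assumptions for illustration.

```python
def format_result_table(results, threshold=0.0):
    """Return a text table of document pairs and imitation degrees, highest first.

    results maps (name_a, name_b) -> imitation degree in [0, 1].
    Pairs at or below threshold are treated as having no imitation relationship.
    """
    rows = sorted(
        ((a, b, d) for (a, b), d in results.items() if d > threshold),
        key=lambda row: (-row[2], row[0], row[1]),
    )
    lines = [f"{'File A':<12}{'File B':<12}{'Degree':>8}"]
    for a, b, d in rows:
        lines.append(f"{a:<12}{b:<12}{d:>8.2f}")
    return "\n".join(lines)
```

Calling `print(format_result_table(results))` on the dictionary of held inspection results prints one row per pair that exceeds the threshold.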

The present invention is not limited to the embodiments described above and can be modified without departing from the gist of the invention.

FIG. 1 is a block diagram showing the system configuration of the counterfeit document detection system according to the first embodiment of the present invention.
FIG. 2 is a block diagram showing the system configuration of the counterfeit document detection system according to the second embodiment of the present invention.
FIG. 3 is a block diagram showing the system configuration of the counterfeit document detection system according to the third embodiment of the present invention.
FIG. 4 is an explanatory diagram showing the n-gram sets created by character n-gram analysis of documents A and B.
FIG. 5 is an explanatory diagram showing the matrix of match detection results for FIG. 4.
FIG. 6 is an explanatory diagram showing an inverted list in which the n-gram sets of FIG. 4 are sorted in dictionary order.
FIG. 7 is an explanatory diagram showing the n-gram sets created by word n-gram analysis of documents A and B.
FIG. 8 is an explanatory diagram showing the analysis result screen of the counterfeit document detection system according to the first embodiment of the present invention.
FIG. 9 is an explanatory diagram showing the detailed display screen of the analysis results of the same system.
FIG. 10 is an explanatory diagram showing the correlation graph display screen of the analysis results of the same system.

Explanation of symbols

1 Counterfeit document detection system
2 Input interface unit
3 Document storage unit
4 Imitation inspection unit
5 Imitation inspection driving unit
6 Result display unit
7 Document file
51 Counterfeit document detection system
55 Imitation inspection driving unit
56 Result display unit
57 Document search unit
58 System-external document file (search result document file)
59 Search result document storage unit
61 Counterfeit document detection system
65 Imitation inspection driving unit
66 Result display unit

Claims

1. A counterfeit document detection system comprising:
an input interface unit for inputting a plurality of document files;
a document storage unit for storing the input document files;
an imitation inspection unit that inspects an input set of document files for imitated portions and outputs an inspection result;
an imitation inspection driving unit that sequentially extracts sets of document files from the group of document files stored in the document storage unit, inputs each set to the imitation inspection unit, and holds the inspection result that the imitation inspection unit outputs for that set; and
a result display unit that presents imitation relationships between the stored document files based on the inspection results held by the imitation inspection driving unit.

2. The counterfeit document detection system according to claim 1, wherein the document storage unit is capable of classifying the document files to be stored, and the imitation inspection driving unit sequentially extracts the sets of document files from the document storage unit based on the classification in the document storage unit and an instruction from a user, inputs each set to the imitation inspection unit, and holds the inspection result that the imitation inspection unit outputs for that set.

3. The counterfeit document detection system according to claim 1 or 2, wherein the result display unit displays, as a list in tabular form, the document file names of the sets of document files having an imitation relationship together with the imitation degree of each set, and displays the imitated portions for any set of document files in the list.

4. A counterfeit document detection system comprising:
an input interface unit for inputting a document file;
a document storage unit for storing the document file input via the input interface unit;
a document search unit that generates a group of search terms based on the text of an input document file, searches electronically accessible documents outside the system based on the generated search terms, and outputs the resulting group of search result documents;
a search result document storage unit that assigns an identifier to each document in the group of search result documents obtained by inputting the input document file to the document search unit, so that each document can be identified, and stores the documents as search result document files;
an imitation inspection unit that inspects an input set of document files for imitated portions and outputs an inspection result;
an imitation inspection driving unit that sequentially extracts the search result document files from the search result document storage unit, pairs each with the input document file stored in the document storage unit, inputs the resulting set of document files to the imitation inspection unit, and holds the inspection result that the imitation inspection unit outputs for that set; and
a result display unit that presents imitation relationships between the input document file and the search result document files retrieved by the document search unit, based on the inspection results held by the imitation inspection driving unit.

5. The counterfeit document detection system according to claim 4, wherein the result display unit displays, as a list in tabular form, the document file identifiers of the search result document files having an imitation relationship with the input document file together with the imitation degree of each search result document file, and displays, for any search result document file in the list of identifiers, the portions in which it and the input document file imitate each other.

6. A counterfeit document detection system comprising:
an input interface unit for inputting a plurality of document files;
a document storage unit for storing the input document files;
a document search unit that generates a group of search terms based on the text of an input document file, searches electronically accessible documents outside the system based on the generated search terms, and outputs the resulting group of search result documents;
a search result document storage unit that, for each document file stored in the document storage unit, assigns an identifier to each document in the group of search result documents obtained by inputting that document file to the document search unit, so that each document can be identified, and stores the documents as search result document files;
an imitation inspection unit that inspects an input set of document files for imitated portions and outputs an inspection result;
an imitation inspection driving unit that sequentially extracts sets of document files from the group of document files stored in the document storage unit and the group of search result document files stored in the search result document storage unit, inputs each set to the imitation inspection unit, and holds the inspection result that the imitation inspection unit outputs for that set; and
a result display unit that presents imitation relationships between the stored document files, and between the input document files and the search result document files retrieved by the document search unit, based on the inspection results held by the imitation inspection driving unit.

7. The counterfeit document detection system according to claim 6, wherein the document storage unit and the search result document storage unit are capable of classifying the document files and the search result document files to be stored, and the imitation inspection driving unit sequentially extracts the sets of document files from the document storage unit and the search result document storage unit based on the classification in those units and an instruction from a user, inputs each set to the imitation inspection unit, and holds the inspection result that the imitation inspection unit outputs for that set.

8. The counterfeit document detection system according to claim 6, wherein the result display unit displays, as a list in tabular form, the document file names or document file identifiers of the sets of document files having an imitation relationship together with the imitation degree of each set, and displays the imitated portions for any set of document files in the list.

9. The counterfeit document detection system as described, wherein, when there are sets of document files having an imitation relationship, the result display unit represents each document file included in those sets as a node, connects the nodes of document files that have an imitation relationship with a branch, determines the visual features of each branch based on the imitation degree between the document files, and displays the result as a graph.

10. A program for causing a computer to function as the counterfeit document detection system according to any one of claims 1 to 9.