JP2000035964A

JP2000035964A - Relevance calculating apparatus, storage medium storing relevance calculating program, and information retrieval system

Info

Publication number: JP2000035964A
Application number: JP10203470A
Authority: JP
Inventors: Nobuyuki Igata; 伸之井形
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1998-07-17
Filing date: 1998-07-17
Publication date: 2000-02-02

Abstract

(57) [Summary]

[Problem] To provide a relevance calculating apparatus capable of obtaining a relevance that takes co-occurrence relationships into account while suppressing an increase in processing load.

[Solution] A relevance calculating apparatus that calculates the relevance between two pieces of document information by focusing on a plurality of attributes specified by one of them comprises: partial relevance evaluation means 111 for evaluating, for each of the plurality of attributes, a partial relevance indicating the relevance of the two pieces of document information; integration means 112 for accumulating the evaluation values obtained for the attributes; detection means 113 for detecting, among the plurality of attributes, combinations of attributes that are included together in the other piece of document information; and contribution adding means 114 for adding, for each combination of attributes detected by the detection means 113, a predetermined constant to the result accumulated by the integration means 112 as the contribution of the co-occurrence relationship to the relevance.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a relevance calculating apparatus that obtains the relevance between two specified pieces of document information, for use in, for example, an information retrieval system that retrieves desired documents from a large-scale document database.

In recent years, document information has increasingly been digitized and enormous numbers of documents are accumulated in document databases, and full-text search technology has come to be applied to information retrieval systems for such databases. Consequently, depending on the combination of keywords specified in a query, the information retrieval system may output a large number of documents as search results, so the results need to be ranked based on an index indicating how well each retrieved document matches the query. Meanwhile, document creation support systems and the like need the similarity between one part of a text of interest and its other parts as an index for automatically dividing a long text into paragraphs. To obtain such indices, a technique for evaluating the relevance between two pieces of document information quickly and accurately is required.

[0002]

[Prior Art] In an information retrieval system that ranks search results (hereinafter simply called an information retrieval system), the set Q of at least one search character string contained in the search expression specified by the user is treated as one piece of document information and a document i in the document database as the other, and the relevance S_i between them is obtained as the sum of the contributions W_ij of the elements j of the set Q (see equation (1)).

[0003]

[Equation 1]

S_i = \sum_{j \in Q} W_{ij}    (1)

The simplest example of an evaluation function f(i, j) for evaluating this contribution W_ij assigns the value "1" or "0" according to whether or not the search character string j appears in the document i, as shown in equation (2).

[Equation 2]

f(i, j) = 1 if the search character string j appears in the document i, and 0 otherwise    (2)

Various other evaluation functions f(i, j) have been proposed. In general, many of them use statistical information as parameters, such as the frequency of occurrence (text frequency) tf_ij of the search character string j in the document i, and the number of documents (document frequency) df_j that contain the search character string j among all documents registered in the document database.

[0004] A simple example of an evaluation function g(tf_ij, df_j) that uses such statistical information as parameters is, as shown in equation (3), one expressed using the parameter idf_j (inverse document frequency), which is obtained by multiplying the reciprocal of the document frequency df_j by the total number of documents N, that is, idf_j = N / df_j.

[Equation 3]

In actual calculations, the logarithm of each parameter is often used so that the parameters are expressed in the same unit of information.
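The evaluation functions described above can be illustrated with a minimal Python sketch. The image of equation (3) is not reproduced in this text, so the product form tf_ij * idf_j used below is an assumption based on the surrounding description (idf_j = N / df_j); the function names and the sample data are hypothetical.

```python
from collections import Counter


def binary_weight(doc_terms, term):
    """Equation (2): 1 if the search string appears in the document, else 0."""
    return 1 if term in doc_terms else 0


def tf_idf_weight(tf, df, n_docs):
    """Assumed form of equation (3): tf_ij * idf_j, with idf_j = N / df_j."""
    return tf * (n_docs / df)


def relevance(query_terms, doc_terms, df, n_docs):
    """Equation (1): S_i is the sum of the contributions W_ij over the query."""
    counts = Counter(doc_terms)
    return sum(
        tf_idf_weight(counts[t], df[t], n_docs)
        for t in query_terms
        if counts[t] > 0  # strings absent from the document contribute nothing
    )


if __name__ == "__main__":
    # Hypothetical collection statistics: 4 documents, "a" occurs in 2 of them.
    print(relevance(["a", "b"], ["a", "a", "c"], {"a": 2, "b": 1}, 4))
```

A logarithmic variant, as mentioned in the text, would replace the body of `tf_idf_weight` with a product of logarithms; the exact form used in practice is not given here.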

[0005] For details of evaluation functions that use statistical information as parameters, see Justin Zobel and Alistair Moffat, "Similarity Measures Explored," ACM SIGMOD Conference, 1995, and Umino (海野敏), "Principles of Word Weighting Based on Occurrence Frequency Information," Library and Information Science, No. 26, 1988.

FIG. 7 shows a configuration example of an information retrieval system that performs ranking by applying a relevance calculation method, and FIG. 8 is a flowchart of its ranking operation.

[0006] In FIG. 7, a search expression that the user enters by operating the keyboard 401 is passed through the reception processing unit 411 to the search control unit 412, and is then used by the database management system (DBMS) 402 to search the document database 403.

[0007] The arithmetic processing unit 413 shown in FIG. 7 calculates, in response to an instruction from the search control unit 412, the partial relevance W_ij of a document i for the specified search character string j. The integration processing unit 414 receives the resulting partial relevance W_ij from the arithmetic processing unit 413 and, using the buffer 415, accumulates it for each document ID corresponding to each document.

[0008] At this time, the database management system 402 searches the database 403 using the search character string j received from the search control unit 412 as a keyword, and passes the set of document IDs of the documents containing this search character string j to the arithmetic processing unit 413 via the search control unit 412. As shown in step 501 of FIG. 8, the arithmetic processing unit 413 receives the statistical parameters tf_ij and idf_j together with the list of document IDs of the documents containing the search character string j, calculates the partial relevance W_ij for the designated document i using, for example, the evaluation function g(tf_ij, df_j) of equation (3) (step 502), and sends it to the integration processing unit 414 shown in FIG. 7.

[0009] In response, the integration processing unit 414 first determines whether a partial relevance for the document ID corresponding to the received partial relevance W_ij is already held in the buffer 415 (step 503). If the determination in step 503 is affirmative, the newly calculated partial relevance W_ij is added to the value held in the buffer 415 for that document ID (step 504); if it is negative, the partial relevance W_ij is newly stored in the buffer 415 as the value for that document ID (step 505).

[0010] After step 504 or step 505, the search control unit 412 determines in step 506 whether the partial relevance W_ij has been calculated for every document i in the list of document IDs. If not, the process returns to step 502 and the partial relevance W_ij is calculated for the next document i.

[0011] When the partial relevance has been calculated for all documents in the list of document IDs, the determination in step 506 becomes affirmative, and in step 507 the search control unit 412 determines whether all search character strings specified in the search expression have been processed. If the determination in step 507 is negative, the search control unit 412 searches the database 403 through the database management system 402 using a new search character string as the keyword, designates this search character string to the arithmetic processing unit 413, and returns to step 501.

[0012] By repeating steps 501 to 507 in this way until the determination in step 507 becomes affirmative, the partial relevance W_ij of each document i is calculated for each search character string j, and the relevance S_i of each document is obtained as the accumulated result held in the buffer 415 for each document ID. The sort processing unit 416 shown in FIG. 7 then sorts the document IDs based on the relevance S_i stored in the buffer 415 in response to an instruction from the search control unit 412 (step 508), and the ranking process ends.
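The conventional ranking flow of FIG. 8 can be sketched as follows. This is only an illustration under stated assumptions: the `postings` dictionary (search string to {document ID: tf}) is a hypothetical stand-in for the lookup performed by the database management system 402, and the weight tf_ij * (N / df_j) is the assumed form of equation (3).

```python
def rank_conventional(query_terms, postings, n_docs):
    """Accumulate partial relevances per document ID, then sort (FIG. 8)."""
    buffer = {}  # plays the role of buffer 415: document ID -> accumulated value
    for term in query_terms:                    # outer loop, closed by step 507
        hits = postings.get(term, {})           # step 501: IDs and tf for this string
        if not hits:
            continue
        idf = n_docs / len(hits)                # idf_j = N / df_j
        for doc_id, tf in hits.items():         # inner loop, closed by step 506
            w = tf * idf                        # step 502: partial relevance W_ij
            if doc_id in buffer:                # step 503
                buffer[doc_id] += w             # step 504
            else:
                buffer[doc_id] = w              # step 505
    # step 508: sort document IDs by accumulated relevance, highest first
    return sorted(buffer.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    postings = {"a": {1: 1, 2: 3}, "b": {1: 2}}  # hypothetical index
    print(rank_conventional(["a", "b"], postings, n_docs=4))
```

The work done is proportional to the total number of (search string, document) hits, which is why this scheme is described as having a small processing load.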

[0013] In this way, the many documents extracted from the database 403 according to the search expression are ranked by the relevance S, and the sorted result is displayed on the CRT display device (CRT) 405 via the display processing unit 404. This gives the user an index for selecting the documents that appear important from the many documents obtained as search results.

[0014] However, a relevance obtained by simply accumulating statistical information such as the frequency of occurrence of individual search character strings, as in equation (3), does not necessarily agree with the importance that serves as the index when a user subjectively judges whether a retrieved document is important. For this reason, in order to obtain from statistical information a relevance that approximates the importance a human would subjectively assign to a document, an evaluation formula has been proposed that considers, in addition to the frequency of occurrence of each search character string, whether the search character strings forming a combination appear together in the document (hereinafter called a co-occurrence relationship) (Takagi (高木徹) and Kitani (木谷強), "A Study of Document Importance Assignment Using Word Co-occurrence Relationships," IPSJ SIG Technical Report, 96-FI-41, pp. 61-68, 1996).

[0015] When co-occurrence relationships are considered, the relevance S_i between the search expression Q and the document i is expressed, as shown in equation (4), as the sum of two terms: the total of the partial relevances W_ij obtained for each element j of the set J of search character strings contained in the document i, and the total of the partial relevances W_iy obtained for each element y of the set Y of combinations of elements of the set J.

[0016]

[Equation 4]

S_i = \sum_{j \in J} W_{ij} + \sum_{y \in Y} W_{iy}    (4)

Thus, for example, when a search expression Q specifying the search character strings {a, b, c, d} is given and the document i contains the search character strings a, b, and c, using the evaluation function g(tf_ij, df_j) of equation (3) to calculate the partial relevances W_ij gives the relevance S_i between the document i and the search expression Q as shown in equation (5).

[0017]

[Equation 5]

Whether or not the search character strings co-occur in a document is thought to strongly influence the index a user relies on when judging the importance of the document. Evaluating the relevance with co-occurrence relationships taken into account in this way can therefore be expected to enable more effective ranking.
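To make the structure of equation (4) concrete, the following Python sketch enumerates the set Y and adds its contributions to the single-string total. It assumes that Y consists of every combination of two or more elements of J, which the text does not state explicitly, and the per-combination weight function `w_combo` is a hypothetical placeholder.

```python
from itertools import combinations


def cooccurrence_sets(present_terms):
    """Set Y (assumed): all combinations of two or more strings found in the document."""
    terms = sorted(present_terms)
    return [
        combo
        for size in range(2, len(terms) + 1)
        for combo in combinations(terms, size)
    ]


def relevance_eq4(w_single, w_combo):
    """Equation (4): sum of W_ij over J plus sum of W_iy over Y.

    w_single maps each string j in J to W_ij; w_combo(y) returns W_iy.
    """
    first_term = sum(w_single.values())
    second_term = sum(w_combo(y) for y in cooccurrence_sets(w_single))
    return first_term + second_term


if __name__ == "__main__":
    # Document contains a, b, c out of the query {a, b, c, d}.
    print(cooccurrence_sets({"a", "b", "c"}))
    print(relevance_eq4({"a": 1.0, "b": 2.0, "c": 3.0}, lambda y: 1.0))
```

Under this assumption the number of combinations is 2^n - n - 1 for n matching strings (4 for three strings, 1013 for ten), which gives a sense of why evaluating every co-occurrence term individually can become expensive.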

[0018]

[Problems to Be Solved by the Invention] As described above, when the relevance is obtained from statistical information about the individual search character strings, the processing load of the relevance calculation is small, but an effective ranking may not be obtained. For example, suppose that documents 1 to 4 shown in FIG. 9(a) are retrieved from the database in response to a search expression specifying the search character strings {hay fever, effect, medicine}. If the relevance is calculated using the statistical parameters shown in FIG. 9(b) and the evaluation function of equation (3), the very high frequency of the search character string {medicine} in document 4 leads to the result that document 4 has the highest relevance to the search expression, contrary to subjective judgment.

[0019] A similar case is a document that contains a long list of company names, such as a stock market column. When a full-text search is performed on a database holding a large volume of newspaper articles and the like, parts such as the stock column are retrieved as highly ranked documents, and no useful index can be provided to the user. If, on the other hand, the relevance is evaluated with co-occurrence relationships taken into account, a document containing several of the search character strings can be rated higher than a document that contains only one search character string at a high frequency.

[0020] On the other hand, evaluating the contribution of the co-occurrence relationships increases the processing load of the relevance calculation, and the search as a whole takes longer. An object of the present invention is to provide a relevance calculating apparatus capable of obtaining a relevance that takes co-occurrence relationships into account while suppressing an increase in processing load, a storage medium storing a relevance calculating program, and an information retrieval system equipped with the relevance calculating apparatus.

[0021]

[Means for Solving the Problems] FIG. 1 is a block diagram showing the principle of the relevance calculating apparatus according to claims 1 and 2. The invention of claim 1 is a relevance calculating apparatus that calculates the relevance between two pieces of document information by focusing on a plurality of attributes specified by one of them, comprising: partial relevance evaluation means 111 for evaluating, for each of the plurality of attributes, a partial relevance indicating the relevance of the two pieces of document information; integration means 112 for accumulating the evaluation values obtained for the attributes; detection means 113 for detecting, among the plurality of attributes, combinations of attributes that are included together in the other piece of document information; and contribution adding means 114 for adding, for each combination of attributes detected by the detection means 113, a predetermined constant to the result accumulated by the integration means 112 as the contribution of the co-occurrence relationship to the relevance.

[0022] In the invention of claim 1, the contribution adding means 114 operates according to the detection result of the detection means 113, so that a predetermined constant is added, as the evaluation value of each co-occurrence relationship that appears, to the sum of the partial relevances for the individual attributes obtained by the partial relevance evaluation means 111 and the integration means 112. A relevance that takes co-occurrence relationships into account can thus be obtained with a small amount of computation.

[0023] The invention of claim 2 is a relevance calculating apparatus that calculates the relevance between two pieces of document information by focusing on a plurality of attributes specified by one of them, comprising: partial relevance evaluation means 111 for evaluating, for each of the plurality of attributes, a partial relevance indicating the relevance of the two pieces of document information; integration means 112 for accumulating the evaluation values obtained for the attributes; detection means 113 for detecting, among the plurality of attributes, combinations of attributes that are included together in the other piece of document information; and co-occurrence relationship evaluation means 115 for evaluating, for each combination of attributes detected by the detection means 113, an approximate value using a function of the partial relevances obtained by the partial relevance evaluation means 111 for the co-occurring attributes, and adding this approximate value to the result accumulated by the integration means 112.

[0024] In the invention of claim 2, the co-occurrence relationship evaluation means 115 operates according to the detection result of the detection means 113, so that the contribution of each co-occurrence relationship is evaluated using a function of the partial relevances of the attributes and added to the sum of the partial relevances for the individual attributes obtained by the partial relevance evaluation means 111 and the integration means 112. In this case the contribution of a co-occurrence relationship can be evaluated with the statistical characteristics of the co-occurring attributes taken into account, so statistical information such as the frequency of occurrence of the co-occurrence relationship is reflected in the relevance and the relevance can be evaluated more precisely.

[0025] The invention of claim 3 is a storage medium storing a relevance calculating program that calculates the relevance between two pieces of document information by focusing on a plurality of attributes specified by one of them, the program causing a computer to execute: a partial relevance evaluation procedure for evaluating, for each of the plurality of attributes, a partial relevance indicating the relevance of the two pieces of document information; an integration procedure for accumulating the evaluation values obtained for the attributes; a detection procedure for detecting, among the plurality of attributes, combinations of attributes that are included together in the other piece of document information; and a co-occurrence relationship evaluation procedure for evaluating, for each combination of attributes detected in the detection procedure, an approximate value using a function of the partial relevances obtained in the partial relevance evaluation procedure for the co-occurring attributes, and adding this approximate value to the result accumulated by the integration procedure.

[0026] In the invention of claim 3, the co-occurrence relationship evaluation procedure is executed according to the detection result of the detection procedure, so that the contribution of each co-occurrence relationship is evaluated using a function of the partial relevances of the attributes and added to the sum of the partial relevances for the individual attributes obtained by the partial relevance evaluation procedure and the integration procedure. In this case the contribution of a co-occurrence relationship can be evaluated with the statistical characteristics of the co-occurring attributes taken into account, so statistical information such as the frequency of occurrence of the co-occurrence relationship is reflected in the relevance and the relevance can be evaluated more precisely.

[0027] FIG. 2 is a block diagram showing the principle of the information retrieval system according to claim 4. The invention of claim 4 is an information retrieval system comprising ranking processing means 103 that, in response to the input of a search expression containing a plurality of search character strings, assigns to the set of documents retrieved from the document database 101 by the search means 102 a rank according to their relevance to the search expression, wherein the ranking processing means 103 comprises: sorting means 121 that receives the plurality of search character strings and outputs them rearranged in ascending order of document frequency, that is, the number of documents stored in the document database 101 that contain each string; partial relevance calculating means 122 that receives the search character strings one at a time from the sorting means 121 and calculates, for each document retrieved by the search means 102 as containing that search character string, the partial relevance indicating the relevance between the document and the search expression with respect to that search character string; tallying means 123 that accumulates, for each document, the partial relevances obtained by the partial relevance calculating means 122; co-occurrence relationship detecting means 124 that, each time the partial relevance calculating means 122 calculates the partial relevances for a new search character string, detects for each relevant document a co-occurrence relationship between this search character string and any of the search character strings whose contributions have already been tallied; contribution adding means 125 that, each time the co-occurrence relationship detecting means 124 detects a co-occurrence relationship, multiplies the accumulated result obtained by the tallying means 123 for the relevant document by a predetermined constant and takes the product as the tallied result including the contribution of the detected co-occurrence relationship; and ranking means 126 that ranks the documents for which the relevance to the search expression has been calculated, based on the results tallied by the tallying means 123, and supplies the ranking to the search result output processing.

[0028] In the invention of claim 4, the partial relevance calculating means 122 and the tallying means 123 operate in the order determined by the sorting means 121, and the contribution adding means 125 operates each time the co-occurrence relationship detecting means 124 detects a co-occurrence relationship in the course of this processing. As a result, a relevance including the contribution of co-occurrence relationships can be calculated by simple processing, in accordance with an approximation that holds when a predetermined model is valid, and supplied to the processing by the ranking means 126.
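The processing order described for claim 4 can be sketched in Python as below. Several points are assumptions made only for illustration: the text does not say whether the multiplication is applied before or after the new partial relevance is added (it is applied after here), the value of `factor` is arbitrary, `postings` is a hypothetical index of the form {search string: {document ID: tf}}, and the weight tf * (N / df) is the assumed form of equation (3).

```python
def rank_multiplicative(query_terms, postings, n_docs, factor=1.5):
    """Sketch of the claim 4 flow: rare strings first, multiply on co-occurrence."""
    # Sorting means 121: ascending document frequency (rarest string first).
    ordered = sorted(
        (t for t in query_terms if postings.get(t)),
        key=lambda t: len(postings[t]),
    )
    totals = {}  # tallying means 123: document ID -> accumulated result
    for term in ordered:
        hits = postings[term]
        idf = n_docs / len(hits)
        for doc_id, tf in hits.items():
            # Detecting means 124: the document was already scored for an
            # earlier string, so the two strings co-occur in it.
            cooccurs = doc_id in totals
            totals[doc_id] = totals.get(doc_id, 0.0) + tf * idf  # means 122, 123
            if cooccurs:
                totals[doc_id] *= factor  # contribution adding means 125
    # Ranking means 126: order documents by the tallied result.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    postings = {"a": {1: 1, 2: 3}, "b": {1: 2}}  # hypothetical index
    print(rank_multiplicative(["a", "b"], postings, n_docs=4, factor=2.0))
```

Because the co-occurrence check is a single dictionary lookup per hit, the cost stays proportional to the number of hits rather than to the number of string combinations.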

[0029]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 3 shows the configuration of an information retrieval system to which the relevance calculating apparatus of claim 1 is applied. This system adds an addition processing unit 211 to the conventional information retrieval system of FIG. 7. In response to an instruction from the integration processing unit 414, the addition processing unit 211 adds a predetermined constant C1 to the partial relevance of the relevant document in the buffer 415 as the contribution of a co-occurrence relationship.

[0030] FIG. 4 is a flowchart of the ranking operation in this information retrieval system. As in steps 501 and 502 of FIG. 8, the arithmetic processing unit 413 receives the document IDs via the search control unit 412, operates as the partial relevance evaluation means 111 described in claim 1, and calculates the partial relevance W_ij of the document i indicated by each document ID for the search character string j (steps 301 and 302).

[0031] Upon receiving this partial relevance W_ij, the integration processing unit 414 determines, as in step 503, whether a partial relevance for this document i is already held in the buffer 415 (steps 301 to 303). If the buffer 415 already holds a partial relevance for the document i, a co-occurrence relationship exists between a search character string for which the partial relevance was calculated earlier and the search character string j for which it is being calculated now.

【００３２】したがって、ステップ３０３の肯定判定の
場合に、積算処理部４１４は、加算処理部２１１に共起
関係の寄与分の加算を指示し、これに応じて、加算処理
部２１１が、バッファ４１５に保持された文書ｉの部分
関連度に定数C1を加算し（ステップ３０４）、更に、ス
テップ５０４と同様に、新たに算出した部分関連度を加
算すればよい（ステップ３０５）。Therefore, on an affirmative determination in step 303, the integration processing unit 414 instructs the addition processing unit 211 to add the contribution of the co-occurrence relation. In response, the addition processing unit 211 adds the constant C1 to the partial relevance of the document i held in the buffer 415 (step 304), and the newly calculated partial relevance is then added as in step 504 (step 305).

【００３３】一方、ステップ３０３の否定判定の場合
は、ステップ５０５と同様にして、新規の部分関連度を
保持すればよい（ステップ３０６）。このように、ステ
ップ３０３の判定結果に応じて積算処理部４１４が動作
して、検索文字列ｊを含む文書それぞれについて部分関
連度を積算することにより、請求項１で述べた積算手段
１１２の機能が各文書について並行して実現されてい
る。On the other hand, on a negative determination in step 303, the new partial relevance is stored as in step 505 (step 306). In this way, the integration processing unit 414 operates according to the determination result of step 303 and accumulates the partial relevance for each document containing the search character string j, so that the function of the integrating means 112 described in claim 1 is realized for each document in parallel.
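
Steps 301 to 306 can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `accumulate_additive`, the `postings` dictionary (search string to document ID to precomputed partial relevance W_ij), and the sample numbers are all hypothetical and do not come from the patent.

```python
def accumulate_additive(postings, c1):
    """Accumulate partial relevances per document, adding c1 on co-occurrence.

    postings: hypothetical dict mapping each search string to a dict of
    {document ID: partial relevance W_ij} for the documents containing it.
    """
    buffer = {}  # stands in for buffer 415: document ID -> accumulated relevance
    for term, docs in postings.items():
        for doc_id, w in docs.items():   # steps 301-302: document ID and its W_ij
            if doc_id in buffer:         # step 303: a value is already held
                buffer[doc_id] += c1     # step 304: add the co-occurrence constant
                buffer[doc_id] += w      # step 305: add the new partial relevance
            else:
                buffer[doc_id] = w       # step 306: store the new partial relevance
    return buffer


# Hypothetical data: document 1 contains both strings, document 2 only one.
sample = {"drug": {1: 0.5, 2: 0.25}, "effect": {1: 1.0}}
print(accumulate_additive(sample, 3))  # {1: 4.5, 2: 0.25}
```

The membership test on the buffer doubles as the co-occurrence check, which is why the extra cost over the conventional accumulation is a single addition per detected co-occurrence.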

【００３４】上述したステップ３０６およびステップ３
０５の処理の終了後は、図８に示したステップ５０６〜
ステップ５０７と同様にして、検索文字列ｊを含む全て
の文書および全ての検索文字列について処理が終了した
か否かを判定し（ステップ３０７およびステップ３０
８）、ステップ３０８の肯定判定に応じて、ステップ５
０８と同様のソート処理を行って（ステップ３０９）、
処理を終了すればよい。After the processing of step 306 or step 305 described above, it is determined, as in steps 506 and 507 shown in FIG. 8, whether processing has been completed for all documents containing the search character string j and for all search character strings (steps 307 and 308). On an affirmative determination in step 308, the same sort processing as in step 508 is performed (step 309), and the process ends.

【００３５】このように、ステップ３０３の判定結果に
応じて、積算処理部４１４および加算処理部２１１が動
作することにより、請求項１で述べた検出手段１１３お
よび寄与分加算手段１１４の機能を実現し、共起関係の
検出に応じて、該当する文書ｉの部分関連度に所定の寄
与分を加算していくことができる。例えば、図９に示し
た文書１から文書４について、定数C1を数値「３」とし
て上述したランキング処理を適用した場合に、各文書に
ついて得られる関連度および共起関係による寄与分を表
１に示す。In this way, the integration processing unit 414 and the addition processing unit 211 operate according to the determination result of step 303 and thereby realize the functions of the detecting means 113 and the contribution adding means 114 described in claim 1, so that a predetermined contribution is added to the partial relevance of the corresponding document i each time a co-occurrence relation is detected. For example, Table 1 shows the relevance and the co-occurrence contribution obtained for each of documents 1 to 4 shown in FIG. 9 when the ranking process described above is applied with the constant C1 set to the value 3.

【００３６】[0036]

【表１】この場合は、文書４については、全く共起関係による寄
与分が加算されないのに対して、文書２では、検索文字
列｛薬｝と検索文字列｛効果｝との共起関係による寄与
分が加算され、文書１および文書３については、検索文
字列｛花粉症｝も含めた３つの検索文字列についての共
起関係による寄与分が加算される。[Table 1] In this case, no co-occurrence contribution at all is added for document 4, whereas for document 2 the contribution of the co-occurrence relation between the search character strings {drug} and {effect} is added, and for documents 1 and 3 the contributions of the co-occurrence relations among three search character strings, including the search character string {hay fever}, are added.

【００３７】このため、文書１が最も高く評価され、次
いで文書３、文書２の関連度として大きな値が得られ、
文書４の関連度が最も低く評価される。このように、共
起関係の寄与分として、適切な値を設定すれば、上述し
たような簡便な処理によって、共起関係を考慮して検索
文字列と文書との関連度を算出し、検索文書の取捨選択
のためのより有効な指標として、利用者に提供すること
が可能となる。As a result, document 1 is rated highest, followed by documents 3 and 2 with large relevance values, and document 4 is rated lowest. Thus, if an appropriate value is set for the contribution of the co-occurrence relation, the simple processing described above calculates the relevance between the search character strings and a document with the co-occurrence relation taken into account, and this relevance can be provided to the user as a more effective index for selecting among the retrieved documents.

【００３８】なお、上述した所定の定数C1の値は、例え
ば、様々な文書と検索文字列との組み合わせについて関
連度を算出してランキングを行い、その結果が、人間の
主観的な判断により近くなるように決定すればよい。ま
た、上述したステップ３０２の処理は、請求項３で述べ
た部分関連度評価手順に相当し、また、ステップ３０３
は検出手順に、ステップ３０４は共起関係評価手順に相
当しており、ステップ３０５およびステップ３０６は積
算手順に相当している。The value of the predetermined constant C1 may be determined, for example, by calculating the relevance and ranking the results for various combinations of documents and search character strings, and choosing the value that brings the ranking closest to subjective human judgment. The processing of step 302 corresponds to the partial relevance evaluation procedure described in claim 3, step 303 corresponds to the detection procedure, step 304 corresponds to the co-occurrence relation evaluation procedure, and steps 305 and 306 correspond to the integration procedure.

【００３９】したがって、上述したステップ３０１から
ステップ３０９の処理をコンピュータに実行させるプロ
グラムを記録した記憶媒体を頒布すれば、広い範囲の利
用者に共起関係を考慮した関連度に基づくランキング結
果を提供し、検索された文書から重要度の高い文書を選
択する作業を支援することができる。ところで、上述し
たように、共起関係による寄与分を所定の定数とした場
合は、全ての共起関係を等価なものとして評価している
ため、共起関係にある検索文字列の統計的特徴が関連度
に全く反映されていなかった。Therefore, by distributing a storage medium storing a program that causes a computer to execute the processing of steps 301 to 309 described above, ranking results based on a relevance that takes co-occurrence relations into account can be provided to a wide range of users, supporting the task of selecting highly important documents from among the retrieved documents. However, when the contribution of the co-occurrence relation is a predetermined constant as described above, all co-occurrence relations are evaluated as equivalent, so the statistical characteristics of the co-occurring search character strings are not reflected in the relevance at all.

【００４０】次に、共起関係にある検索文字列の統計的
特徴を考慮して、関連度を算出する方法について説明す
る。ここで、検索文字列ａ，ｂが文書ｉに含まれていた
場合に、これらの検索文字列を指定する検索式と文書ｉ
の関連度Ｓは、検索文字列ａと検索文字列ｂとが共に出
現する文書数（以下、共起文書数と称する）df_abおよび
文書ｉにおいて検索文字列ａ，ｂが共起して出現した度
数（以下、共起度数と称する）tf_iabを用いて、式(６)
のように表される。Next, a method of calculating the relevance in consideration of the statistical characteristics of co-occurring search character strings will be described. When search character strings a and b are contained in the document i, the relevance S between the document i and a search expression specifying these search character strings is expressed by equation (6), using the number of documents in which both a and b appear (hereinafter, the co-occurrence document count) df_ab and the number of times a and b co-occur in the document i (hereinafter, the co-occurrence frequency) tf_iab.

【００４１】[0041]

【数６】検索文字列ａの出現文書数dfa が検索文字列ｂの出現文
書数dfb よりも小さければ、当然ながら、検索文字列
ａ，ｂについての共起文書数df_abは、検索文字列ａの出
現文書数dfa と等しいかそれよりも小さい。中でも、検
索文字列ａを含む文書の集合Ｊ_aが検索文字列ｂを含む
文書の集合Ｊ_bに完全に含まれている場合は、検索文字
列ａ，ｂについての共起文書数df_abは、検索文字列ａの
出現文書数dfa に等しくなる。(Equation 6) If the number of appearing documents df_a of the search character string a is smaller than the number of appearing documents df_b of the search character string b, then the co-occurrence document count df_ab for a and b is naturally less than or equal to df_a. In particular, if the set J_a of documents containing the search character string a is completely contained in the set J_b of documents containing the search character string b, then df_ab equals df_a.

【００４２】したがって、上述した条件（Ｊ_a⊆Ｊ_b）
が成り立つ場合に、共起度数tf_iabを検索文字列ａの出
現度数tf_aで代用すれば、式(６)に含まれる共起関係の
寄与分を検索文字列ａについての部分関連度Ｗ_iaによっ
て置き換えることが可能であり、式(７)に示すように個
々の検索文字列に関する部分関連度Ｗ_ijの関数として、
共起関係の寄与分を含む関連度Ｓi'を表すことができ
る。Therefore, when the above condition (J_a ⊆ J_b) holds, substituting the occurrence frequency tf_a of the search character string a for the co-occurrence frequency tf_iab allows the co-occurrence contribution in equation (6) to be replaced by the partial relevance W_ia for the search character string a. The relevance S_i' including the co-occurrence contribution can then be expressed as a function of the partial relevances W_ij of the individual search character strings, as shown in equation (7).

【００４３】[0043]

【数７】もちろん、上述した条件が成立する場合は、極めて限ら
れた特殊な場合であるから、式(７)をそのまま一般化す
ることはできない。しかし、例えば、式(８)に示すよう
に、検索文字列ａについての部分関連度に定数C2を乗じ
ることによって共起関係の寄与分を含んだ項を表し、こ
の定数C2の値を適切に決定すれば、一般の場合について
の共起関係による寄与分を含んだ関連度Ｓi の近似値に
なると期待できる。(Equation 7) Of course, the above condition holds only in very limited special cases, so equation (7) cannot be generalized as it stands. However, if the term including the co-occurrence contribution is expressed by multiplying the partial relevance for the search character string a by a constant C2, as shown in equation (8), and the value of C2 is chosen appropriately, the result can be expected to approximate the relevance S_i including the co-occurrence contribution in the general case.

【００４４】[0044]

【数８】同様にして、検索文字列ａ，ｂ、ｃが文書ｉに含まれて
いた場合に、これらの検索文字列を指定する検索式と文
書ｉの関連度Ｓi の近似値を求める方法について説明す
る。この場合は、近似のためのモデルとして、検索文字
列ａを含む文書の集合Ｊ_aが検索文字列ｂを含む文書の
集合Ｊ_bに完全に含まれており、更に、検索文字列ｂを
含む文書の集合Ｊ_bが検索文字列ｃを含む文書の集合Ｊ
_cに完全に含まれている場合を考えればよい。(Equation 8) Similarly, a method of obtaining an approximate value of the relevance S_i between the document i and a search expression specifying search character strings a, b, and c, when all three are contained in the document i, will be described. As the model for the approximation, consider the case where the set J_a of documents containing the search character string a is completely contained in the set J_b of documents containing b, and J_b is in turn completely contained in the set J_c of documents containing c.

【００４５】上述したモデルでは、検索文字列の集合
｛ａ，ｂ，ｃ｝の部分集合のうち検索文字列ａをその一
部として含む部分集合についての共起文書数df_ab,df_ac,
df_abcは、検索文字列ａの出現文書数df_aに等しく、ま
た、検索文字列ｂ、ｃについての共起文書数df_bcは、検
索文字列ｂの出現文書数df_bに等しい。このことを利用
して、式(５)を変形すれば、上述した条件(Ｊ_a⊆Ｊ_b⊆
Ｊ_c) が成り立つ場合における文書ｉの関連度Ｓi'は、
式(９)に示すように、各検索文字列の部分関連度Ｗ_ijの
関数として表される。In this model, among the subsets of the search character string set {a, b, c}, the co-occurrence document counts df_ab, df_ac, and df_abc of the subsets that include the search character string a are all equal to the number of appearing documents df_a of a, and the co-occurrence document count df_bc for b and c is equal to the number of appearing documents df_b of b. Using this fact to transform equation (5), the relevance S_i' of the document i when the above condition (J_a ⊆ J_b ⊆ J_c) holds is expressed as a function of the partial relevances W_ij of the search character strings, as shown in equation (9).

【００４６】[0046]

【数９】したがって、一般の場合についての関連度の近似値Ｓi
は、上述した定数C2と各検索文字列の部分関連度Ｗ_ijと
を用いて、式(１０)に示すように表される。(Equation 9) Therefore, the approximate value S_i of the relevance in the general case is expressed as shown in equation (10), using the constant C2 described above and the partial relevance W_ij of each search character string.
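
The equation images for (6) to (10) are not reproduced in this text. As a hedged reconstruction from the surrounding prose only, and not the patent's typeset formulas, the nested-set argument can be sketched as follows. The symbols W_{i,ab}, W_{i,ac}, W_{i,bc}, and W_{i,abc} for the co-occurrence terms are notation assumed here for illustration.

```latex
% Two search strings, assuming J_a \subseteq J_b, so that the co-occurrence
% term W_{i,ab} can be replaced by W_{ia}:
S_i' = W_{ia} + W_{ib} + W_{i,ab}
     \;\longrightarrow\; 2\,W_{ia} + W_{ib},
\qquad
S_i \approx C_2\,W_{ia} + W_{ib}.

% Three search strings, assuming J_a \subseteq J_b \subseteq J_c, so that the
% ab, ac and abc terms reduce to W_{ia} and the bc term reduces to W_{ib}:
S_i' = W_{ia} + W_{ib} + W_{ic} + W_{i,ab} + W_{i,ac} + W_{i,bc} + W_{i,abc}
     \;\longrightarrow\; 4\,W_{ia} + 2\,W_{ib} + W_{ic},
\qquad
S_i \approx C_2\,(C_2\,W_{ia} + W_{ib}) + W_{ic}.
```

Under this reading, C_2 = 2 reproduces the nested special case exactly, and other values of C_2 give the approximation intended for the general case.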

【数１０】同様に、ｎ個の検索文字列term₁〜term_nの出現文書数
df_term1〜df_termnが式(１１)に示す関係を満たしている
とすれば、これらの検索文字列term₁〜term_nを含む文
書ｉの共起関係を考慮した関連度の近似値Ｓi は、式
(１２)によって表すことができる。(Equation 10) Similarly, if the numbers of appearing documents df_term1 to df_termn of n search character strings term_1 to term_n satisfy the relation shown in equation (11), the approximate value S_i of the relevance, taking co-occurrence relations into account, of a document i containing these search character strings term_1 to term_n can be expressed by equation (12).

【００４７】[0047]

【数１１】 [Equation 11]

【数１２】ここで、式(１２)を導くために想定したモデルにおいて
は、検索文字列term₁〜term_nが出現する文書の集合Ｊ
_term1〜Ｊ_termnは、式(１３)に示す入れ子状の包含関係
を満たしていることを考えれば、式(１２)を式(１４)に
示すように変形することができる。(Equation 12) In the model assumed for deriving equation (12), the sets J_term1 to J_termn of documents in which the search character strings term_1 to term_n appear satisfy the nested inclusion relation shown in equation (13). With this in mind, equation (12) can be transformed as shown in equation (14).

【００４８】[0048]

【数１３】 (Equation 13)

【数１４】これにより、式(１１)に示す大小関係に従って、出現文
書数が小さい検索文字列から順序に部分関連度の算出対
象としていけば、図４に示した手順と同様の関連度の算
出過程において、ステップ３０３と同様にして共起関係
を見いだすごとに、既に算出された部分関連度の積算結
果に定数C2を乗じることにより、既にその寄与分が算出
されている検索文字列と注目している検索文字列との共
起関係による寄与分を含めた部分和を得られることが分
かる。[Equation 14] It follows that if the search character strings are taken as targets of the partial relevance calculation in ascending order of the number of appearing documents, following the magnitude relation shown in equation (11), then in a relevance calculation process similar to the procedure shown in FIG. 4, each time a co-occurrence relation is found as in step 303, multiplying the accumulated partial relevance calculated so far by the constant C2 yields a partial sum that includes the contribution of the co-occurrence relation between the search character strings whose contributions have already been calculated and the search character string currently under consideration.
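
The multiply-then-add step described above behaves like Horner's scheme for evaluating a polynomial in C2. The sketch below checks this for a single document. The closed form in `weighted_sum` (the k-th of n values weighted by C2 to the power n - k) is an assumed reading of equation (12), whose image is not reproduced here, and the function names and sample values are hypothetical.

```python
def accumulate_multiplicative(partial_relevances, c2):
    """Accumulate one document's partial relevances, scaling by c2 on co-occurrence.

    partial_relevances: W values for the search strings found in the document,
    listed in ascending order of the strings' document counts.
    """
    total = None  # nothing held for this document yet
    for w in partial_relevances:
        if total is None:
            total = w               # first string seen: just store its relevance
        else:
            total = total * c2 + w  # co-occurrence: scale the running sum, then add
    return total


def weighted_sum(partial_relevances, c2):
    """Assumed closed form: the k-th of n values is weighted by c2 ** (n - k)."""
    n = len(partial_relevances)
    return sum(c2 ** (n - k) * w for k, w in enumerate(partial_relevances, start=1))


weights = [1.0, 2.0, 3.0]  # hypothetical W values for three search strings
print(accumulate_multiplicative(weights, 2))  # 11.0
print(weighted_sum(weights, 2))               # 11.0
```

Because the running sum is rescaled in place, the rarest search string (processed first) ends up with the largest weight, without any per-pair co-occurrence statistics being looked up.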

【００４９】このように、それまでに算出対象となった
検索文字列についての部分関連度の関数として、処理済
みの検索文字列と現在注目している検索文字列との共起
関係の寄与分を表すことにより、従来の関連度算出処理
とほぼ同等の計算量で、共起関係にある検索文字列の統
計的特徴を考慮した関連度を求めることが可能となる。
図５に、請求項４の情報検索システムの実施形態を示
す。In this way, by expressing the contribution of the co-occurrence relation between the already processed search character strings and the search character string currently under consideration as a function of the partial relevances of the search character strings processed so far, a relevance that takes into account the statistical characteristics of co-occurring search character strings can be obtained with almost the same amount of computation as the conventional relevance calculation. FIG. 5 shows an embodiment of the information retrieval system according to claim 4.

【００５０】この場合は、図５に示すように、図３に示
した加算処理部２１１に代えて、乗算処理部２１２を備
えて関連度算出装置を構成し、この乗算処理部２１２
が、積算処理部４１４からの指示に応じて、バッファ４
１５に保持された該当する部分関連度に定数C2を乗じ、
この乗算結果をバッファ４１５に保持する構成とすれば
よい。In this case, as shown in FIG. 5, the relevance calculating device is configured with a multiplication processing unit 212 in place of the addition processing unit 211 shown in FIG. 3. In response to an instruction from the integration processing unit 414, the multiplication processing unit 212 multiplies the corresponding partial relevance held in the buffer 415 by the constant C2 and stores the product back in the buffer 415.

【００５１】また、図５において、順序決定部２１３
は、受付処理部４１１を介して検索文字列の集合Ｑを受
け取り、これらの検索文字列に対応する出現文書数に基
づいて、部分関連度の算出順序を決定する構成となって
いる。この順序決定部２１３は、例えば、検索文字列の
集合の入力に応じて、検索手段１０２に相当するデータ
ベース管理システム４０２に対して対応する出現文書数
を要求し、受け取った出現文書数が小さい順に検索文字
列の順序を並べ替えて、この一連の検索文字列を検索制
御部４１２の処理に供すればよい。In FIG. 5, the order determination unit 213 receives the set Q of search character strings via the reception processing unit 411 and determines the order in which partial relevances are calculated, based on the number of appearing documents for each search character string. For example, on receiving the set of search character strings, the order determination unit 213 requests the corresponding numbers of appearing documents from the database management system 402, which corresponds to the search means 102, rearranges the search character strings in ascending order of the numbers received, and passes the resulting sequence to the search control unit 412 for processing.
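
The ordering performed by the order determination unit 213 amounts to a sort keyed on document count. A minimal sketch, assuming the counts have already been fetched into a dictionary (here `document_counts`, a hypothetical stand-in for the reply from the database management system 402), could look like this. The counts themselves are invented for illustration.

```python
def order_search_strings(search_strings, document_counts):
    """Return the search strings in ascending order of document count."""
    return sorted(search_strings, key=lambda s: document_counts[s])


# Hypothetical counts; the patent does not give these numbers.
document_counts = {"drug": 120, "effect": 300, "hay fever": 8}
print(order_search_strings(["drug", "effect", "hay fever"], document_counts))
# ['hay fever', 'drug', 'effect']
```

Python's `sorted` is stable, so search strings with equal counts keep their input order in this sketch.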

【００５２】図６に、ランキング処理動作を表す流れ図
を示す。この場合は、関連度算出処理に先立って、ま
ず、上述した順序決定部２１３が請求項４で述べた整列
手段１２１として動作し、検索文字列を部分関連度の算
出対象とする順序に並べ替え（ステップ３１１）、この
順序に従って、検索制御部４１２は、データベース管理
システム４０２に検索文字列を指定して該当する文書を
検索し、得られた文書IDの列とともに、この検索文字列
を部分関連度の算出対象として演算処理部４１３に送出
すればよい。FIG. 6 is a flowchart showing the ranking processing operation. In this case, before the relevance calculation, the order determination unit 213 described above first operates as the alignment means 121 described in claim 4 and rearranges the search character strings into the order in which their partial relevances are to be calculated (step 311). Following this order, the search control unit 412 specifies each search character string to the database management system 402 to retrieve the matching documents, and sends the search character string, together with the resulting list of document IDs, to the arithmetic processing unit 413 as the target of the partial relevance calculation.

【００５３】これに応じて、演算処理部４１３および積
算処理部４１４により、図４に示したステップ３０１か
らステップ３０３と同様の処理が行われる（ステップ３
１２〜ステップ３１４）。このとき、演算処理部４１３
により、請求項４で述べた部分関連度算出手段１２２の
機能が果たされ、また、積算処理部４１４が、ステップ
３１４の判定処理を行うことにより、共起関係検出手段
１２４の機能が果たされている。In response, the arithmetic processing unit 413 and the integration processing unit 414 perform the same processing as steps 301 to 303 shown in FIG. 4 (steps 312 to 314). Here the arithmetic processing unit 413 fulfills the function of the partial relevance calculating means 122 described in claim 4, and the integration processing unit 414 fulfills the function of the co-occurrence relation detecting means 124 by performing the determination of step 314.

【００５４】このステップ３１４の肯定判定に応じて、
乗算処理部２１２が請求項４で述べた寄与分付加手段１
２５として動作し、ステップ３１５において、バッファ
４１５に保持されている該当する文書ｉの部分関連度Ｗ
_ijに定数C2を乗じ、更に、ステップ３１６において、積
算処理部４１４が、新たに算出された部分関連度を加算
することにより、処理済みの検索文字列と現在注目して
いる検索文字列ｊとの共起関係の寄与分を含んだ部分関
連度の部分和が得られ、文書ｉに対応する部分関連度Ｗ
_ijとしてバッファ４１５に保持される。On an affirmative determination in step 314, the multiplication processing unit 212 operates as the contribution adding means 125 described in claim 4 and, in step 315, multiplies the partial relevance W_ij of the corresponding document i held in the buffer 415 by the constant C2. Then, in step 316, the integration processing unit 414 adds the newly calculated partial relevance. This yields a partial sum of partial relevances that includes the contribution of the co-occurrence relation between the already processed search character strings and the search character string j currently under consideration, and the partial sum is held in the buffer 415 as the partial relevance W_ij for the document i.

【００５５】一方、ステップ３１４の否定判定の場合
は、ステップ３０６と同様にして、新規の部分関連度を
保持すればよい（ステップ３１７）。このように、ステ
ップ３１４の判定結果に応じて、積算処理部４１４が動
作し、バッファ４１５内の該当する文書ｉの部分関連度
の値を操作することにより、請求項４で述べた集計手段
１２３の機能が果たされている。On the other hand, on a negative determination in step 314, the new partial relevance is stored as in step 306 (step 317). In this way, the integration processing unit 414 operates according to the determination result of step 314 and updates the partial relevance value of the corresponding document i in the buffer 415, thereby fulfilling the function of the totaling means 123 described in claim 4.

【００５６】また、この場合は、順序決定部２１３によ
って決定された順序で部分関連度の算出処理を行い、部
分関連度の算出過程において、上述したステップ３１４
の判定結果に応じて、乗算処理部２１２が動作すること
により、請求項２で述べた共起関係評価手段１１５の機
能が果たされており、共起関係による寄与分を含んだ関
連度が、共起関係にある各検索文字列についての部分関
連度の関数として評価されている。In this case, the partial relevances are calculated in the order determined by the order determination unit 213, and during that calculation the multiplication processing unit 212 operates according to the determination result of step 314. This fulfills the function of the co-occurrence relation evaluating means 115 described in claim 2: the relevance including the co-occurrence contribution is evaluated as a function of the partial relevances of the co-occurring search character strings.

【００５７】上述したステップ３１７および上述したス
テップ３１６の処理の終了後は、図４に示したステップ
３０７〜ステップ３０８と同様にして、検索文字列ｊを
含む全ての文書および全ての検索文字列について処理が
終了したか否かを判定し（ステップ３１８〜ステップ３
１９）、ステップ３１９の肯定判定に応じて、ソート処
理部４１６が順位付け手段１２６として動作し、ステッ
プ３０９と同様のソート処理を行って（ステップ３２
０）、処理を終了すればよい。After the processing of step 317 or step 316 described above, it is determined, as in steps 307 and 308 shown in FIG. 4, whether processing has been completed for all documents containing the search character string j and for all search character strings (steps 318 and 319). On an affirmative determination in step 319, the sort processing unit 416 operates as the ranking means 126 and performs the same sort processing as in step 309 (step 320), and the process ends.
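
Putting steps 311 to 320 together, the whole ranking pass can be sketched as below. This is an illustration under stated assumptions rather than the patent's implementation: `rank_documents` and the `postings` structure are hypothetical, the document count of each search string is taken to be the number of documents retrieved for it, and the sample W values are invented (they are not the contents of Table 2).

```python
def rank_documents(postings, c2):
    """Rank documents by relevance with a multiplicative co-occurrence factor.

    postings: hypothetical dict mapping each search string to a dict of
    {document ID: partial relevance W_ij} for the documents containing it.
    Returns (document ID, relevance) pairs, highest relevance first.
    """
    # Step 311: process search strings in ascending order of document count.
    ordered = sorted(postings, key=lambda term: len(postings[term]))
    buffer = {}  # stands in for buffer 415
    for term in ordered:                                  # loop ended by step 319
        for doc_id, w in postings[term].items():          # steps 312-313
            if doc_id in buffer:                          # step 314: co-occurrence
                buffer[doc_id] = buffer[doc_id] * c2 + w  # steps 315-316
            else:
                buffer[doc_id] = w                        # step 317
    # Step 320: sort documents by accumulated relevance, highest first.
    return sorted(buffer.items(), key=lambda item: item[1], reverse=True)


# Hypothetical partial relevances for four documents.
sample = {
    "effect": {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0},
    "drug": {1: 1.0, 2: 1.0, 3: 1.0},
    "hay fever": {1: 2.0, 3: 1.0},
}
print(rank_documents(sample, 2))
# [(1, 11.0), (3, 7.0), (2, 3.0), (4, 1.0)]
```

Compared with a plain sum of partial relevances, the only additions are one sort of the search strings up front and one multiplication per detected co-occurrence.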

【００５８】上述したように、従来の関連度算出処理
に、検索文字列をソートする処理と、共起関係の検出に
応じて定数を乗じる処理を加えることにより、検索文字
列についての統計的情報を考慮しつつ、共起関係の寄与
分を含んだ関連度を算出し、この関連度に応じた順位を
付与することができ、請求項４で述べたランキング処理
手段１０３の機能を実現することができる。As described above, by adding to the conventional relevance calculation a step that sorts the search character strings and a step that multiplies by a constant when a co-occurrence relation is detected, a relevance that includes the co-occurrence contribution can be calculated with the statistical information about the search character strings taken into account, and ranks can be assigned according to this relevance. The function of the ranking processing means 103 described in claim 4 is thereby realized.

【００５９】このために付加された処理によって増大す
る計算量は極く小さく、その一方、共起関係の対象とな
る検索文字列の統計的特徴を考慮した共起関係による寄
与分を評価したことによって、各文書について得られた
関連度を人間が主観的に評価する重要度に近似させるこ
とができる。例えば、図９に示した文書１〜文書４につ
いて、部分関連度の評価式として式３を用い、共起関係
の寄与分の算出に利用する定数C2の値として数値「２」
を与えて関連度を計算した結果を表２に示す。The added processing increases the amount of computation only very slightly. At the same time, because the co-occurrence contribution is evaluated in consideration of the statistical characteristics of the co-occurring search character strings, the relevance obtained for each document can be brought close to the importance that a human would assign subjectively. For example, Table 2 shows the relevance calculated for documents 1 to 4 shown in FIG. 9, using equation (3) as the evaluation formula for the partial relevance and the value 2 for the constant C2 used to calculate the co-occurrence contribution.

【００６０】[0060]

【表２】この場合は、共起関係の寄与分は、関係する検索文字列
の出現度数および出現文書数の逆数を重みとして評価さ
れる。したがって、検索文字列｛薬｝や検索文字列｛効
果｝のように、比較的多くの文書に出現する検索文字列
についての共起関係よりも、検索文字列｛花粉症｝のよ
うに限られた文書中にのみ出現する検索文字列について
の共起関係を重視し、しかも、重要視した共起関係の出
現頻度に応じて、高い評価値を与えることができる。[Table 2] In this case, the co-occurrence contribution is evaluated with the occurrence frequency of the search character strings involved and the reciprocal of their number of appearing documents as weights. Consequently, co-occurrence relations involving a search character string that appears only in a limited set of documents, such as {hay fever}, are weighted more heavily than co-occurrence relations among search character strings that appear in relatively many documents, such as {drug} and {effect}, and a high evaluation value can be given according to how frequently the heavily weighted co-occurrence relations appear.

【００６１】なお、上述した例では、共起関係の寄与分
に用いる定数C2の値として数値「２」を用いたが、十分
な数の文書について、図５に示した関連度算出装置で得
られたランキング結果と、人間によるランキング結果と
を照合する実験などに基づいて、適切な定数C2の値を求
めれば、更に、人間による重要度の評価結果に近いラン
キング結果を得ることが期待できる。In the above-described example, the numerical value "2" is used as the value of the constant C2 used for the contribution of the co-occurrence relationship. However, a sufficient number of documents are obtained by the relevance calculating device shown in FIG. If an appropriate value of the constant C2 is obtained based on an experiment for collating the obtained ranking result with the ranking result by a human, it is expected that a ranking result closer to the evaluation result of the importance by a human will be obtained.

【００６２】また、演算処理部４１３において利用する
評価式として、参考文献などで提案された様々な評価式
を用いれば、その評価式による特徴を継承することがで
きる。Further, if various evaluation formulas proposed in reference documents and the like are used as the evaluation formulas used in the arithmetic processing unit 413, the features of the evaluation formulas can be inherited.

【００６３】[0063]

【発明の効果】以上に説明したように、請求項１の発明
によれば、極めて単純な処理によって共起関係の寄与分
を関連度の評価結果に付加することができるので、関連
度算出処理の計算量を大幅に増大させることなく、共起
関係を考慮した関連度を求めることができる。As described above, according to the invention of claim 1, the contribution of the co-occurrence relation can be added to the relevance evaluation result by very simple processing, so a relevance that takes co-occurrence relations into account can be obtained without significantly increasing the amount of computation of the relevance calculation.

【００６４】これにより、例えば、この関連度に基づい
て、検索文書をランキングすれば、利用者が検索文書を
重要度に応じて選別する際に、より有効な指標を与える
ことができる。請求項２および請求項３の発明によれ
ば、共起関係による寄与分を既に処理済みの属性につい
て得られた部分関連度の関数によって近似することによ
り、各属性の統計的特徴に相当する重みを共起関係の寄
与分に付与することができるから、共起関係による寄与
分をより精密に評価することができる。Thus, for example, if the retrieved documents are ranked based on this relevance, the user is given a more effective index for selecting among them according to importance. According to the inventions of claims 2 and 3, the co-occurrence contribution is approximated by a function of the partial relevances obtained for the attributes already processed, which gives the co-occurrence contribution a weight corresponding to the statistical characteristics of each attribute, so the co-occurrence contribution can be evaluated more precisely.

【００６５】特に、請求項４の発明は、各検索文字列に
ついての出現文書数に応じた順序で部分関連度を算出す
ることにより、処理済みの検索文字列についての部分関
連度の部文和に定数を乗じることによって、検索文字列
の統計的特徴に相当する重みを用いて精密に評価した共
起関係による寄与分を含んだ関連度の近似値を求めるこ
とができる。In particular, the invention of claim 4 calculates the partial relevances in an order determined by the number of appearing documents for each search character string. Multiplying the partial sum of the partial relevances for the already processed search character strings by a constant then yields an approximate value of the relevance that includes the co-occurrence contribution, evaluated precisely with weights corresponding to the statistical characteristics of the search character strings.

[Brief description of the drawings]

【図１】請求項１および請求項２の関連度算出装置の原
理ブロック図である。FIG. 1 is a principle block diagram of a relevance calculating apparatus according to claims 1 and 2;

【図２】請求項４の情報検索システムの原理ブロック図
である。FIG. 2 is a block diagram showing the principle of an information retrieval system according to claim 4;

【図３】請求項１の関連度算出装置を適用した情報検索
システムの構成を示す図である。FIG. 3 is a diagram showing a configuration of an information search system to which the relevance calculating device of claim 1 is applied.

【図４】ランキング処理動作を表す流れ図である。FIG. 4 is a flowchart illustrating a ranking processing operation.

【図５】請求項４の情報検索システムの実施形態を示す
図である。FIG. 5 is a diagram showing an embodiment of an information search system according to claim 4;

【図６】ランキング処理動作を表す流れ図である。FIG. 6 is a flowchart illustrating a ranking processing operation.

【図７】従来の関連度算出装置を適用した情報検索シス
テムの構成例を示す図である。FIG. 7 is a diagram showing a configuration example of an information search system to which a conventional relevance calculating device is applied.

【図８】従来のランキング処理動作を表す流れ図であ
る。FIG. 8 is a flowchart showing a conventional ranking processing operation.

【図９】文書情報およびランキング結果の例を示す図で
ある。FIG. 9 is a diagram illustrating an example of document information and a ranking result.

[Explanation of symbols]

101, 403: 文書データベース (Document database)
102: 検索手段 (Search means)
103: ランキング処理手段 (Ranking processing means)
111: 部分関連度評価手段 (Partial relevance evaluation means)
112: 積算手段 (Accumulation means)
113: 検出手段 (Detection means)
114: 寄与分加算手段 (Contribution addition means)
115: 共起関係評価手段 (Co-occurrence relation evaluation means)
121: 整列手段 (Alignment means)
122: 部分関連度算出手段 (Partial relevance calculation means)
123: 集計手段 (Aggregation means)
124: 共起関係検出手段 (Co-occurrence relationship detection means)
125: 寄与分付加手段 (Contribution addition means)
126: 順位付け手段 (Ranking means)
211: 加算処理部 (Addition processing unit)
212: 乗算処理部 (Multiplication processing unit)
213: 順序決定部 (Order determination unit)
401: キーボード (Keyboard)
402: データベース管理システム（ＤＢＭＳ） (Database management system (DBMS))
404: 表示処理部 (Display processing unit)
405: ＣＲＴディスプレイ（ＣＲＴ） (CRT display (CRT))
411: 受付処理部 (Reception processing unit)
412: 検索制御部 (Search control unit)
413: 演算処理部 (Arithmetic processing unit)
414: 積算処理部 (Integration processing unit)
415: バッファ (Buffer)
416: ソート処理部 (Sort processing unit)

Claims

[Claims]

1. A relevance calculating apparatus for calculating the relevance between two pieces of document information by focusing on a plurality of attributes specified by one of the pieces of document information, comprising: partial relevance evaluating means for evaluating, for each of the plurality of attributes, a partial relevance indicating the relevance of the two pieces of document information; integrating means for integrating the evaluation values obtained for the respective attributes; detecting means for detecting, from among the plurality of attributes, combinations of attributes that are commonly included in the other piece of document information; and contribution adding means for adding, for each combination of attributes detected by the detecting means, a predetermined constant to the integration result of the integrating means as the contribution of the co-occurrence relation to the relevance.

2. A relevance calculating apparatus for calculating the relevance between two pieces of document information by focusing on a plurality of attributes specified by one of the pieces of document information, comprising: partial relevance evaluating means for evaluating, for each of the plurality of attributes, a partial relevance indicating the relevance of the two pieces of document information; integrating means for integrating the evaluation values obtained for the respective attributes; detecting means for detecting, from among the plurality of attributes, combinations of attributes that are commonly included in the other piece of document information; and co-occurrence relation evaluating means for evaluating, for each combination of attributes detected by the detecting means, an approximate value using a function of the partial relevances obtained by the partial relevance evaluating means for the co-occurring attributes, and adding the approximate value to the integration result of the integrating means.

3. A storage medium storing a relevance calculation program for calculating the relevance between two pieces of document information by focusing on a plurality of attributes specified by one of the pieces of document information, the program causing a computer to execute: a partial relevance evaluation procedure of evaluating, for each of the plurality of attributes, a partial relevance indicating the relevance of the two pieces of document information; an integration procedure of integrating the evaluation values obtained for the respective attributes; a detection procedure of detecting, from among the plurality of attributes, combinations of attributes that are commonly included in the other piece of document information; and a co-occurrence relation evaluation procedure of evaluating, for each combination of attributes detected in the detection procedure, an approximate value using a function of the partial relevances obtained in the partial relevance evaluation procedure for the co-occurring attributes, and adding the approximate value to the integration result of the integration procedure.

4. An information retrieval system comprising ranking processing means that, in response to input of a search expression including a plurality of search character strings, assigns to a set of documents retrieved from a document database by search means a rank according to the degree of relevance to the search expression, wherein the ranking processing means comprises: alignment means that receives the plurality of search character strings and sorts them in ascending order of the number of appearing documents, which indicates how often each search character string is included in the documents stored in the document database; partial relevance calculating means that sequentially receives the search character strings from the alignment means and calculates, for each document retrieved by the search means as including the search character string, a partial relevance indicating the relevance between that document and the search expression with attention to the search character string; totaling means that integrates, for each document, the partial relevances obtained by the partial relevance calculating means; co-occurrence relation detecting means that, each time the partial relevance for a new search character string is calculated by the partial relevance calculating means, detects, for each applicable document, a co-occurrence relation between that search character string and any of the search character strings whose contributions have already been totaled; contribution adding means that, each time a co-occurrence relation is detected by the co-occurrence relation detecting means, multiplies the totaled result by a predetermined constant to obtain a totaled result including the contribution of the detected co-occurrence relation; and ranking means that ranks the retrieved documents by their degree of relevance to the search expression based on the totaled result of the totaling means and provides the result to search result output processing.