JP4388092B2

JP4388092B2 - Structured document database management system and program

Info

Publication number: JP4388092B2
Application number: JP2007081625A
Authority: JP
Inventors: 谷州子望月
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2007-03-27
Filing date: 2007-03-27
Publication date: 2009-12-24
Anticipated expiration: 2027-03-27
Also published as: JP2008242743A

Description

The present invention relates to a structured document database management system that manages a structured document database storing structured documents such as XML documents, and more particularly to a structured document database management system and program suitable for creating an index of the words contained in a structured document when that document is stored in the database.

In a structured document, a hierarchical structure is expressed by character strings called tags. A widely known example is a document in XML (Extensible Markup Language) format, that is, an XML document. XML features hierarchical organization of data through semantically meaningful tags and a freely extensible structure. The XML database (XMLDB) is a known XML technology that exploits these features. Under the control of an XML database management system (XMLDBMS), an XML database provides a function for storing and managing the hierarchical structure of XML data in its original form and a function for searching with a query language.

In an XML database, a technique for building an index from the words contained in a stored XML document is used so that searches conditioned on those words run quickly (see, for example, Patent Document 1). Such an index is generally built when the XML document is stored in the XML database, by analyzing the document and determining words from the sequence of characters.

Patent Document 2, for example, describes a technique for compressing a structured document by replacing each word character string extracted from the document with an abbreviated character string shorter than that word. According to Patent Document 2, each input character is checked to see whether it is a blank; if it is, the character string accumulated in memory up to the point where the blank was recognized is replaced with a replacement character. In other words, the text is split at every blank and each resulting string is replaced with a replacement character, which shortens the structured document.
JP 2006-172268 A
JP 2001-67348 A

In an index construction technique such as that described in Patent Document 1 (the prior art), whitespace characters such as the space, tab, line feed (LF), and carriage return (CR) are handled without distinction from other characters. That is, in the prior art a whitespace character is just one more character of the text. Consequently, a word containing whitespace characters is recognized only as a word that contains those whitespace characters.

When a user creates an XML document, the document may be formatted so that it is easy for a person to read on a display screen (or on a printed page). Inserting whitespace characters between the characters of a meaningful word is one such form of formatting. In the prior art, when such a word is chosen as a search condition, the XML document containing it is not found.

It is therefore conceivable to apply the technique of Patent Document 2 and extract the words to be indexed by splitting the character strings in a structured document at each whitespace character. However, whitespace characters are not used only for display formatting; some are needed as part of the XML document. Simply splitting at every whitespace character therefore does not necessarily make it possible to find the intended XML document.
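The two shortcomings described above can be illustrated with a minimal Python sketch. The sample string and the function names `prior_art_match` and `split_on_whitespace` are assumptions made for illustration only; they are not taken from either cited document.

```python
# Illustrative sample: the word 構造情報 is broken by a line feed and two tabs.
TEXT = "取り出した構造情\n\t\t報と原文"


def prior_art_match(text: str, keyword: str) -> bool:
    """Treat whitespace as ordinary characters (prior-art behaviour)."""
    return keyword in text


def split_on_whitespace(text: str) -> list[str]:
    """Naively split at every run of whitespace."""
    return text.split()


# The keyword is not found because the whitespace sits inside the word.
print(prior_art_match(TEXT, "構造情報"))
# Splitting at whitespace cuts the word in two instead of restoring it.
print(split_on_whitespace(TEXT))
```

Neither approach yields the word 構造情報, which is the gap the invention addresses.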

The present invention was made in view of the above circumstances. Its object is to provide a structured document database management system and program that can build an index of words that are meaningful as part of the document, by removing, at index construction time, only those whitespace characters in a structured document that were inserted for formatting.

According to one aspect of the present invention, a structured document database management system is provided. The system comprises: syntax analysis means for analyzing the structure of a structured document that is to be stored in a structured document database and in which a hierarchical structure is expressed using tags, and for detecting text that appears after a tag; whitespace information storage means for accumulating whitespace strings that appear in the structured document analyzed by the syntax analysis means and carry no meaning as part of the structured document, each in association with structure information representing the structure related to that whitespace string; ignorable whitespace determination means for determining whether the text detected by the syntax analysis means is ignorable whitespace consisting solely of whitespace characters; first whitespace determination means for, when the text is ignorable whitespace and a start tag follows it, treating the ignorable whitespace as formatting whitespace related to the structure of the next element, which contains that start tag, and accumulating it in the whitespace information storage means in association with structure information representing that structure; second whitespace determination means for, when the text is not ignorable whitespace, determining whether a whitespace string appearing in the text is formatting whitespace by comparing the whitespace string and the structure information of its related structure with the information accumulated in the whitespace information storage means; index creation means for creating an index used to search for the structured document containing the text, based on the portion of the text that remains after the whitespace strings determined to be formatting whitespace are removed, and storing the index in the structured document database; and document storage processing means for storing the structured document containing the text in the structured document database.

According to the present invention, only the whitespace characters determined to be formatting whitespace, that is, whitespace used for display formatting, are removed from the text detected by analyzing the structure of a structured document, and the index used to search for the structured document containing that text is created from the result. A word interrupted by formatting whitespace can thus be registered in the index in the form of a word that is meaningful as part of the document.

Embodiments of the present invention are described below with reference to the drawings.
FIG. 1 is a block diagram showing the hardware configuration of a client-server system according to an embodiment of the present invention. The client-server system consists mainly of a database server (database server computer) 10 and a plurality of client terminals, including a client terminal 20. An application (application program) that uses the database server 10 runs on the client terminal 20. The client terminals, including the client terminal 20, are connected to the database server 10 via a network 30 such as a local area network (LAN). Client terminals other than the client terminal 20 are omitted from FIG. 1.

The database server 10 has a memory 11 such as a main memory. The database server 10 is connected to an external storage device 40 such as a hard disk drive. The external storage device 40 stores an XML database management program 41 and an XML database 42.

The XML database management program 41 is used by the database server 10 to manage the XML database 42 and to perform search processing in response to search requests from client terminals. The XML database 42 stores a plurality of XML documents (electronic documents in XML format) that are the targets of searches. In this embodiment, the XML database 42 stores a set of XML documents (XML document set) 421 collected via the network 30. The XML database 42 also stores an index (index information) 422 used to search the XML documents stored in the XML database 42.

In this embodiment, an XML database management system 50 is realized by the database server 10. FIG. 2 is a block diagram showing mainly the functional configuration of the XML database management system 50. The XML database management system 50 includes a document reading unit 51, a syntax analysis unit 52, a whitespace determination unit 53, a temporary whitespace storage unit 54, a whitespace information storage unit 55, an index creation unit 56, and a document storage processing unit 57.

The document reading unit 51 reads an XML document to be stored in the XML database 42. In this embodiment, the document reading unit 51 reads, from among the XML documents published on the network 30, an XML document that matches a predetermined condition. The syntax analysis unit 52 analyzes the XML structure (syntax) of the XML document read by the document reading unit 51.

The whitespace determination unit 53 determines the type of each whitespace character that appears in the structure analyzed by the syntax analysis unit 52. In particular, it determines whether a whitespace character is formatting whitespace, that is, whitespace used for display formatting. The whitespace determination unit 53 includes a first whitespace determination unit 531, a second whitespace determination unit 532, and a decomposition unit 533.

The first whitespace determination unit 531 determines whether whitespace appearing in the structure analyzed by the syntax analysis unit 52 is what is called "ignorable white space", that is, whitespace that carries no meaning as part of the XML document. The first whitespace determination unit 531 confirms whitespace determined to be ignorable white space as formatting whitespace used for display formatting, and stores it in the temporary whitespace storage unit 54 in association with the structure information of the structure that contains that whitespace as element content.

When a start tag appears immediately after a whitespace character (whitespace string) that has been confirmed as formatting whitespace and stored in the temporary whitespace storage unit 54, the first whitespace determination unit 531 treats that whitespace as formatting whitespace related to the next element (structure), which contains the start tag, and accumulates it in the whitespace information storage unit 55 in association with the structure information of that next element.
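As a rough sketch of this first determination, the following Python function walks a list of parse events and records each whitespace-only run that is immediately followed by a start tag under the structure of the element that tag opens. The event tuple format, the function name `collect_formatting_whitespace`, and the use of a path string such as `/特許/名称` as "structure information" are all assumptions made for illustration; the source does not specify these representations.

```python
def collect_formatting_whitespace(events):
    """Map a structure path to the formatting whitespace that precedes it.

    events: list of (kind, value) tuples where kind is one of
    "start", "end", "text", "ignorable_ws" (hypothetical format).
    """
    store = {}      # plays the role of the whitespace information storage
    path = []       # element names from the root to the current element
    pending = None  # plays the role of the temporary whitespace storage

    for kind, value in events:
        if kind == "ignorable_ws":
            pending = value
        elif kind == "start":
            path.append(value)
            if pending is not None:
                # Whitespace followed by a start tag belongs to the next element.
                store["/" + "/".join(path)] = pending
            pending = None
        elif kind == "end":
            path.pop()
            pending = None
        else:  # ordinary text
            pending = None
    return store


EVENTS = [
    ("start", "特許"), ("ignorable_ws", "\n\t"),
    ("start", "名称"), ("text", "情報処理装置"), ("end", "名称"),
    ("ignorable_ws", "\n\t"),
    ("start", "要旨"), ("text", "\n\t\t本文\n\t"), ("end", "要旨"),
    ("ignorable_ws", "\n"), ("end", "特許"),
]
print(collect_formatting_whitespace(EVENTS))
```

Whitespace that is followed by an end tag, such as the final line feed before `</特許>`, is not recorded in this sketch because the passage above only covers the start-tag case.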

When the syntax analysis unit 52 has parsed text, the decomposition unit 533 decomposes that text into whitespace characters (whitespace strings) and text that contains no whitespace characters (whitespace strings).

The second whitespace determination unit 532 determines whether a whitespace character (whitespace string) in the text decomposed by the decomposition unit 533 is formatting whitespace, a candidate for formatting whitespace, or simply whitespace that appears in the text. The determination is based on the position of the whitespace string in the text, its relation to preceding whitespace strings, and its relation to the string length of the formatting whitespace associated with the structure one level above the structure containing the text.

When a whitespace string in the text appears at the head of the text and is longer than the formatting whitespace associated with the structure one level above the structure containing the text, the second whitespace determination unit 532 confirms that whitespace string as formatting whitespace related to the structure containing the text (the relevant structure). In this case, the second whitespace determination unit 532 accumulates the confirmed whitespace string in the whitespace information storage unit 55 as formatting whitespace, in association with the structure information of the relevant structure.

When the first whitespace string to appear in the text is at neither the head nor the tail of the text and is longer than the formatting whitespace associated with the structure one level above the structure containing the text, the second whitespace determination unit 532 judges that whitespace string to be a candidate for formatting whitespace related to the relevant structure. In this case, the second whitespace determination unit 532 accumulates the whitespace string in the whitespace information storage unit 55 as a candidate for formatting whitespace, in association with the structure information of the relevant structure.

When a text contains a whitespace string that has been confirmed as formatting whitespace or judged to be a candidate for formatting whitespace, and an identical whitespace string appears later in the same text, the second whitespace determination unit 532 judges the newly appearing whitespace string to be formatting whitespace or, in the candidate case, confirms the candidate as formatting whitespace.
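The rules applied by the decomposition unit and the second whitespace determination unit can be sketched in Python as follows. This is an illustrative reading of the three paragraphs above, not the patented implementation: the function names, the `parent_ws_len` parameter (the length of the formatting whitespace associated with the structure one level up), and the choice to leave an unconfirmed candidate in the text are assumptions.

```python
import re

XML_WS = " \t\r\n"  # the four XML whitespace characters


def is_ws(token: str) -> bool:
    return bool(token) and all(c in XML_WS for c in token)


def split_text(text: str) -> list[str]:
    """Decompose text into alternating whitespace and non-whitespace runs."""
    return [t for t in re.split(r"([ \t\r\n]+)", text) if t]


def strip_formatting_whitespace(text: str, parent_ws_len: int) -> str:
    tokens = split_text(text)
    confirmed = None      # whitespace string confirmed as formatting whitespace
    candidate = None      # whitespace string that is only a candidate so far
    candidate_idx = None
    seen_ws = False
    drop = set()          # indexes of tokens judged to be formatting whitespace

    for i, tok in enumerate(tokens):
        if not is_ws(tok):
            continue
        at_head = i == 0
        at_tail = i == len(tokens) - 1
        if confirmed is not None and tok == confirmed:
            drop.add(i)                      # repeat of confirmed whitespace
        elif candidate is not None and tok == candidate:
            confirmed, candidate = candidate, None
            drop.update((candidate_idx, i))  # candidate confirmed by repetition
        elif not seen_ws and len(tok) > parent_ws_len:
            if at_head:
                confirmed = tok              # head of text: confirmed at once
                drop.add(i)
            elif not at_tail:
                candidate, candidate_idx = tok, i
        seen_ws = True

    return "".join(t for i, t in enumerate(tokens) if i not in drop)


cleaned = strip_formatting_whitespace("\n\t\t構造情\n\t\t報と原文\n\t", parent_ws_len=2)
print("構造情報" in cleaned)
```

In this sketch a single space inside ordinary prose, such as the one in `"hello world"`, is shorter than the parent's formatting whitespace and is therefore kept.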

The temporary whitespace storage unit 54 is used to temporarily store whitespace strings that the whitespace determination unit 53 has determined to be formatting whitespace, in association with the structure information of the relevant structure. A first specific area in the memory 11 of the database server 10 of FIG. 1, for example, is allocated to the temporary whitespace storage unit 54.

The whitespace information storage unit 55 is used to accumulate whitespace strings that the whitespace determination unit 53 has determined to be formatting whitespace or candidates for formatting whitespace (that is, whitespace strings that carry no meaning as part of the XML document), each in association with structure information representing the structure related to that whitespace string. A second specific area in the memory 11, for example, is allocated to the whitespace information storage unit 55.

The index creation unit 56 creates the index (index records) used to search for the original XML document, based on the XML document whose syntax has been analyzed by the syntax analysis unit 52 and from which the formatting whitespace (the whitespace strings determined to be formatting whitespace) has been removed. The document storage processing unit 57 stores (registers) the original XML document in the XML database 42. In the XML documents stored in the XML database 42, the character strings before and after a whitespace character are treated as continuous.

In this embodiment, the functional units (processing units), namely the document reading unit 51, the syntax analysis unit 52, the whitespace determination unit 53, the index creation unit 56, and the document storage processing unit 57, are realized when the database server 10 of FIG. 1 reads the XML database management program 41 stored in the external storage device 40 into a memory (not shown) in the server 10 and executes it. The program 41 can be stored in advance on a computer-readable storage medium such as a compact disc or ROM and distributed. The program 41 may also be downloaded to the database server 10 via the network 30.

FIG. 3 shows an example of the data structure of the index 422 stored in the XML database 42, in association with the XML documents in the XML document set 421. The index 422 is stored in the XML database 42 using, for example, a B-tree. Within the index 422, the character strings (index strings) are sorted in ascending order and stored divided among multiple areas called pages. A page is the unit in which data is read from and written to the external storage device 40.

In FIG. 3, one of the index pages that make up the index 422 is shown as an index page 422a. The index page 422a consists of a set of records (index records), one per keyword (the character string constituting the keyword), each holding information on where that keyword appears. The ID of the XML document (original XML document) in which the keyword appears is used, for example, as this position information.
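The index records described above, each pairing a keyword with the IDs of the documents in which it appears, can be sketched as a small in-memory structure. This is only an illustration: a plain dictionary with sorted keys stands in for the B-tree and its pages, and the names `build_index` and `lookup` as well as the sample document IDs are hypothetical.

```python
from collections import defaultdict


def build_index(keywords_by_doc: dict[int, list[str]]) -> dict[str, list[int]]:
    """Build keyword -> sorted document IDs, with keywords in ascending order."""
    postings = defaultdict(set)
    for doc_id, keywords in keywords_by_doc.items():
        for keyword in keywords:
            postings[keyword].add(doc_id)
    # Sorting the keys mimics the ascending order of index strings.
    return {kw: sorted(postings[kw]) for kw in sorted(postings)}


def lookup(index: dict[str, list[int]], keyword: str) -> list[int]:
    """Return the IDs of the documents that contain the keyword."""
    return index.get(keyword, [])


index = build_index({1: ["構造情報", "外部記憶装置"], 2: ["構造情報"]})
print(lookup(index, "構造情報"))
```

A real implementation would keep the records on disk pages so that only the pages needed for a lookup are read from the external storage device.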

Next, the syntax analysis processing, including formatting whitespace determination, that the XML database management system 50 of FIG. 2 executes when storing an XML document in the XML database 42 is described with reference to the flowcharts of FIGS. 4A and 4B.

The document reading unit 51 starts document reading (step S1). From among the XML documents published on the network 30 by Web servers and file servers, it reads an XML document that matches a predetermined condition (the target XML document), as an XML document to be stored in the XML database 42, into, for example, the memory 11 of the database server 10 of FIG. 1 (the memory 11 in the XML database management system 50).

FIG. 5 shows an example of an XML document (a first XML document) that the document reading unit 51 has started to read.
As is well known, an XML document (structured document) is composed of units called elements. An element is data given a name that describes its content and enclosed in tags. This name (the element name) is written inside the tags. An XML document can also have a nested structure, in which an element contains other elements.

In the XML document of FIG. 5, the whole document is an element enclosed in <特許> (patent) tags, and the string 「情報処理装置」 (information processing apparatus) enclosed in the <名称> (name) tags is the name of the patent, one of the items that make up the document.

In the XML document of FIG. 5, the start positions of the <名称> and <要旨> (summary) tags (the positions of the start tags) are indented one level by a tab character relative to the start position of the structurally superior <特許> tag, so that a person can read the document easily when, for example, it is displayed on a screen. The sentences inside the <要旨> tags, that is, the text that is the content (value) of the element containing the <要旨> tag (the summary element), are indented one further level relative to the start position of the <要旨> tag by inserting two tab characters. The XML document of FIG. 5 thus contains whitespace characters.

To make the presence of these whitespace characters (line feeds and tabs) easier to see, let 「↓」 denote a line feed and 「ｔ」 denote a single tab character. The XML document of FIG. 5 is then handled by the XML database management system 50 (database server 10) as the following document:
「<特許>↓ｔ<名称>情報処理装置</名称>↓ｔ<要旨>↓ｔｔこの装置は、文書を読み込んで、その構造を解析し、取り出した構造情↓ｔｔ報と原文をともに外部記憶装置に蓄積する。ユーザはキーワードにより↓ｔｔ必要な情報を取り出すことができる。↓ｔ</要旨>↓</特許>」
FIG. 6 shows the XML document of FIG. 5 in this form, with the line feed and tab characters replaced by the symbols 「↓」 and 「ｔ」.

The document reading unit 51 repeats document reading (step S1) for the XML document (target XML document) shown in FIG. 6 (FIG. 5) until it reaches the end of the document and no data remains to be read, that is, until the XML document ends.

Meanwhile, the syntax analysis unit 52 analyzes the structure of the document data read into the memory 11 by the document reading unit 51 (step S3) until the target document ends (step S2).

In general, extracting the words in an XML document to create an index (index records) requires combining adjacent characters and recognizing them as words. In the XML document example above, however, whitespace characters (a line feed 「↓」 and two tab characters 「ｔ」) have been inserted inside the word 「情報」 (information), which is part of 「構造情報」 (structure information), between the characters 「情」 and 「報」. As a result, the word 「構造情報」 (or the word 「情報」), for example, is not extracted for the index. This embodiment therefore makes it possible, as described below, to extract 「構造情報」 for the index from the character string 「構造情↓ｔｔ報」.

The syntax analysis unit 52 is, for example, an XML parser having an XML document analysis function. In this embodiment, the syntax analysis unit 52 takes in and analyzes the XML document read into the memory 11 by the document reading unit 51 using the well-known application interface called SAX (Simple API for XML) 2. With this interface, the syntax analysis unit 52 is notified in units of element start (start tag), element end (end tag), character string (text), and "ignorable white space". Following these notifications, the syntax analysis unit 52 repeats syntax analysis (structure analysis) of the XML document of FIG. 5 (the target XML document), in units of start tag, end tag, text, and "ignorable white space", until the end of the document (steps S2 and S3).

In the example of FIG. 6, the XML document is decomposed into the document data listed below. The tag names <特許>, <名称>, and <要旨> mean "patent", "name", and "abstract" respectively; "↓" denotes a line feed character and "ｔ" a tab character. To the right of each item is a description of the unit to which it corresponds.
(1) <特許>: start tag of the element at the top (first) level of the hierarchical structure
(2) ↓ｔ: content of the <特許> element, but "ignorable white space"
(3) <名称>: start tag of a second-level element
(4) 情報処理装置: text that is the content of the <名称> element
(5) </名称>: end tag indicating the end of the second-level element
(6) ↓ｔ: content of the <特許> element, but "ignorable white space"
(7) <要旨>: start tag of a second-level element
(8) ↓ｔｔこの装置は、文書を読み込んで、その構造を解析し、取り出した構造情↓ｔｔ報と原文をともに外部記憶装置に蓄積する。ユーザはキーワードにより↓ｔｔ必要な情報を取り出すことができる。↓ｔ: text that is the content of the <要旨> element (hereinafter, text T)
(9) </要旨>: end tag indicating the end of the second-level element
(10) ↓: content of the <特許> element, but "ignorable white space"
(11) </特許>: end tag indicating the end of the top-level element
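The decomposition into start tags, end tags, text, and "ignorable white space" can be sketched with Python's standard `xml.sax` module, used here as a stand-in for the SAX2 interface named in the description. The class name `UnitCollector` and the shortened abstract text are illustrative assumptions. A parser without a DTD typically delivers all white space through `characters()`, so this sketch makes the further assumption that character data consisting only of white space is "ignorable white space".

```python
import xml.sax


class UnitCollector(xml.sax.ContentHandler):
    """Collects parse units: start tag, end tag, text, ignorable white space."""

    def __init__(self):
        super().__init__()
        self.units = []    # list of (kind, value) tuples in document order
        self._buffer = []  # character data may arrive in several chunks

    def _flush(self):
        data = "".join(self._buffer)
        self._buffer.clear()
        if not data:
            return
        # Assumption: white-space-only character data is "ignorable white space".
        kind = "ignorable" if data.strip(" \t\r\n") == "" else "text"
        self.units.append((kind, data))

    def startElement(self, name, attrs):
        self._flush()
        self.units.append(("start", name))

    def endElement(self, name):
        self._flush()
        self.units.append(("end", name))

    def characters(self, content):
        self._buffer.append(content)

    def ignorableWhitespace(self, whitespace):
        # Only called by parsers that know the content model (for example via a DTD).
        self._buffer.append(whitespace)


# Shortened version of the example document; "\n" is "↓" and "\t" is "ｔ".
doc = "<特許>\n\t<名称>情報処理装置</名称>\n\t<要旨>\n\t\t本文\n\t</要旨>\n</特許>"
handler = UnitCollector()
xml.sax.parseString(doc.encode("utf-8"), handler)
for kind, value in handler.units:
    print(kind, repr(value))
```

The eleven collected units line up one-to-one with items (1) to (11) above.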

The document storage processing unit 57 stores the document data decomposed by the syntax analysis unit 52 in the XML database 42, as part of the XML document set 421, exactly as it appears in the original text read by the document reading unit 51.

In XML documents, as is well known, the following are defined as blank characters (blank symbols) called "white space": the space (half-width space), the tab, the line feed character (LF, Line Feed), and the carriage return character (CR, Carriage Return). The line feed and carriage return characters are blank characters that represent a line break.

The blank characters (white space) described above are generally inserted between items that must be distinguished, for example to separate an element name from an attribute name inside a tag. However, blank characters are also inserted for formatting, that is, to make tags easier to tell apart or to make the document easier to read. In XML, blank characters of this kind, which are unrelated to the XML structure and are not meaningful text data (data representing the content of an element), are defined as "ignorable white space". "Ignorable white space" is usually text data that consists only of blank characters and is sandwiched between one tag (an end tag) and the next tag (a start tag).

In this embodiment, the blank character determination unit 53 obtains the result of the syntax analysis by the syntax analysis unit 52 and, based on that result, identifies blank characters (in particular, formatting blank characters). The blank character determination unit 53 stores each blank character determined to be a formatting blank character in the blank character information storage unit 55, in association with structure information representing the structure to which that blank character relates.

In this embodiment, the following two kinds of blank characters are treated as formatting blank characters (or as candidates for formatting blank characters).

The first is "ignorable white space". More specifically, when the structure that follows an "ignorable white space" is a start tag, the formatting blank character for the structure represented by that start tag (the structure of the element containing the start tag) is the "ignorable white space" that appeared immediately before it.

The second is a blank character string within text. When a blank character string exists in a text and is longer than the formatting blank character string for the structure one level above the structure representing that text, that blank character string is the formatting blank character for the structure representing the text.

The processing, including the determination of formatting blank characters based on the syntax analysis result from the syntax analysis unit 52, is described below in detail for the case where the XML document of FIG. 6 (the first XML document) is decomposed into the document data (1) to (11) above and analyzed. The description follows the flowcharts of FIGS. 4A and 4B and refers to FIGS. 7 and 8. FIG. 7 shows how the contents of the blank character information storage unit 55 change, and FIG. 8 shows how the contents of the blank character temporary storage unit 54 change. In FIGS. 7 and 8 and in the following description, structure information representing an analyzed structure is written in XPath form. An ID unique to the element (node) of the corresponding structure (a node ID) can also be used as structure information.

(1) <特許> (/特許)
The <特許> tag at the top of the XML document is a start tag indicating the start of the <特許> element identified by the structure information "/特許" (steps S4 and S5). In this case, if a formatting blank character precedes the structure information "/特許", the first blank character determination unit 531 in the blank character determination unit 53 stores the structure information "/特許" in the blank character information storage unit 55 in association with that formatting blank character (step S6). In this embodiment, however, "/特許" is at the beginning of a line and no formatting blank character precedes it. In that case, the first blank character determination unit 531 stores only the structure information "/特許" in the first entry of the blank character information storage unit 55.

FIG. 7(a) shows the contents of the blank character information storage unit 55 at this point. Each entry of the blank character information storage unit 55 holds a flag (flag information) together with the structure information and the formatting blank character (the blank character determined to be one). The flag indicates whether the blank character (blank character string) stored in the entry has been confirmed as a formatting blank character (formatting blank character string) or is still only a candidate. In this embodiment, the flag of an entry that has no formatting blank character, as in FIG. 7(a), is set to the "confirmed" state for convenience.
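An entry of the blank character information storage unit 55 can be pictured as a small record with three fields. The following is a minimal sketch only; the names `BlankCharEntry`, `structure`, `blank_chars`, and `confirmed` are hypothetical and do not appear in the description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlankCharEntry:
    structure: str              # structure information in XPath form
    blank_chars: Optional[str]  # formatting blank character string, or None if absent
    confirmed: bool             # True = confirmed, False = still a candidate


def new_entry(structure: str, blank_chars: Optional[str] = None,
              confirmed: bool = True) -> BlankCharEntry:
    """Build an entry; an entry without blank characters is always marked confirmed."""
    if blank_chars is None:
        confirmed = True  # set to "confirmed" for convenience, as in FIG. 7(a)
    return BlankCharEntry(structure, blank_chars, confirmed)


# State corresponding to FIG. 7(a): only "/特許", with no formatting blank character.
info_store: List[BlankCharEntry] = [new_entry("/特許")]
print(info_store[0])
```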

(2) ↓ｔ
Once the structure information "/特許" has been stored in the blank character information storage unit 55 (step S6), the syntax analysis unit 52 analyzes the structure of the document data following the <特許> tag (steps S2 and S3). That document data is "↓ｔ". This "↓ｔ" is (part of) the content of the <特許> element and is "ignorable white space" (steps S4 and S8).

In this case, the first blank character determination unit 531 determines that "↓ｔ" is a formatting blank character (formatting blank character string) related to the element that appears next. It therefore stores "↓ｔ", as a formatting blank character, in the blank character temporary storage unit 54 in association with "/特許", the structure information of the <特許> element whose content includes "↓ｔ" (that is, the structure information of the hierarchy level currently being processed) (step S9). FIG. 8(a) shows the contents of the blank character temporary storage unit 54 at this point. The determination in step S8 (whether the data is "ignorable white space") may be made by the first blank character determination unit 531, or by a determination unit (an "ignorable white space" determination unit) that is independent of the syntax analysis unit 52 and the first blank character determination unit 531.

(3) <名称> (/特許/名称)
Once the formatting blank character "↓ｔ" has been stored in the blank character temporary storage unit 54 in association with the structure information "/特許" (step S9), the syntax analysis unit 52 analyzes the structure of the document data following "↓ｔ" (steps S2 and S3). That document data is the <名称> tag.

The <名称> tag is a start tag indicating the start of the <名称> element identified by the structure information "/特許/名称" (steps S4 and S5). In this case, by referring to the blank character temporary storage unit 54, the first blank character determination unit 531 recognizes that "/特許/名称" is written immediately after the formatting blank character obtained as the previous document data (here "↓ｔ"), that is, that a formatting blank character ("↓ｔ") precedes "/特許/名称". To record this, it stores "/特許/名称" in the second entry of the blank character information storage unit 55 in association with the preceding formatting blank character "↓ｔ" (step S6). At the same time, it sets the flag of that second entry to the "confirmed" state. FIG. 7(b) shows the contents of the blank character information storage unit 55 at this point.

(4) 情報処理装置 (text)
The document data following the <名称> tag is the text "情報処理装置" (steps S4 and S8). In this case, the decomposition unit 533 in the blank character determination unit 53 decomposes the text, in order from the beginning, into blank character strings and text (text portions) containing no blank character string (steps S10 and S11).

Here, the text "情報処理装置" contains no blank character string (steps S12 and S13). In this case, the second blank character determination unit 532 in the blank character determination unit 53 determines that the whole of "情報処理装置" is meaningful text (step S14). The text determined in step S14 clearly contains no formatting blank character. The second blank character determination unit 532 stores this meaningful text (that is, text containing no blank character string) in a text storage unit (not shown) allocated in the memory 11.

(5) </名称> (end of "/特許/名称")
</名称> is an end tag indicating the end of the <名称> element identified by the second-level structure information "/特許/名称" (steps S4 and S5). In this case, the syntax analysis unit 52 moves the hierarchy level being processed (the target level) up by one (step S7). The structure information of the level being processed thus becomes "/特許". The syntax analysis unit 52 then analyzes the structure of the document data following the </名称> tag (steps S2 and S3).

(6) ↓ｔ
The document data following the </名称> tag is "↓ｔ". This "↓ｔ" is (part of) the content of the <特許> element, which forms the level currently being processed, and is "ignorable white space" (steps S4 and S8). The first blank character determination unit 531 therefore treats this "ignorable white space" as the formatting blank character of the element that appears next, and stores it in the blank character temporary storage unit 54 in association with "/特許", the structure information of the level currently being processed (step S9). The contents of the blank character temporary storage unit 54 at this point are the same as in FIG. 8(a).

(7) <要旨> (/特許/要旨)
The document data following the "↓ｔ" (ignorable white space) determined to be a formatting blank character is the <要旨> tag. The <要旨> tag is a start tag indicating the start of the <要旨> element identified by the structure information "/特許/要旨" (steps S4 and S5).

In this case, by referring to the blank character temporary storage unit 54, the first blank character determination unit 531 recognizes that "/特許/要旨" is written immediately after the formatting blank character obtained as the previous document data (here "↓ｔ"). To record this, it stores "/特許/要旨" in the third entry of the blank character information storage unit 55 in association with the preceding formatting blank character "↓ｔ" (step S6). At the same time, it sets the flag of that third entry to the "confirmed" state. FIG. 7(c) shows the contents of the blank character information storage unit 55 at this point.
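The handling of start tags, end tags, and "ignorable white space" in steps S6, S7, and S9 can be sketched as follows. This is an illustrative model, not the implementation of the first blank character determination unit 531: the class and attribute names are hypothetical, and clearing the temporary store once its content has been consumed or superseded is an assumption the description does not state.

```python
class FirstBlankDeterminer:
    """Hypothetical model of steps S6, S7 and S9."""

    def __init__(self):
        self.path = []          # element names from the root to the current level
        self.temp_store = None  # pending (structure, blank chars), cf. unit 54
        self.info_store = []    # stored entries, cf. unit 55

    def _structure(self):
        return "/" + "/".join(self.path)

    def ignorable_whitespace(self, blanks):
        # Step S9: hold the blanks against the level currently being processed.
        self.temp_store = (self._structure(), blanks)

    def start_tag(self, name):
        # Step S6: attach any pending blanks to the structure that now starts.
        self.path.append(name)
        blanks = None
        if self.temp_store is not None:
            blanks = self.temp_store[1]
            self.temp_store = None  # assumption: consumed once used
        self.info_store.append(
            {"structure": self._structure(), "blanks": blanks, "confirmed": True}
        )

    def end_tag(self, name):
        # Step S7: move the target level up by one.
        self.path.pop()
        self.temp_store = None  # assumption: blanks before an end tag are unused


unit = FirstBlankDeterminer()
unit.start_tag("特許")               # (1)
unit.ignorable_whitespace("\n\t")    # (2)
unit.start_tag("名称")               # (3)
unit.end_tag("名称")                 # (5)
unit.ignorable_whitespace("\n\t")    # (6)
unit.start_tag("要旨")               # (7)
for entry in unit.info_store:
    print(entry)
```

After item (7) the store holds three entries, for "/特許", "/特許/名称", and "/特許/要旨", which corresponds to the state described for FIG. 7(c).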

(8) T (text)
The document data following the <要旨> tag is the text T, that is,
"↓ｔｔこの装置は、文書を読み込んで、その構造を解析し、取り出した構造情↓ｔｔ報と原文をともに外部記憶装置に蓄積する。ユーザはキーワードにより↓ｔｔ必要な情報を取り出すことができる。↓ｔ"
(steps S4 and S8). In this case, the decomposition unit 533 decomposes the text T, in order from the beginning, into blank character strings and text portions containing no blank character string (hereinafter, meaningful text) (steps S10 and S11).

Here, the text T is decomposed into:
8-1) the blank character string "↓ｔｔ"
8-2) the meaningful text "この装置は、文書を読み込んで、その構造を解析し、取り出した構造情" (hereinafter, meaningful text T1)
8-3) the blank character string "↓ｔｔ"
8-4) the meaningful text "報と原文をともに外部記憶装置に蓄積する。ユーザはキーワードにより" (hereinafter, meaningful text T2)
8-5) the blank character string "↓ｔｔ"
8-6) the meaningful text "必要な情報を取り出すことができる。" (hereinafter, meaningful text T3)
8-7) the blank character string "↓ｔ"
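The decomposition of a text into blank character strings and meaningful text (steps S10 and S11) can be sketched with a regular expression split. The function name `decompose` and the labels "blank" and "text" are illustrative assumptions; the character class covers the four XML white space characters (space, tab, CR, LF).

```python
import re

# One or more XML white space characters, captured so that split() keeps them.
BLANK_RUN = re.compile(r"([ \t\r\n]+)")


def decompose(text):
    """Split text, in order, into ("blank", run) and ("text", part) pieces."""
    pieces = []
    for piece in BLANK_RUN.split(text):
        if piece == "":
            continue  # split() yields empty strings at the edges
        kind = "blank" if BLANK_RUN.fullmatch(piece) else "text"
        pieces.append((kind, piece))
    return pieces


# Text T from the example, with "↓" written as "\n" and "ｔ" as "\t".
text_t = (
    "\n\t\tこの装置は、文書を読み込んで、その構造を解析し、取り出した構造情"
    "\n\t\t報と原文をともに外部記憶装置に蓄積する。ユーザはキーワードにより"
    "\n\t\t必要な情報を取り出すことができる。"
    "\n\t"
)
parts = decompose(text_t)
for kind, piece in parts:
    print(kind, repr(piece))
```

For the text T this yields seven pieces that alternate between blank character strings and meaningful text, matching items 8-1) to 8-7).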

The processing of each of the data items 8-1) to 8-7) into which the text T was decomposed is described below in order.

8-1) Blank character string "↓ｔｔ"
This blank character string "↓ｔｔ" appears at the beginning of the text T (step S15). It is also longer than the formatting blank character "↓ｔ" (see FIG. 7(c)) that the third entry of the blank character information storage unit 55 associates with "/特許/要旨", the structure information of the structure one level above "/特許/要旨/text()", which represents the structure containing the text T. In this case, the second blank character determination unit 532 confirms the blank character string "↓ｔｔ" as the formatting blank character of the text T (step S16). In step S16, the second blank character determination unit 532 also stores the confirmed "↓ｔｔ" in the blank character information storage unit 55 in association with the structure information "/特許/要旨/text()". FIG. 7(d) shows the contents of the blank character information storage unit 55 at this point.
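The length comparison used for a blank character string at the beginning of a text can be written as a one-line check. This is a sketch under assumptions: the function name is hypothetical, "length" is taken to be the number of characters, and a missing parent blank character string is treated as length zero. What happens when the check fails is not covered by this passage, so the sketch only reports the result.

```python
def leading_blanks_confirmed(leading_blanks, parent_blanks):
    """True when the leading run is longer than the parent structure's formatting blanks."""
    parent_length = len(parent_blanks) if parent_blanks else 0
    return len(leading_blanks) > parent_length


# "↓ｔｔ" at the head of text T versus "↓ｔ" stored for "/特許/要旨".
info_store = [{"structure": "/特許/要旨", "blanks": "\n\t", "confirmed": True}]
if leading_blanks_confirmed("\n\t\t", info_store[-1]["blanks"]):
    # Step S16: record the run against the text node of the element.
    info_store.append(
        {"structure": "/特許/要旨/text()", "blanks": "\n\t\t", "confirmed": True}
    )
print(info_store[-1])
```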

8-2) Meaningful text T1
T1, which follows the leading blank character string "↓ｔｔ" of the text T, is not a blank character string (step S13). In this case, the second blank character determination unit 532 determines that T1 is meaningful text (step S14). It stores the meaningful text T1 in the text storage unit, for example after the text already stored there ("情報処理装置").

8-3) Blank character string "↓ｔｔ"
The blank character string "↓ｔｔ" following the meaningful text T1 (that is, the second blank character string in the text T) appears neither at the beginning nor at the end of the text T, but in its middle part (steps S13, S15, and S17). When a blank character string appears in the middle part of the text T in this way, the second blank character determination unit 532 refers to the blank character information storage unit 55 and determines whether a formatting blank character (formatting blank character string) is stored in association with "/特許/要旨/text()", the structure information of the structure containing the text T (the corresponding structure) (step S19). At this point, as shown in FIG. 7(d), the fourth entry of the blank character information storage unit 55 holds the leading blank character string "↓ｔｔ" of the text T as a confirmed formatting blank character string, in association with the structure information "/特許/要旨/text()".

When a formatting blank character string for the corresponding structure is stored in the blank character information storage unit 55 (step S19), the second blank character determination unit 532 determines whether the blank character string currently being processed, "↓ｔｔ", matches that formatting blank character string (here "↓ｔｔ") (step S21).

When, as in this embodiment, the blank character string currently being processed, "↓ｔｔ", matches the formatting blank character string stored for the corresponding structure in the blank character information storage unit 55, the second blank character determination unit 532 determines whether that formatting blank character string is still a candidate or has already been confirmed (step S23). Here, the formatting blank character string "↓ｔｔ" of the corresponding structure, that is, the leading blank character string of the text T, has already been confirmed, as shown in FIG. 7(d). In this case, the second blank character determination unit 532 determines that the blank character string currently being processed (the second blank character string in the text T) is, like the first (leading) blank character string "↓ｔｔ" stored in the blank character information storage unit 55, a formatting blank character related to the structure information "/特許/要旨/text()" (step S25).

8-4) Meaningful text T2
T2, which follows the second blank character string "↓ｔｔ" in the text T, is not a blank character string (step S13). In this case, T2 is determined to be meaningful text in the same way as T1 (step S14). The second blank character determination unit 532 stores the meaningful text T2 in the text storage unit, for example so that it follows the text T1 already stored there.

8-5) Blank character string "↓ｔｔ"
The blank character string "↓ｔｔ" following the meaningful text T2, that is, the third blank character string in the text T, appears in the middle part of the text T (steps S13, S15, and S17). In this case, the third blank character string "↓ｔｔ" is processed in the same way as the second blank character string "↓ｔｔ" described above.
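The check applied to a blank character string in the middle of a text (steps S19, S21, S23, and S25) can be sketched as below. Only the path walked through in this passage is modelled: an entry exists, the run matches it, and the entry is already confirmed. The function name is hypothetical, and the other branches return `None` because their handling is not described here.

```python
def classify_middle_run(run, entry):
    """Return "formatting" for the path described in steps S19, S21, S23 and S25."""
    if entry is None or entry["blanks"] is None:
        return None  # step S19 "no": not covered by this passage
    if run != entry["blanks"]:
        return None  # step S21 "no": not covered by this passage
    if not entry["confirmed"]:
        return None  # step S23 "candidate": not covered by this passage
    return "formatting"  # step S25


# Fourth entry of the store, as described for FIG. 7(d).
entry = {"structure": "/特許/要旨/text()", "blanks": "\n\t\t", "confirmed": True}
print(classify_middle_run("\n\t\t", entry))  # second and third runs in text T
```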

8-6) Meaningful text T3
T3, which follows the third blank character string "↓ｔｔ" in the text T, is not a blank character string (step S13). In this case, T3 is determined to be meaningful text in the same way as T1 (step S14). The text T3 is stored in the text storage unit so that it follows the text T2 already stored there.

8-7) Blank character string "↓ｔ"
The blank character string "↓ｔ" following the meaningful text T3, that is, the fourth blank character string in the text T, appears at the end of the text T (steps S13, S15, and S17). In this case, the second blank character determination unit 532 determines that this trailing (fourth) blank character string "↓ｔ" is a formatting blank character string related to the next element, and stores it in the blank character temporary storage unit 54 in association with "/特許/要旨/text()", the structure information of the corresponding structure (step S18). The document data that follows this stored blank character string is the end tag of the <要旨> element, and no next element exists, so the blank character string "↓ｔ" is not used as a formatting blank character string.
This completes the processing of the text T.

(9) </要旨>
When the processing of the text T is complete, the syntax analysis unit 52 analyzes the structure of the document data following the text T (steps S2 and S3). That document data is the </要旨> tag.

The </要旨> tag is an end tag indicating the end of the <要旨> element identified by the structure information "/特許/要旨" (steps S4 and S5). In this case, the syntax analysis unit 52 moves the target level up by one (step S7). The structure information of the level being processed thus becomes "/特許". The syntax analysis unit 52 then analyzes the structure of the document data following the </要旨> tag (steps S2 and S3).

(10) ↓
The document data following the </要旨> tag is "↓". This "↓" is (part of) the content of the <特許> element, which forms the level currently being processed, and is "ignorable white space" (steps S4 and S8). The first blank character determination unit 531 therefore treats this "ignorable white space" as the formatting blank character of the element that appears next, and stores it in the blank character temporary storage unit 54 in association with "/特許", the structure information of the level currently being processed (step S9). FIG. 8(b) shows the contents of the blank character temporary storage unit 54 at this point. The document data that follows the stored "↓" (ignorable white space) is the end tag of the <特許> element, and no next element exists, so this "↓" is not used as a formatting blank character string.

(11) </特許>
Once the "↓" that is the "ignorable white space" following the </要旨> tag has been processed, the syntax analysis unit 52 analyzes the structure of the document data following that "↓" (steps S2 and S3). That document data is the </特許> tag. The </特許> tag is an end tag indicating the end of the top-level <特許> element identified by the structure information "/特許" (steps S4 and S5). In this case, the syntax analysis unit 52 determines that the structure "/特許" has ended (step S2) and ends the analysis of the XML document.

At this point, the text storage unit holds the text "情報処理装置". It also holds T1, T2, and T3 from the text T. In other words, in addition to the text "情報処理装置", the text storage unit holds what remains of the text T after the formatting blank character strings have been removed.

索引作成部５６は、テキスト格納部に格納されている、テキスト「情報処理装置」と、テキストＴから整形用空白文字列が取り除かれた部分（Ｔ１，Ｔ２及びＴ３）とに基づき、つまり意味のあるテキストに基づき、当該意味のあるテキストを含むＸＭＬ文書の検索に用いられる索引（索引レコード）を作成する。 The index creation unit 56 is based on the text “information processing apparatus” stored in the text storage unit and the parts (T1, T2, and T3) from which the formatting blank character string is removed from the text T, that is, Based on a certain text, an index (index record) used to search for an XML document including the meaningful text is created.

索引作成部５６は、作成された索引レコードを索引４２２の一部として、ＸＭＬ文書原文に対応付けてＸＭＬデータベース４２に格納する。この場合、テキストＴに含まれている「構造情↓ｔｔ報」は「構造情報」という単語として認識されるため、「構造情報」をキーワードとする索引レコードを作成することができる。 The index creation unit 56 stores the created index record in the XML database 42 as part of the index 422, in association with the original XML document. Because the formatting blank character string has been removed, the “構造情↓ｔｔ報” contained in the text T is recognized as the single word “構造情報” (“structure information”), so an index record with “構造情報” as its keyword can be created.

次に、図５に示すＸＭＬ文書（第１のＸＭＬ文書）とは異なるＸＭＬ文書（第２のＸＭＬ文書）が文書読込部５１によって読み込まれたものとする。図９は、この第２のＸＭＬ文書の一例を示す。図９に示す第２のＸＭＬ文書が第５に示す第１のＸＭＬ文書と相違する点は、テキストに空白文字が含まれているものの、当該テキストの先頭には空白文字が含まれておらず、且つ当該テキスト内に要素が現れることである。 Next, assume that an XML document (second XML document) different from the XML document shown in FIG. 5 (first XML document) has been read by the document reading unit 51. FIG. 9 shows an example of this second XML document. The second XML document shown in FIG. 9 differs from the first XML document shown in FIG. 5 in that, although its text contains blank characters, the text does not begin with a blank character, and an element appears within the text.

図９に示す第２のＸＭＬ文書は、「↓」で改行文字、「ｔ」で１文字のタブ文字を、「ｓ」で空白文字を示すならば、ＸＭＬデータベース管理システム５０によってこの文書を処理する際に以下のような文書
「<特許>↓ｔ<名称>情報処理装置</名称>↓ｔ<要旨>この装置は、文書を読み込んで、その構造を解析し、取り出した構造情↓ｔｓｓｓｓｓｓ報と原文をともに外部記憶装置に蓄積する。ユーザはキーワードにより↓ｔｓｓｓｓｓｓ必要な情報を取り出すことができる。↓ｔｔ<英文>↓ｔｔｔThe device reads a document and stores both the structure information↓ｔｔｔand texts in an external storage. ↓ｔｔ</英文>↓ｔ</要旨>↓</特許>」
として扱われる。
If “↓” denotes a line feed character, “ｔ” one tab character, and “ｓ” a space character, the XML database management system 50 handles the second XML document shown in FIG. 9 as the following document:
“<特許>↓ｔ<名称>情報処理装置</名称>↓ｔ<要旨>この装置は、文書を読み込んで、その構造を解析し、取り出した構造情↓ｔｓｓｓｓｓｓ報と原文をともに外部記憶装置に蓄積する。ユーザはキーワードにより↓ｔｓｓｓｓｓｓ必要な情報を取り出すことができる。↓ｔｔ<英文>↓ｔｔｔThe device reads a document and stores both the structure information↓ｔｔｔand texts in an external storage. ↓ｔｔ</英文>↓ｔ</要旨>↓</特許>”
図１０は、図９に示すＸＭＬ文書を、当該ＸＭＬ文書に含まれる改行文字及びタブ文字が記号「↓」及び「ｔ」で置き換えられた形式で示す。 FIG. 10 shows the XML document of FIG. 9 in a form in which the line feed and tab characters it contains are replaced with the symbols “↓” and “ｔ”.
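The symbol notation used for FIG. 10 can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `symbolize` is hypothetical, and it assumes that only white space beginning with a line feed (a line break plus the indentation that follows it) is symbolized, so that ordinary spaces inside English sentences stay as they are.

```python
import re

# Symbols used in the description: line feed, tab, space.
_SYMBOLS = {"\n": "↓", "\t": "ｔ", " ": "ｓ"}

def symbolize(xml_text):
    """Replace each line feed and the indentation after it with symbols.

    Assumption: spaces that are not part of such a run are left untouched.
    """
    def to_symbols(match):
        return "".join(_SYMBOLS[ch] for ch in match.group(0))
    return re.sub(r"\n[ \t]*", to_symbols, xml_text)

print(symbolize("<a>\n\t<b>x y</b>\n</a>"))
```

Writing the white space as visible symbols makes it possible to compare blank character strings by eye, which is what the walkthrough of the second XML document relies on.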

図１０のＸＭＬ文書（第２のＸＭＬ文書）の例では、当該ＸＭＬ文書は、以下に整理して示す文書データ
(1) <特許> ：最上位の構造の要素を示す開始タグ
(2) ↓ｔ：<特許>要素の内容であるが、「ignorable white space」である
(3) <名称> ：２段目の構造の要素を示す開始タグ
(4) 情報処理装置：<名称>要素の内容であるテキスト
(5) </名称> ：２段目の構造の要素の終わり示す終了タグ
(6) ↓ｔ：<特許>要素の内容であるが、「ignorable white space」である
(7) <要旨> ：２段目の構造の要素を示す開始タグ
(8) この装置は、文書を読み込んで、その構造を解析し、取り出した構造情↓ｔｓｓｓｓｓｓ報と原文をともに外部記憶装置に蓄積する。ユーザはキーワードにより↓ｔｓｓｓｓｓｓ必要な情報を取り出すことができる。↓ｔｔ
：<要旨>要素の内容であるテキスト（以下、テキストＴａと表現）
(9) <英文> ：３段目の構造の要素を示す開始タグ
(10) ↓ｔｔｔThe device reads a document and stores both the structure information↓ｔｔｔand texts in an external storage. ↓ｔｔ
：<英文>要素の内容であるテキスト（以下、テキストＴｂと表現）
(11) </英文> ：３段目の構造の要素の終わり示す終了タグ
(12) ↓ｔ：<要素>要素の内容であるが、「ignorable white space」である
(13) </要旨> ：２段目の構造の要素の終わり示す終了タグ
(14) ↓ ：<特許>要素の内容であるが、「ignorable white space」である
(15) </特許> ：最上位の階層の要素の終わりを示す終了タグ
に分解される。 In the example of FIG. 10, the XML document (second XML document) is decomposed into the following document data:
(1) <特許>: start tag of the top-level element
(2) ↓ｔ: content of the <特許> element, but “ignorable white space”
(3) <名称>: start tag of a second-level element
(4) 情報処理装置 (“information processing apparatus”): text that is the content of the <名称> element
(5) </名称>: end tag of the second-level element
(6) ↓ｔ: content of the <特許> element, but “ignorable white space”
(7) <要旨>: start tag of a second-level element
(8) この装置は、文書を読み込んで、その構造を解析し、取り出した構造情↓ｔｓｓｓｓｓｓ報と原文をともに外部記憶装置に蓄積する。ユーザはキーワードにより↓ｔｓｓｓｓｓｓ必要な情報を取り出すことができる。↓ｔｔ
: text that is the content of the <要旨> element (hereinafter, text Ta)
(9) <英文>: start tag of a third-level element
(10) ↓ｔｔｔThe device reads a document and stores both the structure information↓ｔｔｔand texts in an external storage. ↓ｔｔ
: text that is the content of the <英文> element (hereinafter, text Tb)
(11) </英文>: end tag of the third-level element
(12) ↓ｔ: content of the <要旨> element, but “ignorable white space”
(13) </要旨>: end tag of the second-level element
(14) ↓: content of the <特許> element, but “ignorable white space”
(15) </特許>: end tag of the top-level element
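The decomposition into document data (1) to (15) can be illustrated with a small Python sketch. It is not the patent's parser: the names `desymbolize` and `split_document` are hypothetical, the document is assumed to contain only simple start and end tags (no attributes, comments, or entities), and a regular expression stands in for the syntax analysis unit 52.

```python
import re

# The second XML document in symbol form, copied from the description.
SYMBOLIZED = (
    "<特許>↓ｔ<名称>情報処理装置</名称>↓ｔ<要旨>この装置は、文書を読み込んで、"
    "その構造を解析し、取り出した構造情↓ｔｓｓｓｓｓｓ報と原文をともに外部記憶装置に"
    "蓄積する。ユーザはキーワードにより↓ｔｓｓｓｓｓｓ必要な情報を取り出すことが"
    "できる。↓ｔｔ<英文>↓ｔｔｔThe device reads a document and stores both the "
    "structure information↓ｔｔｔand texts in an external storage. ↓ｔｔ</英文>"
    "↓ｔ</要旨>↓</特許>"
)

def desymbolize(text):
    """Turn the symbols back into a real line feed, tabs, and spaces."""
    return text.replace("↓", "\n").replace("ｔ", "\t").replace("ｓ", " ")

def split_document(doc):
    """Split a document into (kind, data) pairs in document order."""
    items = []
    for piece in re.split(r"(<[^>]+>)", doc):
        if not piece:
            continue  # re.split yields empty strings at the edges
        if piece.startswith("</"):
            kind = "end_tag"
        elif piece.startswith("<"):
            kind = "start_tag"
        elif piece.strip() == "":
            kind = "ignorable_white_space"
        else:
            kind = "text"
        items.append((kind, piece))
    return items

items = split_document(desymbolize(SYMBOLIZED))
for number, (kind, data) in enumerate(items, start=1):
    print(number, kind, repr(data))
```

Under these assumptions the sketch yields fifteen items whose kinds follow the order of the list above: white space between tags is classified as ignorable white space, while Ta and Tb are classified as text because they contain non-blank characters.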

以下、図１０のＸＭＬ文書（第２のＸＭＬ文書）が上述の文書データ(1)〜(15)に分解して解析される場合について、図４Ａ及び図４Ｂのフローチャートに従い、図１１及び図１２を参照して順次説明する。図１１は空白文字情報蓄積部５５の内容の変化を示す図、図１２は空白文字一時格納部５４の内容の変化を示す図である。 Hereinafter, in the case where the XML document (second XML document) in FIG. 10 is analyzed after being decomposed into the above document data (1) to (15), according to the flowcharts in FIGS. 4A and 4B, FIGS. Will be described sequentially. FIG. 11 is a diagram showing changes in the contents of the blank character information storage unit 55, and FIG. 12 is a diagram showing changes in the contents of the blank character temporary storage unit 54.

図１０のＸＭＬ文書（第２のＸＭＬ文書）における文書データ(1)〜(15)のうち、文書データ(1)〜(7)に対する処理の内容は、図６のＸＭＬ文書（第１のＸＭＬ文書）における文書データ(1)〜(7)に対する処理の内容と同一である。したがって、ここでは図１０のＸＭＬ文書（第２のＸＭＬ文書）における文書データ(8)〜(15)に対する処理について説明する。図１０のＸＭＬ文書（第２のＸＭＬ文書）における文書データ(1)〜(7)に対する処理については、図６のＸＭＬ文書（第１のＸＭＬ文書）における文書データ(1)〜(7)に対する処理に関する説明を参照されたい。 Of the document data (1) to (15) in the XML document (second XML document) of FIG. 10, the content of the processing for the document data (1) to (7) is the XML document (first XML document of FIG. 6). This is the same as the content of the processing for the document data (1) to (7) in (Document). Therefore, here, processing for document data (8) to (15) in the XML document (second XML document) of FIG. 10 will be described. The processing for the document data (1) to (7) in the XML document (second XML document) in FIG. 10 is performed on the document data (1) to (7) in the XML document (first XML document) in FIG. Refer to the explanation on processing.

(8) Ｔａ（テキスト）
<要旨>タグに後続する文書データＴａは、構造情報「/特許/要旨」のテキスト、つまり
「この装置は、文書を読み込んで、その構造を解析し、取り出した構造情↓ｔｓｓｓｓｓｓ報と原文をともに外部記憶装置に蓄積する。ユーザはキーワードにより↓ｔｓｓｓｓｓｓ必要な情報を取り出すことができる。↓ｔｔ」
である（ステップＳ４，Ｓ８）。この場合、分解部５３３は、テキストＴａを先頭から順に、空白文字列と、空白文字列を含まないテキスト部分（意味のあるテキスト）とに分解する（ステップＳ１０，Ｓ１１）。 (8) Ta (text)
The document data Ta following the <要旨> tag is the text of the structure information “/特許/要旨”, that is,
“この装置は、文書を読み込んで、その構造を解析し、取り出した構造情↓ｔｓｓｓｓｓｓ報と原文をともに外部記憶装置に蓄積する。ユーザはキーワードにより↓ｔｓｓｓｓｓｓ必要な情報を取り出すことができる。↓ｔｔ”
(steps S4 and S8). In this case, the decomposition unit 533 decomposes the text Ta, in order from its beginning, into blank character strings and text parts that contain no blank character string (meaningful text) (steps S10 and S11).

ここでは、テキストＴａは、
8-1) 意味のあるテキスト「この装置は、文書を読み込んで、その構造を解析し、取り出した構造情」（以下、意味のあるテキストＴａ１と称する）
8-2) 空白文字列「↓ｔｓｓｓｓｓｓ」
8-3) 意味のあるテキスト「報と原文をともに外部記憶装置に蓄積する。ユーザはキーワードにより」（以下、意味のあるテキストＴａ２と称する）
8-4) 空白文字列「↓ｔｓｓｓｓｓｓ」
8-5) 意味のあるテキスト「必要な情報を取り出すことができる。」（以下、意味のあるテキストＴａ３と称する）
8-6) 空白文字列「↓ｔｔ」
に分解される。 Here, the text Ta is decomposed into:
8-1) Meaningful text “この装置は、文書を読み込んで、その構造を解析し、取り出した構造情” (hereinafter, meaningful text Ta1)
8-2) Blank character string “↓ｔｓｓｓｓｓｓ”
8-3) Meaningful text “報と原文をともに外部記憶装置に蓄積する。ユーザはキーワードにより” (hereinafter, meaningful text Ta2)
8-4) Blank character string “↓ｔｓｓｓｓｓｓ”
8-5) Meaningful text “必要な情報を取り出すことができる。” (hereinafter, meaningful text Ta3)
8-6) Blank character string “↓ｔｔ”
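The decomposition of Ta into items 8-1) to 8-6) can be sketched as follows. The function name `decompose` is hypothetical, and the sketch assumes that a "blank character string" is a line feed followed by any number of tabs and spaces; the description does not define the exact character set used by the decomposition unit 533.

```python
import re

def decompose(text):
    """Split text into alternating meaningful parts and blank strings.

    Assumption: a blank character string is a line feed plus the tabs
    and spaces that immediately follow it.
    """
    return [part for part in re.split(r"(\n[ \t]*)", text) if part]

BLANK = "\n\t" + " " * 6  # written as ↓ｔｓｓｓｓｓｓ in the description
TA = (
    "この装置は、文書を読み込んで、その構造を解析し、取り出した構造情" + BLANK
    + "報と原文をともに外部記憶装置に蓄積する。ユーザはキーワードにより" + BLANK
    + "必要な情報を取り出すことができる。" + "\n\t\t"
)

parts = decompose(TA)
for part in parts:
    print(repr(part))
```

Because the split pattern is wrapped in a capturing group, `re.split` keeps the blank character strings in the result, so the order of meaningful parts and blank strings in the text is preserved.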

以下、テキストＴａ（つまり構造情報「/特許/要旨」のテキスト）が分解された上述の8-1)乃至8-6)の各データに対する処理について、順次説明する。 Hereinafter, processing for each data of the above-described 8-1) to 8-6) in which the text Ta (that is, the text of the structure information “/ patent / abstract”) is decomposed will be sequentially described.

8-1) 意味のあるテキストＴａ１
テキストＴの先頭は意味のあるテキストＴａ１であり、空白文字列ではないため（ステップＳ１３，Ｓ１４）、テキストＴａ１が解析された時点では、次の要素の構造「/特許/要旨/text()」に関連する整形用空白文字は未だ確定されない。テキストＴａ１は、テキスト格納部に格納される。 8-1) Meaningful text Ta1
The text Ta begins with the meaningful text Ta1 rather than with a blank character string (steps S13 and S14). Therefore, at the point when Ta1 has been analyzed, the formatting blank characters related to the structure “/特許/要旨/text()” have not yet been determined. The text Ta1 is stored in the text storage unit.

8-2) 空白文字列「↓ｔｓｓｓｓｓｓ」
意味のあるテキストＴａ１に後続する空白文字列「↓ｔｓｓｓｓｓｓ」は、テキストＴａ１の中間部に現れる（ステップＳ１３，Ｓ１５，Ｓ１７）。この空白文字列「↓ｔｓｓｓｓｓｓ」は、テキストＴａ１において最初（１番目）に現れる空白文字列であり、当該テキストＴａ１に対応する構造情報「/特許/要旨/text()」に関連する整形用空白文字は空白文字情報蓄積部５５に存在しない（ステップＳ１９）。また空白文字列「↓ｔｓｓｓｓｓｓ」は、「/特許/要旨/text()」の１レベル上位の構造を表す構造情報「/特許/要旨」と空白文字情報蓄積部５５内の３番目のエントリで対応付けられている整形用空白文字「↓ｔ」（図７（ｃ）参照）より文字列長が長い。 8-2) Blank character string “↓ tsssss”
The blank character string “↓ｔｓｓｓｓｓｓ” following the meaningful text Ta1 appears in the middle of the text Ta (steps S13, S15, S17). It is the first blank character string to appear in the text Ta, and no formatting blank character related to the structure information “/特許/要旨/text()” corresponding to the text Ta exists yet in the blank character information storage unit 55 (step S19). In addition, “↓ｔｓｓｓｓｓｓ” is longer than the formatting blank character “↓ｔ” (see FIG. 7C) that the third entry of the blank character information storage unit 55 associates with “/特許/要旨”, the structure information one level above “/特許/要旨/text()”.

この場合、第２の空白文字判定部５３２は、空白文字列「↓ｔｓｓｓｓｓｓ」を、構造情報「/特許/要旨/text()」に関連する整形用空白文字の候補として、当該構造情報「/特許/要旨/text()」に対応付けて、図７（ｃ）の状態にある空白文字情報蓄積部５５の４番目のエントリに蓄積する（ステップＳ２０）。 In this case, the second blank character determination unit 532 uses the blank character string “↓ tssssss” as a candidate for blank characters for formatting related to the structure information “/ patent / abstract / text ()”. In association with “patent / abstract / text ()”, it is stored in the fourth entry of the blank character information storage unit 55 in the state of FIG. 7C (step S 20).

図１１（ａ）は、このときの空白文字情報蓄積部５５の内容を示す。図１１（ａ）では、空白文字情報蓄積部５５の４番目のエントリに含まれるフラグが、「候補（整形用空白文字の候補）」を示す状態に設定されている。 FIG. 11A shows the contents of the blank character information storage unit 55 at this time. In FIG. 11A, the flag included in the fourth entry of the blank character information storage unit 55 is set to a state indicating “candidate (candidate blank character for shaping)”.

8-3) 意味のあるテキストＴａ２
テキストＴａ内の１番目の空白文字列「↓ｔｓｓｓｓｓｓ」に後続するＴａ２は空白文字列ではない（ステップＳ１３）。この場合、Ｔａ２は意味のあるテキストであると判定される（ステップＳ１４）。テキストＴａ２は、テキスト格納部に既に格納されているテキストＴａ１に後続するように、当該テキスト格納部に格納される。 8-3) Meaningful text Ta2
Ta2 following the first blank character string “↓ tsssss” in the text Ta is not a blank character string (step S13). In this case, Ta2 is determined to be a meaningful text (step S14). The text Ta2 is stored in the text storage unit so as to follow the text Ta1 already stored in the text storage unit.

8-4) 空白文字列「↓ｔｓｓｓｓｓｓ」
意味のあるテキストＴａ１に後続する空白文字列「↓ｔｓｓｓｓｓｓ」、つまりテキストＴａ内の２番目の空白文字列「↓ｔｓｓｓｓｓｓ」は、当該テキストＴａの中間部に現れる（ステップＳ１３，Ｓ１５，Ｓ１７）。この２番目の空白文字列「↓ｔｓｓｓｓｓｓ」は、空白文字情報蓄積部５５の４番目のエントリに整形用空白文字の候補として蓄積されている１番目の空白文字列「↓ｔｓｓｓｓｓｓ」に一致する（ステップＳ１９、Ｓ２１，Ｓ２３）。この場合、第２の空白文字判定部５３２は、１番目の空白文字列及び２番目の空白文字列（つまり現在処理対象となっている空白文字列）は共に整形用空白文字であるとして、空白文字情報蓄積部５５の４番目のエントリに蓄積されている１番目の空白文字列「↓ｔｓｓｓｓｓｓ」を整形用空白文字列と確定する（ステップＳ２４）。つまり第２の空白文字判定部５３２は、空白文字列「↓ｔｓｓｓｓｓｓ」が蓄積されている空白文字情報蓄積部５５の４番目のエントリに含まれるフラグを、図１１（ａ）における「候補」を示す状態から「確定」を示す状態に変更する。図１１（ｂ）は、このときの空白文字情報蓄積部５５の内容を示す。 8-4) Blank character string “↓ tsssss”
The blank character string “↓ｔｓｓｓｓｓｓ” following the meaningful text Ta2, that is, the second blank character string in the text Ta, also appears in the middle of the text Ta (steps S13, S15, S17). This second blank character string matches the first blank character string “↓ｔｓｓｓｓｓｓ”, which is stored in the fourth entry of the blank character information storage unit 55 as a formatting blank character candidate (steps S19, S21, S23). The second blank character determination unit 532 therefore regards both the first and the second blank character string (the one currently being processed) as formatting blank characters, and confirms the “↓ｔｓｓｓｓｓｓ” stored in the fourth entry as a formatting blank character string (step S24). That is, it changes the flag in the fourth entry from the “candidate” state of FIG. 11A to the “confirmed” state. FIG. 11B shows the contents of the blank character information storage unit 55 at this time.

8-5) 意味のあるテキストＴａ３
テキストＴａ内の２番目の空白文字列「↓ｔｓｓｓｓｓｓ」に後続するＴａ３は空白文字列ではない（ステップＳ１３）。この場合、Ｔａ３は意味のあるテキストであると判定される（ステップＳ１４）。テキストＴａ３は、テキスト格納部に既に格納されているテキストＴａ２に後続するように、当該テキスト格納部に格納される。 8-5) Meaningful text Ta3
Ta3 following the second blank character string “↓ tsssss” in the text Ta is not a blank character string (step S13). In this case, Ta3 is determined to be a meaningful text (step S14). The text Ta3 is stored in the text storage unit so as to follow the text Ta2 already stored in the text storage unit.

8-6) 空白文字列「↓ｔｔ」
意味のあるテキストＴａ３に後続する空白文字列「↓ｔｔ」、つまりテキストＴａ内の３番目の空白文字列「↓ｔｔ」は、当該テキストＴａの末尾に現れる（ステップＳ１３，Ｓ１５，Ｓ１７）。この場合、第２の空白文字判定部５３２は、テキストＴａの末尾（４番目）の空白文字列「↓ｔｔ」を、次の要素に関連する整形用空白文字列であると判定して、当該文字列「↓ｔｔ」を該当構造の構造情報「/特許/要旨/」と対応付けて空白文字一時格納部５４に格納する（ステップＳ１８）。図１２（ａ）は、このときの空白文字一時格納部５４の内容を示す。 8-6) Blank string “↓ tt”
The blank character string “↓ｔｔ” following the meaningful text Ta3, that is, the third blank character string in the text Ta, appears at the end of the text Ta (steps S13, S15, S17). The second blank character determination unit 532 therefore determines that this last (third) blank character string is a formatting blank character string related to the next element, and stores “↓ｔｔ” in the blank character temporary storage unit 54 in association with the structure information “/特許/要旨” of the current structure (step S18). FIG. 12A shows the contents of the blank character temporary storage unit 54 at this time.

以上により、テキストＴａについての処理が終了する。 Thus, the process for the text Ta is completed.
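The decisions made for Ta in steps S13 to S25 can be summarized in a short Python sketch. It is a reading of the walkthrough, not the patented implementation: `Entry`, `classify_text`, and the dictionary `table` (standing in for the blank character information storage unit 55) are hypothetical names, and the branches marked "assumption" cover cases that this part of the description does not spell out.

```python
import re
from dataclasses import dataclass

@dataclass
class Entry:
    blank: str       # blank character string recorded for a structure
    confirmed: bool  # False = "candidate", True = "confirmed"

def classify_text(path, text, table, parent_blank):
    """Return (meaningful_parts, trailing_blank) for one text node."""
    parts = [p for p in re.split(r"(\n[ \t]*)", text) if p]
    key = path + "/text()"
    kept, trailing = [], None
    for i, part in enumerate(parts):
        if not part.startswith("\n"):
            kept.append(part)                    # S14: meaningful text
        elif i == 0:                             # blank at the head (S15)
            if len(part) > len(parent_blank):
                table[key] = Entry(part, True)   # S16: confirmed at once
            else:
                kept.append(part)                # assumption: keep as text
        elif i == len(parts) - 1:                # blank at the tail (S17)
            trailing = part                      # S18: held for next element
        else:                                    # blank in the middle
            entry = table.get(key)
            if entry is None and len(part) > len(parent_blank):
                table[key] = Entry(part, False)  # S20: register a candidate
            elif entry is not None and entry.blank == part:
                entry.confirmed = True           # S24 (S25 if already set)
            else:
                kept.append(part)                # S22; also an assumption
                                                 # when no entry exists yet
    return kept, trailing

BLANK = "\n\t" + " " * 6
TA = (
    "この装置は、文書を読み込んで、その構造を解析し、取り出した構造情" + BLANK
    + "報と原文をともに外部記憶装置に蓄積する。ユーザはキーワードにより" + BLANK
    + "必要な情報を取り出すことができる。" + "\n\t\t"
)

table = {}
kept, trailing = classify_text("/特許/要旨", TA, table, parent_blank="\n\t")
print("".join(kept))
print(table, repr(trailing))
```

With Ta as input, the first middle blank string becomes a candidate, the second identical one confirms it, and the trailing blank string is handed back for the next element, so the three meaningful parts join into text that contains the unbroken word 「構造情報」. The sketch drops a candidate from the kept text even if it is never confirmed; how the actual system treats that case is not stated here.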

(9) <英文>（/特許/要旨/英文）
テキストＴａの末尾の（整形用空白文字列と判定された）空白文字列「↓ｔｔ」に後続する文書データは<英文>タグである。<英文>タグは、構造情報「/特許/要旨/英文」で示される<英文>要素の開始を示す開始タグである（ステップＳ４，Ｓ５）。この場合、第１の空白文字判定部５３１は、空白文字一時格納部５４を参照することにより、構造情報「/特許/要旨/英文」が前回取得された文書データである整形用空白文字列（ここでは「↓ｔｔ」）に後続して記述されていることを認識する。そこで第１の空白文字判定部５３１は、構造情報「/特許/要旨/英文」が、前回取得された整形用空白文字列「↓ｔｔ」に後続して記述されていることを示すために、構造情報「/特許/名称/英文」を先行する整形用空白文字列「↓ｔｔ」と対応付けて空白文字情報蓄積部５５の５番目のエントリに蓄積する（ステップＳ６）。このとき第１の空白文字判定部５３１は、整形用空白文字「↓ｔｔ」が蓄積された空白文字情報蓄積部５５の５番目のエントリのフラグを、「確定」を示す状態に設定する。図１１（ｃ）は、このときの空白文字情報蓄積部５５の内容を示す。 (9) <English> (/ Patent / Abstract / English)
The document data following the blank character string “↓ｔｔ” at the end of the text Ta (which was determined to be a formatting blank character string) is the <英文> tag, the start tag of the <英文> element indicated by the structure information “/特許/要旨/英文” (steps S4 and S5). By referring to the blank character temporary storage unit 54, the first blank character determination unit 531 recognizes that “/特許/要旨/英文” is written immediately after the previously acquired document data, the formatting blank character string “↓ｔｔ”. To record this, it stores the structure information “/特許/要旨/英文” in the fifth entry of the blank character information storage unit 55 in association with the preceding formatting blank character string “↓ｔｔ” (step S6), and sets the flag of that entry to the “confirmed” state. FIG. 11C shows the contents of the blank character information storage unit 55 at this time.
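How a start tag picks up the blank character string held in the temporary storage can be sketched in Python as follows. The function `element_blank_table` is hypothetical: a plain variable `pending` plays the role of the blank character temporary storage unit 54, a dictionary plays the role of the element entries in the blank character information storage unit 55, and the empty string recorded for the root element is an assumption.

```python
import re

def element_blank_table(doc):
    """Map each element path to the blank string preceding its start tag."""
    table = {}    # structure information -> formatting blank string
    stack = []    # names of the currently open elements
    pending = ""  # blank string waiting for the next start tag
    for piece in re.split(r"(<[^>]+>)", doc):
        if not piece:
            continue
        if piece.startswith("</"):
            stack.pop()    # S7: move up one level
            pending = ""   # a blank before an end tag is not used
        elif piece.startswith("<"):
            stack.append(piece[1:-1])
            path = "/" + "/".join(stack)
            table.setdefault(path, pending)  # S6: record as confirmed
            pending = ""
        else:
            # S9 / S18: keep only a blank string at the very end of the
            # content, whether the content is text or ignorable white space.
            match = re.search(r"\n[ \t]*\Z", piece)
            pending = match.group(0) if match else ""
    return table

DOC = (
    "<特許>\n\t<名称>情報処理装置</名称>\n\t<要旨>本文\n\t\t"
    "<英文>\n\t\t\ttext\n\t\t</英文>\n\t</要旨>\n</特許>"
)
for path, blank in element_blank_table(DOC).items():
    print(path, repr(blank))
```

`DOC` is a shortened stand-in for the second XML document with the same tag layout. The sketch records “↓ｔ” for `/特許/名称` and `/特許/要旨` and “↓ｔｔ” for `/特許/要旨/英文`, while the blank strings that precede end tags are discarded, as in items (12) and (14).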

(10) Ｔｂ（テキスト）
<英文>タグに後続する文書データは、テキストＴｂ、つまり
「↓ｔｔｔThe device reads a document and stores both the structure information↓ｔｔｔand texts in an external storage. ↓ｔｔ」
である（ステップＳ４，Ｓ８）。この場合、分解部５３３は、テキストＴｂを先頭から順に、空白文字列と、空白文字列を含まないテキスト部分（意味のあるテキスト）とに分解する（ステップＳ１０，Ｓ１１）。 (10) Tb (text)
The document data following the <英文> tag is the text Tb, that is,
“↓ｔｔｔThe device reads a document and stores both the structure information↓ｔｔｔand texts in an external storage. ↓ｔｔ”
(steps S4 and S8). In this case, the decomposition unit 533 decomposes the text Tb, in order from its beginning, into blank character strings and text parts that contain no blank character string (meaningful text) (steps S10 and S11).

ここでは、テキストＴｂは、
10-1) 空白文字列「↓ｔｔｔ」
10-2) 意味のあるテキスト「The device reads a document and stores both the structure information」（以下、意味のあるテキストＴｂ１と称する）
10-3) 空白文字列「↓ｔｔｔ」
10-4) 意味のあるテキスト「and texts in an external storage.」（以下、意味のあるテキストＴｂ２と称する）
10-5) 空白文字列「↓ｔｔ」
に分解される。 Here, the text Tb is decomposed into:
10-1) Blank character string “↓ｔｔｔ”
10-2) Meaningful text “The device reads a document and stores both the structure information” (hereinafter, meaningful text Tb1)
10-3) Blank character string “↓ｔｔｔ”
10-4) Meaningful text “and texts in an external storage.” (hereinafter, meaningful text Tb2)
10-5) Blank character string “↓ｔｔ”

以下、テキストＴｂ（つまり構造情報「/特許/要旨/英文」のテキスト）が分解された上述の10-1)乃至10-5)の各データに対する処理について、順次説明する。 Hereinafter, processing for each of the above data 10-1) to 10-5) in which the text Tb (that is, the text of the structure information “/ patent / abstract / English”) is decomposed will be sequentially described.

10-1) 空白文字列「↓ｔｔｔ」
この空白文字列「↓ｔｔｔ」は、テキストＴｂの先頭に現れる（ステップＳ１５）。また、この空白文字列「↓ｔｔｔ」は、テキストＴｂを含む構造を表す構造情報「/特許/要旨/英文/text()」の１レベル上位の構造を表す構造情報「/特許/要旨/英文」と空白文字情報蓄積部５５内の５番目のエントリで対応付けられている整形用空白文字「↓ｔｔ」（図１１（ｃ）参照）よりも文字列長が長い。 10-1) Blank character string “↓ ttt”
This blank character string “↓ｔｔｔ” appears at the beginning of the text Tb (step S15). It is longer than the formatting blank character “↓ｔｔ” (see FIG. 11C) that the fifth entry of the blank character information storage unit 55 associates with “/特許/要旨/英文”, the structure information one level above “/特許/要旨/英文/text()”, which represents the structure containing the text Tb.

この場合、第２の空白文字判定部５３２は、空白文字列「↓ｔｔｔ」をテキストＴｂの整形用空白文字であると確定する（ステップＳ１６）。このステップＳ１６において第２の空白文字判定部５３２は、整形用空白文字と確定された「↓ｔｔｔ」を、テキストＴｂを含む構造を表す構造情報「/特許/要旨/英文/text()」に対応付けて、空白文字情報蓄積部５５の６番目のエントリに蓄積する。図１１（ｄ）は、このときの空白文字情報蓄積部５５の内容を示す。 In this case, the second blank character determination unit 532 determines that the blank character string “↓ ttt” is a blank character for formatting the text Tb (step S16). In step S16, the second blank character determination unit 532 converts “↓ ttt” determined as the formatting blank character into the structure information “/ patent / abstract / English / text ()” representing the structure including the text Tb. Correspondingly, it is stored in the sixth entry of the blank character information storage unit 55. FIG. 11D shows the contents of the blank character information storage unit 55 at this time.

10-2) 意味のあるテキストＴｂ１
テキストＴｂの先頭の空白文字列「↓ｔｔｔ」に後続するＴｂ１は空白文字列ではない（ステップＳ１３）。この場合、第２の空白文字判定部５３２は、Ｔｂ１を意味のあるテキストであると判定する（ステップＳ１４）。テキストＴｂ１はテキスト格納部に格納される。 10-2) Meaningful text Tb1
Tb1 following the first blank character string “↓ ttt” of the text Tb is not a blank character string (step S13). In this case, the second blank character determination unit 532 determines that Tb1 is a meaningful text (step S14). The text Tb1 is stored in the text storage unit.

10-3) 空白文字列「↓ｔｔｔ」
意味のあるテキストＴｂ１に後続する空白文字列「↓ｔｔｔ」、つまりテキストＴｂ内の２番目の空白文字列「↓ｔｔｔ」は、当該テキストＴｂの中間部に現れる（ステップＳ１３，Ｓ１５，Ｓ１７）。この２番目の空白文字列「↓ｔｔｔ」は、空白文字情報蓄積部５５の６番目のエントリに整形用空白文字列として確定されて蓄積されている１番目の空白文字列「↓ｔｔｔ」（図１１（ｄ）参照）に一致する（ステップＳ１９，Ｓ２１，Ｓ２３）。この場合、第２の空白文字判定部５３２は、現在処理対象となっている２番目の空白文字列「↓ｔｔｔ」が、空白文字情報蓄積部５５に蓄積されている、テキストＴｂ内の１番目の空白文字列「↓ｔｔｔ」と同様に、構造情報「/特許/要旨/英文/text()」に関連する整形用空白文字列であると判定する（ステップＳ２５）。 10-3) Blank character string “↓ｔｔｔ”
The blank character string “↓ｔｔｔ” following the meaningful text Tb1, that is, the second blank character string in the text Tb, appears in the middle of the text Tb (steps S13, S15, S17). This second blank character string matches the first blank character string “↓ｔｔｔ”, which is stored in the sixth entry of the blank character information storage unit 55 as a confirmed formatting blank character string (see FIG. 11D) (steps S19, S21, S23). The second blank character determination unit 532 therefore determines that the second blank character string currently being processed is, like the first one, a formatting blank character string related to the structure information “/特許/要旨/英文/text()” (step S25).

10-4) 意味のあるテキストＴｂ２
テキストＴｂ内の２番目の空白文字列「↓ｔｔｔ」に後続するＴｂ２は空白文字列ではない（ステップＳ１３）。この場合、第２の空白文字判定部５３２は、Ｔｂ２を意味のあるテキストであると判定する（ステップＳ１４）。テキストＴｂ２は、テキスト格納部に既に格納されているテキストＴｂ１に後続するように、当該テキスト格納部に格納される。 10-4) Meaningful text Tb2
Tb2 following the second blank character string “↓ ttt” in the text Tb is not a blank character string (step S13). In this case, the second blank character determination unit 532 determines that Tb2 is a meaningful text (step S14). The text Tb2 is stored in the text storage unit so as to follow the text Tb1 already stored in the text storage unit.

10-5) 空白文字列「↓ｔｔ」
意味のあるテキストＴｂ２に後続する空白文字列「↓ｔｔ」、つまりテキストＴｂ内の３番目の空白文字列「↓ｔｔ」は、当該テキストＴｂの末尾に現れる（ステップＳ１３，Ｓ１５，Ｓ１７）。 10-5) Blank character string “↓ tt”
The blank character string “↓ tt” following the meaningful text Tb2, that is, the third blank character string “↓ tt” in the text Tb appears at the end of the text Tb (steps S13, S15, S17).

この場合、第２の空白文字判定部５３２は、テキストＴｂの末尾（３番目）の空白文字列「↓ｔｔ」を、次の要素に関連する整形用空白文字列であると判定して、当該文字列「↓ｔｔ」を該当構造の構造情報「/特許/要旨/英文」と対応付けて空白文字一時格納部５４に格納する（ステップＳ１８）。空白文字一時格納部５４に格納された空白文字列「↓ｔｔ」（テキストＴｂの末尾の空白文字列）の次の文書データは<英文>要素の終了タグであり、次の要素は存在しないため、当該空白文字列「↓ｔｔ」は整形用空白文字列として使用されない。
以上により、テキストＴｂについての処理が終了する。 In this case, the second blank character determination unit 532 determines that the last (third) blank character string “↓ tt” of the text Tb is a formatting blank character string related to the next element, and The character string “↓ tt” is stored in the blank character temporary storage unit 54 in association with the structure information “/ patent / abstract / English” of the corresponding structure (step S18). The document data next to the blank character string “↓ tt” (blank character string at the end of the text Tb) stored in the blank character temporary storage unit 54 is an end tag of the <English> element, and the next element does not exist. The blank character string “↓ tt” is not used as a formatting blank character string.
Thus, the process for the text Tb ends.

(11) </英文>
構文解析部５２は、テキストＴｂについての処理が終了すると、当該テキストＴｂに後続する文書データの構造解析を行う（ステップＳ２，Ｓ３）。テキストＴｂに後続する文書データは</英文>タグである。 (11) </ English>
When the process for the text Tb is completed, the syntax analysis unit 52 performs a structure analysis of the document data subsequent to the text Tb (steps S2 and S3). The document data following the text Tb is a </ English> tag.

</英文>タグは、構造情報「/特許/要旨/英文」で示される<英文>要素の終了を示す終了タグである（ステップＳ４，Ｓ５）。この場合、構文解析部５２は、対象階層を１レベル上げる（ステップＳ７）。これにより処理対象となる階層の構造情報は「/特許/要旨」となる。 The </ English> tag is an end tag indicating the end of the <English> element indicated by the structure information “/ patent / abstract / English” (steps S4 and S5). In this case, the syntax analysis unit 52 raises the target hierarchy by one level (step S7). As a result, the structure information of the hierarchy to be processed becomes “/ patent / abstract”.

(12) ↓ｔ
</英文>タグに後続する文書データは「↓ｔ」である。この「↓ｔ」は、現在の処理対象階層をなす<要旨>要素の内容（の一部）であり、「ignorable white space」である（ステップＳ４，Ｓ８）。 (12) ↓ t
The document data following the </ English sentence> tag is “↓ t”. This “↓ t” is the content (part of) of the <summary> element constituting the current processing target hierarchy, and is “ignorable white space” (steps S4 and S8).

そこで第１の空白文字判定部５３１は、この「ignorable white space」である「↓ｔ」を次に出現する要素の整形用空白文字であるとして、当該「↓ｔ」を現在の処理対象階層の構造情報「/特許/要旨」に対応付けて空白文字一時格納部５４に格納する（ステップＳ９）。図１２（ｂ）は、このときの空白文字一時格納部５４の内容を示す。空白文字一時格納部５４に格納された空白文字列「↓ｔ」の次の文書データは<要素>要素の終了タグであるため、当該空白文字列「↓ｔ」は整形用空白文字列として使用されない。 The first blank character determination unit 531 therefore treats this “↓ｔ”, which is “ignorable white space”, as the formatting blank character of the element that appears next, and stores it in the blank character temporary storage unit 54 in association with the structure information “/特許/要旨” of the current processing target hierarchy (step S9). FIG. 12B shows the contents of the blank character temporary storage unit 54 at this time. Because the document data following the stored blank character string “↓ｔ” is the end tag of the <要旨> element, this “↓ｔ” is not used as a formatting blank character string.

(13) </要旨>
</要旨>タグは、構造情報「/特許/要旨」で示される<要旨>要素の終了を示す終了タグである（ステップＳ４，Ｓ５）。この場合、構文解析部５２は、対象階層を１レベル上げる（ステップＳ７）。これにより処理対象となる階層の構造情報は「/特許」となる。 (13) </ Summary>
The </ summary> tag is an end tag indicating the end of the <summary> element indicated by the structure information "/ patent / summary" (steps S4 and S5). In this case, the syntax analysis unit 52 raises the target hierarchy by one level (step S7). As a result, the structure information of the hierarchy to be processed becomes “/ patent”.

(14) ↓
</要旨>タグに後続する文書データは「↓」である。この「↓」は、現在の処理対象階層をなす<特許>要素の内容（の一部）であり、「ignorable white space」である（ステップＳ４，Ｓ８）。この場合、「↓」は、現在の処理対象階層の構造情報「/特許」に対応付けて空白文字一時格納部５４に格納される（ステップＳ９）。図１２（ｃ）は、このときの空白文字一時格納部５４の内容を示す。空白文字一時格納部５４に格納された「↓」（ignorable white space）の次の文書データは<特許>要素の終了タグであり、次の要素は存在しないため、当該「↓」は整形用空白文字列として使用されない。 (14) ↓
</ Summary> The document data following the tag is “↓”. This “↓” is the content of (part of) the <patent> element constituting the current processing target hierarchy, and is “ignorable white space” (steps S4 and S8). In this case, “↓” is stored in the blank character temporary storage 54 in association with the structure information “/ patent” of the current processing target hierarchy (step S9). FIG. 12C shows the contents of the blank character temporary storage unit 54 at this time. Since the document data next to “↓” (ignorable white space) stored in the temporary space character storage unit 54 is an end tag of the <patent> element, and the next element does not exist, the “↓” is a formatting blank. Not used as a string.

(15) </特許>
「↓」（ignorable white space）に後続する文書データは</特許>タグである。</特許>タグは、構造情報「/特許」で示される最上位の階層の<特許>要素の終了を示す終了タグである（ステップＳ４，Ｓ５）。この場合、構文解析部５２は構造情報「/特許」の終了を判定し（ステップＳ２）、ＸＭＬ文書の解析を終了する。 (15) </ patent>
Document data following “↓” (ignorable white space) is a </ patent> tag. The </ patent> tag is an end tag indicating the end of the <patent> element in the highest hierarchy indicated by the structure information “/ patent” (steps S4 and S5). In this case, the syntax analysis unit 52 determines the end of the structure information “/ patent” (step S2), and ends the analysis of the XML document.

この時点においてテキスト格納部には、テキスト「情報処理装置」が格納されている。テキスト格納部には更に、テキストＴａに関しては、Ｔａ１，Ｔａ２及びＴａ３が格納され、テキストＴｂに関しては、Ｔｂ１及びＴｂ２が格納されている。即ちテキスト格納部には、テキスト「情報処理装置」に加えて、テキストＴａ及びＴｂから整形用空白文字列が取り除かれた部分が格納されている。 At this time, the text “information processing apparatus” is stored in the text storage unit. The text storage unit further stores Ta1, Ta2, and Ta3 for the text Ta, and Tb1 and Tb2 for the text Tb. That is, in the text storage unit, in addition to the text “information processing apparatus”, a part obtained by removing the formatting blank character string from the texts Ta and Tb is stored.

索引作成部５６は、テキスト格納部に格納されている、テキスト「情報処理装置」と、テキストＴａ及びＴｂから整形用空白文字列が取り除かれた部分とに基づき、つまり意味のあるテキストに基づき、当該意味のあるテキストを含むＸＭＬ文書の検索に用いられる索引（索引レコード）を作成する。そして索引作成部５６は、作成された索引レコードを索引４２２の一部として、ＸＭＬ文書原文に対応付けてＸＭＬデータベース４２に格納する。 The index creation unit 56 is based on the text “information processing device” stored in the text storage unit and the portion obtained by removing the formatting blank character string from the texts Ta and Tb, that is, based on the meaningful text. An index (index record) used to search an XML document including the meaningful text is created. Then, the index creating unit 56 stores the created index record in the XML database 42 as a part of the index 422 in association with the XML document original.

なお、上述のＸＭＬ文書の例では、テキストの中間部に含まれる空白文字列が全て整形用空白文字列であると判定されている。もし、テキストの中間部に含まれる空白文字列が、空白文字情報蓄積部５５に蓄積されている該当構造の整形用空白文字列と異なる場合には（ステップＳ１９，Ｓ２１）、上記中間部に含まれる空白文字列は整形用空白文字列ではなくてテキスト（有効なテキスト）であると判定される（ステップＳ２２）。この場合、判定された空白文字列は、索引作成部５６による索引作成の対象となる。 In the XML document examples described above, every blank character string in the middle of a text is determined to be a formatting blank character string. If, however, a blank character string in the middle of a text differs from the formatting blank character string stored for the corresponding structure in the blank character information storage unit 55 (steps S19 and S21), that blank character string is determined to be text (valid text) rather than a formatting blank character string (step S22). In that case, the blank character string is included in the text that the index creation unit 56 indexes.

図１０（図９）に示すＸＭＬ文書（第２のＸＭＬ文書）の例では、説明の簡略化のために、英文のテキストＴａ１及びＴａ２に含まれる空白文字（半角スペース）は有効なテキストの一部として処理されている。しかし、テキストＴａ１及びＴａ２に含まれる半角スペースを空白文字として処理しても構わない。このようにしても、テキストＴａ１及びＴａ２に含まれる半角スペース（空白文字）はテキスト（有効なテキスト）として判定されるため（ステップＳ１９，Ｓ２１，Ｓ２２）、上記の例と同一となる。 In the example of the XML document (second XML document) shown in FIG. 10 (FIG. 9), the blank characters (half-width spaces) contained in the English texts Tb1 and Tb2 are, for simplicity of explanation, processed as part of the valid text. However, these half-width spaces may instead be processed as blank characters. Even then, they are determined to be text (valid text) (steps S19, S21, and S22), so the result is the same as in the example above.

上記実施形態では、構文解析部５２による文書解析処理後に、索引作成部５６による索引作成処理が行われる。しかし、構文解析部５２による文書解析処理と並行して索引作成部５６による索引作成処理が行われても良い。 In the above embodiment, after the document analysis processing by the syntax analysis unit 52, the index creation processing by the index creation unit 56 is performed. However, the index creation processing by the index creation unit 56 may be performed in parallel with the document analysis processing by the syntax analysis unit 52.

Next, an XML document search using the index 422, which contains the index records created as described above, will be briefly described with reference to the flowchart of FIG. 13.
Assume that a user operates the client terminal 20 and a search request is thereby sent from the client terminal 20 to the XML database management system 50 (database server 10) via the network 30. Here, assume that an XML document search using the word "構造情報" ("structure information") as the keyword is requested.

A search unit (not shown) in the XML database management system 50 searches the index 422 (the index pages it contains) for the keyword (search keyword) "構造情報" ("structure information") specified in the search request from the client terminal 20 (step S31). That is, the search unit retrieves all index records that contain "構造情報" as a keyword. Based on the occurrence position information contained in each retrieved index record, the search unit then extracts all XML documents (their original text) that contain "構造情報" (step S32). As a result, the XML document shown in FIG. 6, for example, is also extracted as an XML document containing "構造情報", even though a line feed character "↓" and two tab characters "t" are inserted between the character string "構造情" and the character "報".
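As a rough illustration of steps S31 and S32, the following Python sketch looks up a keyword in a list of index records and returns the original text of the matching documents. The record layout (keyword, document identifier, position) and the names `IndexRecord` and `search_documents` are assumptions made for this sketch; the actual layout of the index 422 and its index pages is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexRecord:
    keyword: str   # word extracted after formatting blanks were removed
    doc_id: int    # hypothetical identifier of the stored XML document
    position: int  # hypothetical occurrence position within the document


def search_documents(keyword: str, index: list[IndexRecord],
                     documents: dict[int, str]) -> list[str]:
    # Step S31: collect every index record whose keyword matches.
    hits = [rec for rec in index if rec.keyword == keyword]
    # Step S32: fetch the original text of each document that was hit.
    doc_ids = sorted({rec.doc_id for rec in hits})
    return [documents[doc_id] for doc_id in doc_ids]


# Document 1 stores the keyword split by a line feed and two tabs, yet its
# index record holds the joined word, so the search still finds it.
documents = {1: "<要旨>構造情\n\t\t報</要旨>", 2: "<要旨>索引</要旨>"}
index = [IndexRecord("構造情報", 1, 0), IndexRecord("索引", 2, 0)]
print(search_documents("構造情報", index, documents) == [documents[1]])  # True
```

The point of the sketch is that the match is made against the index record, not against the raw text, so formatting blanks in the stored original do not prevent a hit.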

[Modification]
Next, a modification of the above embodiment will be described. The feature of this modification is that, when schema information (document type definition information) called a document type definition (DTD), which defines (describes) the structure of an XML document, is attached to the XML document (its entity), the syntax analysis unit 52 analyzes the structure of the XML document in advance based on that schema information rather than on the XML document (its entity) itself.

FIG. 14 shows an example of a DTD (document type definition information) that defines the structure of the XML document of FIG. 6 (FIG. 5). The first line of the DTD,
<!ELEMENT 特許 (名称, 要旨)>
indicates that the <名称> (name) element and the <要旨> (abstract) element must appear, in this order, within the <特許> (patent) element.

The second line of the DTD,
<!ELEMENT 名称 (#PCDATA)>
indicates that character data (#PCDATA) appears in the <名称> (name) element.

Similarly, the third line of the DTD,
<!ELEMENT 要旨 (#PCDATA)>
indicates that character data (#PCDATA) appears in the <要旨> (abstract) element.

By analyzing this DTD, the syntax analysis unit 52 can recognize in advance that the <特許> (patent) tag forms the top-level (first-level) structure and that the <名称> (name) and <要旨> (abstract) tags form the second-level structure. The <名称> and <要旨> tags are the tags of the name element and the abstract element, respectively, in which character data appears. In other words, by analyzing the DTD attached to an XML document (its entity), the syntax analysis unit 52 can recognize in advance the elements in which character data appears (the <名称> and <要旨> elements) and their structure (depth in the hierarchy).
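The kind of information obtainable from such a DTD can be sketched in Python as follows. This is a simplified illustration, not the analysis performed by the syntax analysis unit 52: it assumes content models are plain comma-separated sequences, as in the three declarations above, and it ignores attributes, choice groups, and occurrence indicators. The function name `analyze_dtd` is hypothetical.

```python
import re

# Matches declarations of the simple form <!ELEMENT name (content)>.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(\S+)\s*\(([^)]*)\)\s*>")


def analyze_dtd(dtd: str) -> dict[str, int]:
    """Return {element name: depth} for elements that hold #PCDATA."""
    children: dict[str, list[str]] = {}
    for name, content in ELEMENT_DECL.findall(dtd):
        children[name] = [c.strip() for c in content.split(",") if c.strip()]

    # An element that is never listed as a child is treated as a root.
    listed = {c for kids in children.values() for c in kids}
    roots = [name for name in children if name not in listed]

    depths: dict[str, int] = {}

    def walk(name: str, depth: int, path: tuple[str, ...]) -> None:
        if name in path:  # guard against recursive declarations
            return
        for child in children.get(name, []):
            if child == "#PCDATA":
                depths[name] = depth
            else:
                walk(child, depth + 1, path + (name,))

    for root in roots:
        walk(root, 1, ())
    return depths


dtd = """<!ELEMENT 特許 (名称, 要旨)>
<!ELEMENT 名称 (#PCDATA)>
<!ELEMENT 要旨 (#PCDATA)>"""
print(analyze_dtd(dtd) == {"名称": 2, "要旨": 2})  # True
```

For the example DTD, the sketch reports that <名称> and <要旨> carry character data at the second level, which matches the structure described above.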

Therefore, when the XML document (its entity) shown in FIG. 6 (FIG. 5), read by the document reading unit 51, is analyzed on the premise of the recognized structure, the syntax analysis unit 52 need only analyze the text nodes contained in the <名称> (name) and <要旨> (abstract) elements, in which character data appears; that is, only the character string following the <名称> tag and the character string following the <要旨> tag.
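The effect of restricting analysis to the recognized elements can be illustrated with Python's standard `xml.etree.ElementTree` module. This is only a sketch: the embodiment uses its own syntax analysis unit rather than this library, the sample document is a hypothetical stand-in for the one in FIG. 6, and the function name `texts_to_analyze` is an assumption.

```python
import xml.etree.ElementTree as ET


def texts_to_analyze(xml_text: str,
                     pcdata_elements: set[str]) -> list[tuple[str, str]]:
    """Return (tag, text) pairs only for elements known to hold text."""
    root = ET.fromstring(xml_text)
    return [(el.tag, el.text or "")
            for el in root.iter()
            if el.tag in pcdata_elements]


# Hypothetical sample: indentation between elements is formatting only.
sample = "<特許>\n\t<名称>構造化文書</名称>\n\t<要旨>構造情\n\t\t報</要旨>\n</特許>"
for tag, text in texts_to_analyze(sample, {"名称", "要旨"}):
    print(tag, repr(text))
```

In this sketch the blank character strings that sit directly inside <特許>, between the child elements, are never returned, so only the text of <名称> and <要旨> would go on to blank character determination.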

The present invention is not limited to the above embodiment or its modification as they stand; at the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. For example, although the above embodiment and its modification have been described using an XML document as an example of a structured document, the invention is not limited to this. The present invention can equally be applied to structured documents other than XML documents, such as SGML (Standard Generalized Markup Language) documents.

In addition, various inventions can be formed by appropriately combining the constituent elements disclosed in the above embodiment or its modification. For example, some constituent elements may be deleted from all the constituent elements shown in the embodiment or its modification.

A block diagram showing the hardware configuration of a client-server system according to an embodiment of the present invention.
A block diagram mainly showing the functional configuration of the XML database management system shown in FIG. 1.
A diagram showing an example of the data structure of the index stored in the XML database, in association with an XML document.
A diagram showing part of a flowchart of the syntax analysis procedure, including formatting blank character determination, applied in the embodiment.
A diagram showing the remainder of the flowchart of the syntax analysis procedure, including formatting blank character determination, applied in the embodiment.
A diagram showing an example of an XML document read by the document reading unit.
A diagram showing the XML document of FIG. 5 in a format in which the line feed characters and tab characters it contains are replaced by the symbols "↓" and "t".
A diagram showing changes in the contents of the blank character information storage unit during syntax analysis of the XML document of FIG. 6.
A diagram showing changes in the contents of the blank character temporary storage unit during syntax analysis of the XML document of FIG. 6.
A diagram showing another example of an XML document read by the document reading unit.
A diagram showing the XML document of FIG. 9 in a format in which the line feed characters and tab characters it contains are replaced by the symbols "↓" and "t".
A diagram showing changes in the contents of the blank character information storage unit during syntax analysis of the XML document of FIG. 10.
A diagram showing changes in the contents of the blank character temporary storage unit during syntax analysis of the XML document of FIG. 10.
A flowchart showing the procedure of an XML document search using the index, applied in the embodiment.
A diagram showing an example of the document type definition information applied in the modification of the embodiment.

Explanation of symbols

10: database server, 11: memory, 20: client terminal, 30: network, 40: external storage device, 41: XML database management program, 42: XML database, 50: XML database management system, 51: document reading unit, 52: syntax analysis unit (ignorable blank character determination means), 53: blank character determination unit, 54: blank character temporary storage unit, 55: blank character information storage unit, 56: index creation unit, 57: document storage processing unit, 421: XML document set, 422: index, 531: first blank character determination unit, 532: second blank character determination unit, 533: decomposition unit.

Claims

A structured document database management system comprising:
syntax analysis means for analyzing the structure of a structured document that is to be stored in a structured document database and in which a hierarchical structure is expressed using tags, and for detecting text that appears after a tag;
blank character information storage means for storing a blank character string that appears, as a formatting blank character, in the structured document analyzed by the syntax analysis means, in association with structure information representing the structure to which the blank character string relates;
ignorable blank character determination means for determining whether the text detected by the syntax analysis means is an ignorable blank character consisting only of blank characters;
first blank character determination means for, when the text is the ignorable blank character and the ignorable blank character is followed by a start tag, confirming that the ignorable blank character is the formatting blank character relating to the structure of the next element containing the start tag, and storing it in the blank character information storage means in association with structure information representing that structure;
decomposition means for, when the text is not the ignorable blank character, decomposing the text into blank character strings and text portions that contain no blank character string;
second blank character determination means for determining at least whether a blank character string appears at the head of the text decomposed by the decomposition means and, if one appears, confirming that the blank character string is the formatting blank character and storing it in the blank character information storage means in association with structure information representing the structure to which the blank character string relates;
index creation means for creating an index used to search for the structured document containing the text, based on the text portion obtained by removing from the text the formatting blank character and the blank character string so confirmed, and for storing the index in the structured document database; and
document storage processing means for storing the structured document containing the text in the structured document database.

The structured document database management system according to claim 1, wherein the second blank character determination means determines that the blank character string is the formatting blank character when a blank character string appears at the head of the text decomposed by the decomposition means and the character string length of that blank character string is longer than the character string length of the blank character string stored in the blank character information storage means in association with the structure information of the level one above the structure containing the text.

The structured document database management system according to claim 1, wherein each blank character string stored in the blank character information storage means is associated with flag information indicating whether the blank character string has been confirmed as the formatting blank character or is a candidate for the formatting blank character, and
the second blank character determination means, when storing in the blank character information storage means a blank character string confirmed to be the formatting blank character, associates flag information indicating confirmation with the blank character string; and, when a blank character string appears in the middle part of the text decomposed by the decomposition means, determines whether the blank character string is stored in the blank character information storage means in association with the structure information relating to the blank character string, and, at least, if it is not stored, stores the blank character string in the blank character information storage means in association with the structure information of the structure relating to the blank character string and flag information indicating a candidate, and, if it is stored in association with flag information indicating a candidate, changes the flag information to a state indicating confirmation.

The structured document database management system according to claim 3, wherein the second blank character determination means stores the blank character string in the blank character information storage means, in association with the structure information of the structure relating to the blank character string and flag information indicating a candidate, when the blank character string appearing in the middle part of the text is not stored in the blank character information storage means in association with the structure information of the structure relating to the blank character string, and the character string length of the blank character string is longer than the character string length of the blank character string stored in the blank character information storage means in association with the structure information of the level one above the structure containing the text.

The structured document database management system according to claim 1, further comprising blank character temporary storage means for temporarily storing a blank character string as a formatting blank character, in association with structure information representing the structure of the element containing the blank character string, when the ignorable blank character determination means determines that the text is an ignorable blank character, and when a blank character string appears at the end of the text decomposed by the decomposition means,
wherein, when the syntax analysis means detects a start tag following the blank character string stored in the blank character temporary storage means, the first blank character determination means stores the blank character string in the blank character information storage means as a formatting blank character, in association with structure information representing the structure of the element containing the start tag.

The structured document database management system according to claim 1, further comprising document reading means for reading the structured document,
wherein the syntax analysis means analyzes the structure of the structured document read by the document reading means.

The structured document database management system according to claim 1, further comprising document reading means for reading the structured document,
wherein document type definition information defining the structure of the structured document is attached to the structured document, and
the syntax analysis means analyzes the document type definition information, thereby recognizing in advance the elements in which character data appears in the structured document to which the document type definition information is attached and the structure of those elements, and, when the structured document is read by the document reading means, analyzes the text following the tags of the recognized elements in the structured document.

A program for causing a database server computer, which manages a structured document database storing structured documents in which a hierarchical structure is expressed using tags, and which has blank character information storage means for storing a blank character string that appears, as a formatting blank character, in a structured document to be stored in the structured document database, in association with structure information representing the structure to which the blank character string relates, to function as:
syntax analysis means for analyzing the structure of the structured document to be stored in the structured document database and detecting text that appears after a tag;
ignorable blank character determination means for determining whether the text detected by the syntax analysis means is an ignorable blank character consisting only of blank characters;
first blank character determination means for, when the text is the ignorable blank character and the ignorable blank character is followed by a start tag, confirming that the ignorable blank character is the formatting blank character relating to the structure of the next element containing the start tag, and storing it in the blank character information storage means in association with structure information representing that structure;
decomposition means for, when the text is not the ignorable blank character, decomposing the text into blank character strings and text portions that contain no blank character string;
second blank character determination means for determining at least whether a blank character string appears at the head of the text decomposed by the decomposition means and, if one appears, confirming that the blank character string is the formatting blank character and storing it in the blank character information storage means in association with structure information representing the structure to which the blank character string relates;
index creation means for creating an index used to search for the structured document containing the text, based on the text portion obtained by removing from the text the formatting blank character and the blank character string so confirmed, and for storing the index in the structured document database; and
document storage processing means for storing the structured document containing the text in the structured document database.