WO2009113289A1 - New case generation apparatus, new case generation method, and new case generation program
- Publication number
- WO2009113289A1 (application PCT/JP2009/001046, JP2009001046W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- case
- new
- context
- new case
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
Definitions
- the present invention relates to a new case generation apparatus, a new case generation method, and a new case generation program, and in particular to a new case generation apparatus, a new case generation method, and a new case generation program that can generate, based on an input case, a new case of the same type as that case.
- Patent Document 1 describes an example of an information extraction device relating to a technique for preventing such a reduction in accuracy of information extraction rules.
- a score indicating the probability that each extraction result is the information to be extracted is calculated, taking into account an evaluation measure of the accuracy of the information extraction rule. Then, extraction results with low scores are removed, which prevents a reduction in the accuracy of the extraction results.
- a case-based reasoning method is described in which cases retrieved by a search process are scored according to the degree to which they contain the input words or phrases, and the cases are rearranged in descending order of score.
- the extraction result extracted by the information extraction device is only information extracted by an information extraction rule created from cases given in advance. Consequently, even if the extraction results are used as new cases, the information that can be extracted is biased, so there is a limit to how far the completeness of the information extraction rule can be improved.
- an object of the present invention is to provide a new case generation apparatus, a new case generation method, and a new case generation program that can accurately generate, based on a case of information to be extracted, a new case of the same type as that case.
- the new case generation apparatus includes: new case generation means for receiving, as input, a case of information to be extracted and a case context that is surrounding text data including the case, and for generating, using document data and based on the input case and case context, a new case of the same type as the case and a new case context that is surrounding text data including the new case and that is different from the case context; similarity calculation means for calculating the similarity between the case context and the new case context; and new case narrowing means for narrowing down and outputting the new cases generated by the new case generation means based on the similarity calculated by the similarity calculation means.
- the new case generation method includes: receiving, as input, a case of information to be extracted and a case context that is surrounding text data including the case; generating, using document data and based on the input case and case context, a new case of the same type as the case and a new case context that is surrounding text data including the new case and that is different from the case context; calculating the similarity between the case context and the new case context; and narrowing down and outputting the generated new cases based on the calculated similarity.
- the new case generation program causes a computer to execute: a new case generation process of receiving, as input, a case of information to be extracted and a case context that is surrounding text data including the case, and generating, using document data and based on the input case and case context, a new case of the same type as the case and a new case context that is surrounding text data including the new case and that is different from the case context; a similarity calculation process of calculating the similarity between the case context and the new case context; and a new case narrowing process of narrowing down and outputting the generated new cases based on the calculated similarity.
- FIG. 1 is a block diagram showing an example of the configuration of a new case generation apparatus according to the present invention.
- the new case generation apparatus includes a data input unit 11, a new case generation unit 12, a similarity calculation unit 13, and a new case narrowing unit 14.
- the data input unit 11 inputs a case that is information to be extracted and a case context that is surrounding text data including the case.
- the new case generation unit 12 extracts, as new cases, information that is a new case candidate from the document data according to a condition based on the input case, and generates a new case context that is surrounding text data including the new case and that is different from the case context.
- the similarity calculation unit 13 calculates the similarity between the case context and the new case context.
- the new case narrowing unit 14 narrows and outputs new cases based on the similarity calculated by the similarity calculation unit 13.
- the similarity calculation unit 13 calculates the similarity between the case context and the new case context, and the degree of pattern difference between the data that is a part of the case context and the data that is a part of the new case context.
- the new case narrowing unit 14 narrows down and outputs new cases based on the similarity and the pattern dissimilarity calculated by the similarity calculation unit 13.
- the new case generation device is specifically realized by an information processing device such as a personal computer that operates according to a program.
- Each processing unit shown in FIG. 1 operates as outlined below.
- the data input unit 11 is specifically realized by a CPU of an information processing apparatus that operates according to a program.
- the data input unit 11 has a function of receiving, as an input, a case context that is surrounding text data including a case that is information to be extracted.
- the data input unit 11 inputs a case (for example, a famous politician name or a famous case name) to be extracted from an input device such as a keyboard or a mouse in accordance with a user operation. Then, the data input unit 11 extracts and inputs the case context including the input case from the document data stored in the document database in advance.
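The context extraction described above can be sketched as follows. This is a minimal illustration rather than the method of the specification: the function name `extract_case_contexts`, the character-based `window` parameter, and the sample documents are assumptions introduced here, and the document database is stood in for by a plain list of strings.

```python
def extract_case_contexts(case: str, documents: list[str], window: int = 20) -> list[str]:
    """Return the text surrounding every occurrence of `case`.

    Hypothetical helper: the context is taken as `window` characters
    on each side of the case, clipped to the document boundaries.
    """
    contexts = []
    for doc in documents:
        start = doc.find(case)
        while start != -1:
            left = max(0, start - window)
            right = min(len(doc), start + len(case) + window)
            contexts.append(doc[left:right])
            # Continue searching after the current match position.
            start = doc.find(case, start + 1)
    return contexts


# Invented sample data standing in for the document database.
documents = [
    "Yesterday President Bush visited Japan for trade talks.",
    "The bakery sells cake all winter.",
]
contexts = extract_case_contexts("President Bush", documents, window=10)
```

A character window is only one possible choice; the same loop could equally collect whole sentences or paragraphs around the match.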
- the new case generation unit 12 is specifically realized by a CPU of an information processing apparatus that operates according to a program.
- the new case generation unit 12 has a function of extracting information as a new case candidate from the document data as a new case according to the condition based on the case input by the data input unit 11. Further, the new case generation unit 12 has a function of generating new case context that is peripheral text data including the extracted new case and is different from the case context.
- for example, the new case generation unit 12 uses the document data to generate a new case that contains the same character string as the character string corresponding to the case and whose new case context is text data different from the case context of the case.
- alternatively, the new case generation unit 12 may use the document data to generate a new case that has the same morpheme sequence pattern as a predetermined pattern of the morpheme sequence corresponding to the case and whose new case context is text data different from the case context of that morpheme sequence.
- the new case generation unit 12 may generate, as the new case context, text data including at least one of a predetermined number of character strings, morphemes, sentences, or paragraphs existing around the new case.
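The character-string variant of new case generation can be sketched as follows. This is an illustration under stated assumptions: the shared string (`key`) is supplied by the caller, the new case context is taken to be the single sentence containing the match, and the names `generate_new_cases` and `case_contexts` are hypothetical. The sentence splitter is deliberately naive and would mis-split abbreviations such as "Mr.".

```python
import re


def generate_new_cases(key: str, case_contexts: set[str],
                       documents: list[str]) -> list[tuple[str, str]]:
    """Return (new case, new case context) pairs for sentences containing `key`.

    A candidate is kept only when its surrounding sentence differs from
    every known case context; identical contexts are discarded.
    """
    results = []
    for doc in documents:
        # Naive sentence split on ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            if key in sentence and sentence not in case_contexts:
                results.append((key, sentence))
    return results


# Invented sample data.
documents = [
    "President Bush visits Japan. Bush met the prime minister in Tokyo.",
    "Bush de Noel is a Christmas cake.",
]
case_contexts = {"President Bush visits Japan."}
new_cases = generate_new_cases("Bush", case_contexts, documents)
```

Here the sentence that is already the case context is dropped, and the two remaining sentences become new case contexts to be scored later.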
- the similarity calculation unit 13 is specifically realized by a CPU of an information processing device that operates according to a program.
- the similarity calculation unit 13 has a function of calculating a topic similarity between the case context input by the data input unit 11 and the new case context generated by the new case generation unit 12.
- in addition to the similarity, the similarity calculation unit 13 may have a function of calculating the pattern dissimilarity between data that is a part of the case context and data that is a part of the new case context.
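The text at this point does not fix a formula for either measure, so the following is only one plausible sketch. It assumes that topic similarity is the cosine similarity of bag-of-words vectors and that pattern dissimilarity is one minus the Jaccard overlap of the token sets taken from a part of each context. Both choices, and the function names, are assumptions introduced for illustration.

```python
import math
from collections import Counter


def topic_similarity(context_a: str, context_b: str) -> float:
    """Cosine similarity of whitespace-tokenized bag-of-words vectors (assumed measure)."""
    a = Counter(context_a.lower().split())
    b = Counter(context_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def pattern_dissimilarity(part_a: list[str], part_b: list[str]) -> float:
    """1 - Jaccard overlap of the tokens in two context parts (assumed measure)."""
    set_a, set_b = set(part_a), set(part_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)
```

With these definitions, a new case context scores well when it shares topic vocabulary with the case context while its local wording differs.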
- the new case narrowing unit 14 is specifically realized by a CPU of an information processing apparatus that operates according to a program.
- the new case narrowing unit 14 has a function of narrowing down new cases generated by the new case generation unit 12 based on the similarity calculated by the similarity calculation unit 13.
- the new case narrowing unit 14 has a function of narrowing down the new cases generated by the new case generation unit 12 based on the similarity and the pattern difference calculated by the similarity calculation unit 13.
- the new case narrowing unit 14 has a function of outputting the narrowed-down new cases. In this case, for example, the new case narrowing unit 14 displays the narrowed-down new cases on a display device such as a display.
- the storage device (not shown) of the new case generation apparatus stores various programs for generating a new case of the same type as the input case.
- the storage device of the new case generation apparatus stores a new case generation program for causing a computer to execute: a new case generation process of receiving, as input, a case and a case context that is surrounding text data including the case, and generating, using document data and based on the input case and case context, a new case of the same type as the case and a new case context that is surrounding text data including the new case and that is different from the case context; a similarity calculation process of calculating the similarity between the case context and the new case context and the pattern dissimilarity between data that is a part of the case context and data that is a part of the new case context; and a new case narrowing process of narrowing down and outputting the generated new cases based on the similarity and the pattern dissimilarity.
- FIG. 2 is a flowchart illustrating an example of processing for generating a new case of the same type as the case input by the new case generation apparatus.
- the data input unit 11 accepts as input a case context that is peripheral text data including a case that is information to be extracted (step A1 shown in FIG. 2). For example, when a case input operation is performed by the user, the data input unit 11 inputs a case to be extracted, and starts a new case generation process after step A1.
- the new case generation unit 12 sets conditions for extracting the case context based on the case input by the data input unit 11, and extracts, as new cases, information that is a new case candidate from the document data (for example, document data stored in the document database in advance) according to the set conditions. The new case generation unit 12 then compares the text data around each extracted new case with the case context; when the text differs from the case context, it adopts the new case and generates a new case context from the text data around the new case (step A2). Because a new case generated in this way has a new case context that differs from the case context, the new case and the new case context can be used to generate information extraction rules. If the text data around the new case is the same as the case context, adopting the case as a new case would not improve the completeness of the information extraction rule, so the new case is discarded without being adopted.
- the similarity calculation unit 13 calculates the similarity between the case context input by the data input unit 11 and the new case context generated by the new case generation unit 12 (step A3). Alternatively, the similarity calculation unit 13 calculates the pattern dissimilarity between data that is a part in the case context and data that is a part in the new case context, in addition to the similarity.
- the new case narrowing unit 14 narrows down new cases based on the similarity calculated by the similarity calculation unit 13. Alternatively, the new case narrowing unit 14 narrows down new cases based on the similarity and the pattern difference calculated by the similarity calculation unit 13. Then, the new case narrowing unit 14 outputs the narrowed-down new case (Step A4). For example, the new case narrowing unit 14 displays the narrowed-down new cases on the display device together with the new case context.
- for example, the new case narrowing unit 14 may arrange the new case contexts in descending order of similarity and extract new case contexts from the top as the narrowing result. The new case narrowing unit 14 may also extract, as the narrowing result, the new cases included in new case contexts whose similarity exceeds a predetermined value. Alternatively, the new case narrowing unit 14 may arrange the new case contexts in descending order of similarity and pattern dissimilarity and extract a predetermined number of new case contexts from the top as the narrowing result.
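The three narrowing strategies just listed can be sketched as follows. The `ScoredCase` record, the function names, and the sample scores are hypothetical. In particular, ranking by the sum of similarity and pattern dissimilarity is an assumption, since the text does not say how the two values are combined.

```python
from typing import NamedTuple


class ScoredCase(NamedTuple):
    new_case: str
    context: str
    similarity: float
    dissimilarity: float = 0.0


def top_k_by_similarity(cases: list[ScoredCase], k: int) -> list[ScoredCase]:
    """Sort in descending order of similarity and take the first k."""
    return sorted(cases, key=lambda c: c.similarity, reverse=True)[:k]


def above_threshold(cases: list[ScoredCase], threshold: float) -> list[ScoredCase]:
    """Keep cases whose similarity exceeds a predetermined value."""
    return [c for c in cases if c.similarity > threshold]


def top_k_combined(cases: list[ScoredCase], k: int) -> list[ScoredCase]:
    """Assumed combination: rank by similarity + pattern dissimilarity."""
    return sorted(cases, key=lambda c: c.similarity + c.dissimilarity,
                  reverse=True)[:k]


# Invented scores for illustration only.
cases = [
    ScoredCase("Mr. Bush", "context 1", 0.6, 0.5),
    ScoredCase("Bush de Noel", "context 2", 0.1, 0.9),
    ScoredCase("President Bush", "context 3", 0.8, 0.1),
]
```

A weighted sum or a two-stage filter (threshold on similarity, then sort by dissimilarity) would be equally consistent with the description.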
- the new case generation apparatus generates new cases that are new case candidates based on the case of information to be extracted, and generates new case contexts that are different from the case context. The apparatus then calculates the similarity between the case context and each generated new case context, or, in addition to the similarity, the pattern dissimilarity between data that is a part of the case context and data that is a part of the new case context. The new cases are then narrowed down based on the similarity, or on the similarity and the pattern dissimilarity.
- calculating the similarity between the case context and the new case context determines whether the two contexts are similar. If they are, the similarity of the new case context including the new case is high, so narrowing down to new cases included in new case contexts with high similarity makes it possible to accurately generate new cases that are similar to the case and have a context different from the case context.
- the similarity between the case context and the new case context is calculated, and the pattern dissimilarity between the data that is part of the case context and the data that is part of the new case context is calculated.
- for example, consider the situation in which the case “President Bush visits Japan” is input.
- the new case generation apparatus generates cases such as “Mr. Bush” and “Bush de Noel” as candidates for the new case.
- the new case generation apparatus obtains a similarity between the new case context including “Mr. Bush” and “Bush de Noel” and the case context including “President Bush”. Then, the new case generation apparatus extracts and outputs new cases by narrowing down to “Mr. Bush” based on the high degree of similarity.
- in this way, the contexts before and after the cases are compared to narrow down and extract new cases, so new cases can be generated and output with high accuracy.
- the contexts surrounding “President Bush” and “Mr. Bush” are likely to contain many words related to politics, whereas the context surrounding “Bush de Noel” is likely to contain words related to cake and Christmas and no words related to politics. Therefore, by comparing similarities between contexts, “Bush de Noel”, which has low relevance, can be excluded from the new cases, and new cases related to the input case can be generated and output with high accuracy.
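The "Bush" example can be worked through with a small script. The context sentences below are invented for illustration, and bag-of-words cosine similarity is an assumed stand-in for the similarity measure, which is not specified at this point in the text.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity of whitespace-tokenized bag-of-words vectors (assumed measure)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


# Invented contexts: one political, one about cake.
case_context = "president bush visits japan for a summit with the prime minister"
candidate_contexts = {
    "Mr. Bush": "mr. bush met the prime minister at the summit",
    "Bush de Noel": "bush de noel is a chocolate cake eaten at christmas",
}

scores = {case: cosine(case_context, ctx)
          for case, ctx in candidate_contexts.items()}
best = max(scores, key=scores.get)
print(best)
```

Because the political context shares words such as "prime", "minister", and "summit" with the case context, the candidate drawn from it ranks above the cake-related one.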
- FIG. 3 is a block diagram illustrating a configuration example of the new case generation apparatus according to the second embodiment.
- the new case generation apparatus includes a data input unit 11A, an extraction rule application unit 15, a new case generation unit 12, a similarity calculation unit 13, and a new case narrowing unit 14.
- this embodiment is different from the first embodiment in that the new case generation apparatus includes an extraction rule application unit 15 in addition to the components shown in FIG. 1.
- the function of the data input unit 11A is different from the function of the data input unit 11 shown in the first embodiment.
- the data input unit 11A inputs information extraction rules.
- the extraction rule application unit 15 obtains a case and a case context that is surrounding text data including the case from the extraction result obtained by applying the information extraction rule to the document data.
- the new case generation unit 12 extracts, as new cases, information that is a new case candidate from the document data according to a condition based on the acquired case, and generates a new case context that is text data around the new case and that is different from the case context.
- the similarity calculation unit 13 calculates the similarity between the case context and the new case context.
- the new case narrowing unit 14 narrows and outputs new cases based on the similarity calculated by the similarity calculation unit 13.
- the similarity calculation unit 13 calculates the similarity between the case context and the new case context, and further calculates the pattern dissimilarity between data that is a part of the case context and data that is a part of the new case context.
- the new case narrowing unit 14 narrows down and outputs new cases based on the similarity and the pattern difference calculated by the similarity calculation unit 13.
- Each processing unit shown in FIG. 3 operates as outlined below.
- the data input unit 11A is specifically realized by a CPU of an information processing apparatus that operates according to a program.
- the data input unit 11A has a function of accepting as an input an information extraction rule that is a rule for extracting a case to be extracted.
- the extraction rule applying unit 15 is realized by a CPU of an information processing apparatus that operates according to a program.
- the extraction rule application unit 15 has a function of extracting a case by applying the information extraction rule input by the data input unit 11A to document data.
- the extraction rule application unit 15 has a function of acquiring a case context that is surrounding text data including a case based on the extraction result (case).
- the extraction rule application unit 15 extracts a case that matches the information extraction rule from the document data stored in the document database in advance. Then, the case context including the extracted case is extracted from the document data stored in the document database.
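Applying an extraction rule and collecting the case context can be sketched as follows. The form of an information extraction rule is not specified here, so this sketch assumes, purely for illustration, that a rule is a regular expression. The function name `apply_extraction_rule`, the `window` parameter, and the sample documents are likewise hypothetical.

```python
import re


def apply_extraction_rule(rule: str, documents: list[str],
                          window: int = 30) -> list[tuple[str, str]]:
    """Return (case, case context) pairs for every match of `rule`.

    Assumption: the rule is a regular expression, and the context is
    `window` characters on each side of the match.
    """
    pattern = re.compile(rule)
    results = []
    for doc in documents:
        for match in pattern.finditer(doc):
            left = max(0, match.start() - window)
            right = min(len(doc), match.end() + window)
            results.append((match.group(0), doc[left:right]))
    return results


# Invented sample data standing in for the document database.
documents = [
    "Today President Bush visits Japan.",
    "A cake recipe for Christmas.",
]
extracted = apply_extraction_rule(r"President \w+", documents)
```

Each returned pair can then be handed to the new case generation step as a case and its case context.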
- the new case generation unit 12 is specifically realized by a CPU of an information processing apparatus that operates according to a program.
- the new case generation unit 12 has a function of extracting, as new cases, information that is a new case candidate from the document data according to a condition based on the case extracted by the extraction rule application unit 15.
- the new case generation unit 12 has a function of generating new case context that is peripheral text data including the extracted new case and is different from the case context.
- the similarity calculation unit 13 is specifically realized by a CPU of an information processing device that operates according to a program.
- the similarity calculation unit 13 has a function of calculating topic similarity between the case context extracted by the extraction rule application unit 15 and the new case context generated by the new case generation unit 12.
- in addition to the function of calculating the similarity, the similarity calculation unit 13 has a function of calculating the pattern dissimilarity between data that is a part of the case context and data that is a part of the new case context.
- the new case narrowing unit 14 is specifically realized by a CPU of an information processing apparatus that operates according to a program.
- the new case narrowing unit 14 has a function of narrowing down new cases generated by the new case generation unit 12 based on the similarity calculated by the similarity calculation unit 13.
- the new case narrowing unit 14 has a function of narrowing down the new cases generated by the new case generation unit 12 based on the similarity and the pattern difference calculated by the similarity calculation unit 13.
- the new case narrowing unit 14 has a function of outputting the narrowed-down new cases. In this case, for example, the new case narrowing unit 14 displays the narrowed-down new cases on a display device such as a display.
- FIG. 4 is a flowchart illustrating a processing example for generating a new case of the same type as the extraction result based on the information extraction rule input by the new case generation apparatus according to the second embodiment.
- the data input unit 11A receives an information extraction rule for extracting information to be extracted as an input (step B1 shown in FIG. 4). For example, when an information extraction rule input operation is performed by the user, the data input unit 11A inputs the information extraction rule and starts a new case generation process after step B1.
- the extraction rule application unit 15 applies the information extraction rule input by the data input unit 11A to the document data, and extracts an extraction target case. Further, the extraction rule application unit 15 extracts a case context which is surrounding text data including the case by using the obtained extraction result as a case (step B2).
- the new case generation unit 12 sets the condition for extracting the case context, treating the extraction result extracted by the extraction rule application unit 15 as a case. Further, the new case generation unit 12 extracts, as new cases, information that is a new case candidate from the document data (for example, document data stored in the document database in advance) according to the set condition. Then, the new case generation unit 12 compares the text data around each extracted new case with the case context; when the text differs from the case context, it adopts the new case and takes the text data around the new case as the new case context (step B3).
- the similarity calculation unit 13 calculates the similarity between the case context extracted by the extraction rule application unit 15 and the new case context generated by the new case generation unit 12 (step B4). Alternatively, the similarity calculation unit 13 calculates the pattern dissimilarity between data that is a part in the case context and data that is a part in the new case context, in addition to the similarity.
- the extraction rule application unit 15 may store the extracted case context in a case storage unit (for example, a buffer formed in the RAM). Further, the new case generation unit 12 may store the generated new case context in a new case storage unit (for example, a buffer formed in the RAM).
- the similarity calculation unit 13 may calculate the similarity or the pattern dissimilarity by referring to the case context stored in the case storage unit, the new case context stored in the new case storage unit, and the document data stored in advance in the document storage unit (for example, a buffer formed in the RAM).
- the new case narrowing unit 14 narrows down new cases based on the similarity calculated by the similarity calculation unit 13. Alternatively, the new case narrowing unit 14 narrows down new cases based on the similarity and the pattern difference calculated by the similarity calculation unit 13. Then, the new case narrowing unit 14 outputs the narrowed-down new case as an extraction result (Step B5). For example, the new case narrowing unit 14 displays the narrowed-down new cases on the display device.
- the new case generation apparatus applies the information extraction rule to the document data and extracts the case context from the extracted information. The apparatus then generates, based on the case, a new case context different from the case context, and calculates the topic similarity between the case context and the new case context and the pattern dissimilarity between data that is a part of the case context and data that is a part of the new case context. The apparatus thereby narrows down to new cases whose context has high similarity, or to new cases whose context has high similarity and high pattern dissimilarity.
- FIG. 5 is a block diagram illustrating a configuration example of the new case generation apparatus according to the third embodiment. As shown in FIG. 5, this embodiment is different from the second embodiment in that the new case generation apparatus includes an extraction rule generation unit 16 in addition to the components shown in FIG. 3. In the present embodiment, the function of the new case narrowing unit 14A is different from the function of the new case narrowing unit 14 shown in the second embodiment.
- the new case narrowing unit 14A is specifically realized by a CPU of an information processing apparatus that operates according to a program.
- the new case narrowing unit 14A has a function of narrowing down the new cases generated by the new case generation unit 12 based on the similarity calculated by the similarity calculation unit 13, or on the similarity and the pattern dissimilarity. Further, the new case narrowing unit 14A has a function of outputting the narrowed-down new cases. In this case, for example, the new case narrowing unit 14A displays the narrowed-down new cases on a display device such as a display.
- the new case narrowing unit 14A has a function of passing (outputting) the narrowing result of the new case to the extraction rule generating unit 16.
- the extraction rule generation unit 16 is realized by a CPU of an information processing apparatus that operates according to a program.
- the extraction rule generation unit 16 has a function of generating an information extraction rule for extracting a new case narrowed down by the new case narrowing unit 14A.
- the extraction rule generation unit 16 has a function of outputting the generated information extraction rule.
- the extraction rule generation unit 16 displays the generated information extraction rule on a display device such as a display.
- the extraction rule generation unit 16 may pass (output) the generated information extraction rule to the data input unit 11A so that it is used as the next input information extraction rule.
- the functions of the data input unit 11A, the extraction rule application unit 15, the new case generation unit 12, and the similarity calculation unit 13 are the same as those described in the second embodiment.
- FIG. 6 is a flowchart illustrating a processing example for generating a new case of the same type as the case input by the new case generation apparatus according to the third embodiment.
- the operations performed by the data input unit 11A, the extraction rule application unit 15, the new case generation unit 12, and the similarity calculation unit 13 in steps C1 to C4 in FIG. 6 are the same as the operations performed by those units in steps B1 to B4 in FIG. 4, so a description thereof is omitted.
- in the second embodiment, the new case narrowing unit 14 outputs, in step B5, the result of narrowing down the new cases based on the similarity, or the similarity and the pattern dissimilarity, calculated by the similarity calculation unit 13.
- the new case narrowing unit 14A not only outputs the result of narrowing down the new cases, but also passes it to the extraction rule generation unit 16 (step C5 shown in FIG. 6).
- to increase the accuracy of the information extraction rule generation performed by the extraction rule generation unit 16, the new case narrowing unit 14A may pass (output) not only the narrowed-down new cases but also the new cases excluded by the narrowing and information such as the similarity used for the narrowing decision.
- the extraction rule generation unit 16 can then increase the accuracy of the information extraction rule, for example by using the new cases excluded by the narrowing as negative examples, or by preferentially extracting new cases with high similarity, or with high similarity and pattern dissimilarity.
- the extraction rule generation unit 16 generates an information extraction rule for extracting the extraction result (the narrowed-down new cases) of the new case narrowing unit 14A, and outputs the generated information extraction rule (step C6). The process may be terminated once the information extraction rule has been output in step C6.
- the new case generation apparatus further performs the processing of the following steps by a bootstrap technique.
- the extraction rule generation unit 16 determines whether the end condition is satisfied (step C7). If the end condition is satisfied, the process ends as it is. If the end condition is not satisfied, the extraction rule generation unit 16 passes (outputs) the generated information extraction rule to the data input unit 11A. The data input unit 11A uses the information extraction rule from the extraction rule generation unit 16 as the next input.
- as a method of determining the end condition, for example, the extraction rule generation unit 16 may determine whether an information extraction rule has been generated, end the process when no information extraction rule has been generated, and continue the process while rules are generated. The extraction rule generation unit 16 may also set in advance the number of cycles for repeating the processes of steps C1 to C7 and end the process when the set number of cycles is reached. In addition, the extraction rule generation unit 16 may set in advance the number of information extraction rules to be generated, accumulate the number of generated information extraction rules, and end the process when the set number is reached. However, the method of determining the end condition is not limited to those shown in the present embodiment, and the extraction rule generation unit 16 may determine the end condition using another method.
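The bootstrap loop and its three example end conditions can be sketched as follows. This is an illustrative skeleton only: `run_cycle` is a hypothetical callback standing in for steps C1 to C6 (apply the rule, generate and narrow new cases, generate a new rule), and rules are represented as plain strings for simplicity.

```python
from typing import Callable, Optional


def bootstrap(initial_rule: str,
              run_cycle: Callable[[str], Optional[str]],
              max_cycles: int = 5,
              max_rules: int = 10) -> list[str]:
    """Repeat the rule-generation cycle until an end condition holds.

    End conditions (step C7): no rule was generated, the preset number
    of cycles is reached, or the preset number of rules is reached.
    """
    rules: list[str] = []
    rule = initial_rule
    for _ in range(max_cycles):
        new_rule = run_cycle(rule)      # stands in for steps C1 to C6
        if new_rule is None:            # no rule generated: stop
            break
        rules.append(new_rule)
        if len(rules) >= max_rules:     # enough rules accumulated: stop
            break
        rule = new_rule                 # fed back as the next input rule
    return rules


def stub_cycle(rule: str) -> Optional[str]:
    """Toy cycle: extends the rule until it is three characters long."""
    return None if len(rule) >= 3 else rule + "x"


generated = bootstrap("a", stub_cycle)
```

With the toy cycle above, the loop stops on the "no rule generated" condition; passing a cycle that always returns a rule exercises the cycle-count and rule-count limits instead.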
- the extraction rule generation unit 16 generates a new information extraction rule using the extraction result of the new case narrowing unit 14A. With this configuration, not only can new information of the same type as the information extracted by the initially input information extraction rule be extracted, but new information extraction rules that extract the same type of information can also be obtained.
- the data input unit, extraction rule application unit, new case generation unit, similarity calculation unit, new case narrowing unit, and extraction rule generation unit shown in the first to third embodiments may each be realized as separate units.
- the new case generation device is realized by a computer.
- the computer is a data processing device such as a personal computer or a workstation. Further, the computer includes known components such as an input interface unit that is connected to an input device such as a keyboard and outputs an operation signal of the input device to the CPU, a ROM (Read Only Memory), a RAM (Random Access Memory), an output interface unit for connecting an output device such as a display device, a hard disk device (HD: Hard Disk), and a CPU (Central Processing Unit).
- the ROM stores a program for the basic control of each part of the new case generation device.
- the program may be stored in an external storage device.
- the RAM is used as a work area of the CPU, and temporarily stores programs executed by the CPU and various data.
- the program in the ROM is read into the RAM, and the CPU operates according to the control of the program read into the RAM.
- the CPU functions as each processing unit such as the data input unit 11, the new case generation unit 12, the similarity calculation unit 13, and the new case narrowing unit 14.
- the CPU creates in the RAM, as buffers, a document storage unit that stores document data, a case storage unit that stores case contexts, and a new case storage unit that stores new case contexts.
- the HD stores various software for controlling the computer, such as an operating system.
- the document data may be stored in the HD in advance, and a necessary document may be read from the HD into the RAM as needed during operation.
- FIG. 7 is an explanatory diagram showing an example of document data.
- the document data shown in FIG. 7 is read from an external storage device or the like and stored in the document storage unit.
- the document storage unit stores a document ID, which is an identifier for identifying document data, and text data, which is a document entity, in association with each other.
- assume that the document storage unit stores, in association with the document ID “DOC1”, document text data consisting of a plurality of sentences, including the sentence “the XX party member of the XX party says ⁇ ”.
- the document text data may be an electronic file such as an HTML file, electronic mail, or a word processor document.
- the CPU may extract and store only text data from these electronic files in advance or store the text data and other information in a format that can be identified.
- the document storage unit may store information in a format divided into sentence units as document contents.
- the document storage unit may also store, in association with the text data, an analysis result obtained by analyzing the text data with a language analysis process such as morphological analysis or syntax analysis.
- when execution of the program is started, the CPU functions as the data input unit 11 and receives the information shown in FIG. 8 as an input.
- FIG. 8 shows an example of data of cases and case contexts, and the CPU inputs the information shown in FIG. 8 and stores it in the case storage unit.
- the CPU stores in the case storage unit, in association with one another, a case ID that is an identifier for identifying a case, case context text data that is the entity of the case context including the case, position information indicating the portion of the case context text data that corresponds to the case, and the type of the case. Furthermore, as shown in FIG. 8, the CPU may also store the case content, which is the portion of the text data corresponding to the case, in the case storage unit in association with these items.
- the position information indicates the portion corresponding to the information to be extracted as a case, and can be expressed as offset information in the case context text data.
- the position information may be only offset information in the case context text data.
- the position information may be indicated in a format composed of the start offset and the end offset of the case in the case context text data.
- the position information may be indicated in a format that explicitly gives the offset of the head of the information to be extracted in the case context text data together with its length.
- a tag indicating a case may be added to the case context text data so that the case location can be identified.
- the format of the position information stored in the case storage unit is not limited to the storage format shown in the present embodiment.
- the CPU stores the case context in association with the case.
- for the case context corresponding to the case ID “EX1”, the position information “4, 3” indicates that, counting the beginning of the case context text data as 0, the case content is located at the portion that starts at the fourth character and is three characters long. Note that the length information in the position information may be omitted as long as it can be determined from the case contents.
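The offset-and-length form of the position information can be illustrated with a small record type. This is a sketch under assumptions: the `Case` class, its field names, the sample text, and the type label `"SAMPLE"` are hypothetical and do not come from FIG. 8.

```python
from dataclasses import dataclass


@dataclass
class Case:
    case_id: str
    context_text: str  # case context text data
    offset: int        # start of the case, counting the context head as 0
    length: int        # number of characters in the case content
    case_type: str

    def content(self) -> str:
        """Recover the case content from the position information."""
        return self.context_text[self.offset:self.offset + self.length]


# Position information "4, 3": start at offset 4, three characters long.
example = Case("EX1", "abcdEFGhij", offset=4, length=3, case_type="SAMPLE")
print(example.content())
```

Because the content can be recovered from the offset and length, storing the case content separately is optional, which matches the note that either piece may be omitted when it can be derived from the other.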
- in this example the case storage unit directly stores the case context text data shown in FIG. 8, but instead of the case context text data it may store information designating a part of the text data of a document in the document storage unit, such as a paragraph.
- the CPU functions as the new case generation unit 12 and sets conditions based on each case shown in FIG. Further, the CPU extracts information as a new case candidate from the plurality of documents shown in FIG. 7 stored in the document storage unit as a new case according to the set condition. Then, the CPU generates a new case context using peripheral text data including the extracted new case, and stores the generated new case context in the new case storage unit.
- the CPU generates a new case context using text data different from the case context as the text data used for generating the new case context. For example, the CPU can make a determination based on differences in character strings and morphemes around the corresponding part of the new case, and differences in sentences including the corresponding part of the new case.
- FIG. 9 is an explanatory diagram showing an example of data of a new case and a new case context.
- the CPU stores in the new case storage unit, in association with one another, a new case ID that is an identifier for identifying a new case, new case context text data that is the entity of the new case context including the new case, position information indicating the portion of the new case context text data that corresponds to the new case, and the type of the new case.
- the CPU may also associate the new case contents that are locations in the text data corresponding to the new case and store them in the new case storage unit.
- the type of new case may be the same as the type of case.
- the CPU may use, for example, the condition that information has the same character string as the case content as the condition based on the case. Specifically, when generating a new case based on the case whose case ID shown in FIG. 8 is “EX1”, the CPU extracts, as a new case, a location containing the character string “ ⁇ ⁇ ⁇ ”, which is the case content corresponding to that case ID. Then, the CPU sets the surrounding text data including the new case as a new case context. Note that the CPU may use the entire document including the new case as a new case context.
- the CPU may use morpheme sequence information of the case contents as a condition based on the case. For example, the CPU extracts the morpheme sequence corresponding to the case content from the morphological analysis result of the case context text data. Next, the CPU extracts from the document data, as a new case, a portion having a morpheme sequence whose feature values match a predetermined combination pattern of feature values, such as base form, part of speech, and thesaurus information, among the features of each morpheme of that morpheme sequence.
- the CPU may use a method of generating a new case context by extracting text data around a corresponding part of the new case by a predetermined method as a method of generating the new case context.
- the CPU may set text data specified by a predetermined number of characters, the number of morphemes, the number of sentences, the number of paragraphs, etc. before and after the corresponding part of the new case as the new case context.
- the CPU may determine a window width around the corresponding portion of the new case in terms of a predetermined number of characters, morphemes, sentences, paragraphs, or the like, and set the text data within that window, including the corresponding portion of the new case, as the new case context.
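The string-match condition combined with a character window can be sketched as follows. This is only an illustration: `generate_new_cases`, the `window` parameter, and the sample texts are hypothetical, and the check that a new case context differs from the case context is reduced here to a plain string comparison.

```python
from typing import Dict, List, Tuple


def generate_new_cases(
    case_content: str,
    case_context: str,
    documents: Dict[str, str],
    window: int = 20,
) -> List[Tuple[str, int, str]]:
    """Return (document ID, offset of the new case in its context, context)."""
    results = []
    for doc_id, text in documents.items():
        start = text.find(case_content)
        while start != -1:
            # Window of `window` characters before and after the new case.
            left = max(0, start - window)
            right = min(len(text), start + len(case_content) + window)
            context = text[left:right]
            # Keep only contexts that differ from the original case context.
            if context != case_context:
                results.append((doc_id, start - left, context))
            start = text.find(case_content, start + 1)
    return results


docs = {"D1": "then Tanaka left"}
print(generate_new_cases("Tanaka", "Mr. Tanaka said", docs, window=3))
```

A window counted in morphemes, sentences, or paragraphs would follow the same shape, with the slicing done over the corresponding units instead of characters.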
- the case context data need not store the case context text data directly; instead, the case context may be received in a form that stores information specifying a document ID in the document data.
- the CPU extracts a new case context from a location different from the location indicated by the position information of the document ID specified by the case context.
- the CPU functions as the similarity calculation unit 13, refers to the case context stored in the case storage unit and the new case context stored in the new case storage unit, and calculates the similarity between them.
- the CPU functions as the similarity calculation unit 13 and calculates the pattern dissimilarity between the partial data in the case context and the partial data in the new case context in addition to the similarity.
- the CPU may calculate the similarity between the case context and the new case context, for example, by calculating the cosine similarity between the context vectors. That is, the CPU generates a context vector that expresses a context from text data of a case context and a new case context. Then, the CPU may calculate the cosine value of the angle formed between the context vectors to be calculated, and set the obtained cosine value as the similarity between the contexts.
- the CPU may, for example, divide the text in each context into morphemes by morphological analysis, extract words such as independent words and the feature values of the morphemes, and use them as vector elements to produce the context vector.
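The cosine similarity between context vectors can be sketched as below. This is a minimal illustration with assumptions: whitespace tokenization stands in for morphological analysis, raw term counts stand in for the vector element weights, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict


def context_vector(text: str) -> Dict[str, float]:
    # Assumption: whitespace-separated tokens stand in for morphemes.
    return dict(Counter(text.lower().split()))


def cosine_similarity(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine of the angle formed between two sparse context vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty context shares nothing with any other context
    return dot / (norm_u * norm_v)


case_ctx = context_vector("the party member said the budget passed")
new_ctx = context_vector("the committee member said the vote passed")
print(cosine_similarity(case_ctx, new_ctx))
```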
- the similarity may be calculated using a method in which the similarity calculation between context vectors is devised to improve accuracy, as described in, for example, Japanese Patent No. 3690216, and the calculation is not limited to the similarity calculation method shown in this embodiment.
- the CPU may limit the context group subject to calculation to a context group consisting of a case context and all new case contexts generated based on that case, and calculate the similarity within that group. The reason for this is that, because the group is limited to new cases generated from the same case, the calculation can be performed without unnecessary contexts, and accuracy can be improved.
- the CPU may configure a vector space in the document group limited as described above to form a context vector. By doing so, for example, it can be suppressed that the idf value used for the weight is set inappropriately high, and an improvement in the accuracy of cosine similarity between contexts can be expected.
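Building the vector space from the limited context group can be sketched as follows. This is an illustration under assumptions: the weighting is a plain tf-idf with idf = log(N / df), which is one common choice and not necessarily the weighting intended here, and `tfidf_vectors` is a hypothetical name. The input list is assumed to hold the tokenized case context followed by the new case contexts generated from it.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(contexts: List[List[str]]) -> List[Dict[str, float]]:
    """Weight terms using document frequencies counted only in this group."""
    n = len(contexts)
    # Document frequency: number of contexts in the group containing a term.
    df = Counter(term for tokens in contexts for term in set(tokens))
    vectors = []
    for tokens in contexts:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


group = [["party", "member", "said"], ["party", "leader", "said"]]
print(tfidf_vectors(group))
```

Because the document frequencies come only from the group, a term that appears in every context of the group receives weight 0 here, whereas the same term might look rare, and so be weighted highly, against an unrelated larger collection.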
- the CPU may calculate the similarity by giving a high weight to the context vector of each new case generated based on the same case.
- when there are a plurality of cases of the same type, the CPU may limit the calculation of the similarity to a context group that includes each case context and all new case contexts generated from each case. For example, the CPU may configure a vector space in the context group thus limited to form a context vector.
- the reason is that the new case context generated based on the same type of case context is likely to have a similar context, and thus the vector elements can be counted appropriately. By doing so, for example, the idf value used for the weight can be appropriately set, and the accuracy of the similarity calculated can be expected to be improved.
- when there are a plurality of cases of the same type, the CPU may, in calculating the similarity, calculate the similarity between a new case context and every case context in the context group that includes each case context and all new case contexts generated from each case.
- the CPU may take the maximum of the similarities obtained for a given new case context as the similarity of that new case context.
- the CPU may use a value (multiplication value) obtained by multiplying the similarities of a new case context as the similarity of the new case context.
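The two ways of combining the similarities obtained against several case contexts can be sketched as below. `aggregate_similarity` and its `method` parameter are hypothetical names used only for illustration.

```python
import math
from typing import Iterable


def aggregate_similarity(similarities: Iterable[float], method: str = "max") -> float:
    """Combine the similarities of one new case context to several case contexts."""
    values = list(similarities)
    if not values:
        raise ValueError("at least one similarity is required")
    if method == "max":
        return max(values)          # maximum value of the similarities
    if method == "product":
        return math.prod(values)    # multiplication value of the similarities
    raise ValueError(f"unknown method: {method}")


print(aggregate_similarity([0.5, 0.8], "max"))
print(aggregate_similarity([0.5, 0.8], "product"))
```

The maximum rewards a new case context that closely matches any one case context, while the product favors a context that is reasonably similar to all of them.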
- as the pattern dissimilarity, the CPU can use the edit distance between the partial data in the case context and the partial data in the new case context.
- when the partial data in the case context is a local character string surrounding the case in the case context, and the partial data in the new case context is a local character string surrounding the new case in the new case context, the edit distance between the two character strings can be used.
- a local character string is a character string having a predetermined length that is shorter than the length of each context.
- for example, when the case context and the new case context each consist of a plurality of sentences, the local character string may be the range within 5 characters before and after the character string corresponding to the case in each context.
- a restriction such as staying within the same sentence may be added, for example, within 5 characters before and after the character string corresponding to each case and within the same sentence.
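The edit distance between local character strings can be sketched as below. This is a minimal illustration: the function names and the `width` parameter are hypothetical, the distance is the standard Levenshtein distance with unit costs, and the same-sentence restriction is not implemented.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Number of insertions, deletions, and replacements turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # replace x with y, free if equal
            ))
        prev = cur
    return prev[-1]


def local_string(context: str, offset: int, length: int, width: int = 5) -> str:
    """The case plus up to `width` characters on each side of it."""
    start = max(0, offset - width)
    end = min(len(context), offset + length + width)
    return context[start:end]


case_local = local_string("yesterday Mr. Sato said that", 14, 4)
new_local = local_string("today Ms. Sato stated that", 10, 4)
print(edit_distance(case_local, new_local))
```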
- when the partial data in the case context is a local morpheme sequence that surrounds and includes the case in the case context, and the partial data in the new case context is a local morpheme sequence that surrounds and includes the new case in the new case context, the edit distance between the two morpheme sequences can be used.
- the edit distance between morpheme sequences, like the edit distance between character strings, can be obtained by counting the number of operations, such as insertion, deletion, and replacement of individual morphemes, required to turn one morpheme sequence into the other.
- the local morpheme sequence is a morpheme sequence having a predetermined length that is shorter than the length of each context.
- the morpheme string may be within 3 morphemes before and after the morpheme string corresponding to each case in each context.
- a restriction such as in the same sentence may be added, for example, within three morphemes before and after the morpheme string corresponding to each case.
- each feature of a morpheme may also be treated as a unit of editing.
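The edit distance over morpheme sequences can be sketched by comparing morphemes on a chosen feature. This is one possible reading, not a definitive one: the `Morpheme` fields, the `key` parameter, and the sample part-of-speech labels are hypothetical, and the morphological analyzer that would produce the sequences is not shown.

```python
from typing import Callable, Hashable, NamedTuple, Sequence


class Morpheme(NamedTuple):
    surface: str
    base: str  # base form
    pos: str   # part of speech


def morpheme_edit_distance(
    a: Sequence[Morpheme],
    b: Sequence[Morpheme],
    key: Callable[[Morpheme], Hashable] = lambda m: m.surface,
) -> int:
    """Count morpheme insertions, deletions, and replacements between a and b."""
    ka, kb = [key(m) for m in a], [key(m) for m in b]
    prev = list(range(len(kb) + 1))
    for i, x in enumerate(ka, 1):
        cur = [i]
        for j, y in enumerate(kb, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


a = [Morpheme("said", "say", "VERB"), Morpheme("that", "that", "CONJ")]
b = [Morpheme("says", "say", "VERB"), Morpheme("that", "that", "CONJ")]
print(morpheme_edit_distance(a, b))                        # compare surface forms
print(morpheme_edit_distance(a, b, key=lambda m: m.base))  # compare base forms
```

Choosing the feature used for comparison changes what counts as a difference in pattern: comparing base forms or parts of speech treats inflectional variants as the same pattern.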
- when the partial data in the case context is a subtree that includes the case in the parsing result of the case context, and the partial data in the new case context is a subtree that includes the new case in the parsing result of the new case context, the edit distance between the two subtrees can be used.
- the edit distance between subtrees can be obtained by counting the number of operations required to change the structure of the same subtree by performing operations such as insertion, deletion, and replacement for each node in the subtree.
- the CPU functions as a new case narrowing unit 14 and narrows down new cases based on the calculated similarity. For example, since the similarity is calculated for each new case context, the CPU may arrange the new case contexts in descending order of similarity and narrow down a predetermined number of new cases from the top. Further, the CPU may narrow down the new cases corresponding to the new case context exceeding the predetermined similarity and output the new case narrowing results.
- the CPU functions as a new case narrowing unit 14 and narrows down new cases based on the calculated similarity and pattern difference. For example, since the similarity and the pattern difference are calculated for each new case context, the CPU may narrow down a predetermined number of new cases from the top by arranging the new case contexts in descending order of the similarity and the pattern difference. Alternatively, the new case contexts may be arranged in descending order of the value obtained by multiplying the calculated similarity by the pattern difference degree, and a predetermined number of new cases may be narrowed down from the top.
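The narrowing strategies described above (top N by similarity, a similarity threshold, and ranking by the product of similarity and pattern difference) can be sketched as below. `Scored`, `narrow`, and the parameter names are hypothetical, and the sample scores are made up for illustration.

```python
from typing import List, NamedTuple, Optional


class Scored(NamedTuple):
    new_case_id: str
    similarity: float
    pattern_difference: float = 1.0


def narrow(
    candidates: List[Scored],
    top_n: Optional[int] = None,
    min_similarity: Optional[float] = None,
    use_pattern_difference: bool = False,
) -> List[Scored]:
    """Rank new cases and keep those that pass the configured criteria."""
    # Keep only new cases whose similarity exceeds the threshold, if one is set.
    kept = [c for c in candidates
            if min_similarity is None or c.similarity > min_similarity]

    def score(c: Scored) -> float:
        if use_pattern_difference:
            return c.similarity * c.pattern_difference
        return c.similarity

    ranked = sorted(kept, key=score, reverse=True)  # descending order
    return ranked if top_n is None else ranked[:top_n]


cands = [Scored("N1", 0.9, 0.1), Scored("N2", 0.6, 0.8), Scored("N3", 0.2, 0.9)]
print([c.new_case_id for c in narrow(cands, top_n=2)])
print([c.new_case_id for c in narrow(cands, top_n=1, use_pattern_difference=True)])
```

Multiplying by the pattern difference moves a new case whose local pattern is nearly identical to the original case (N1 in the sample) below one that is topically similar but expressed differently.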
- the CPU may output (for example, display on a display device) using, for example, the format shown in FIG.
- FIG. 10 shows a case where the new cases and new case contexts are output in the same format as that shown in FIG. 9, and the narrowed-down new case contexts are used as the extraction result.
- the CPU may add the calculated similarity to the new case extraction result and output it.
- the similarity calculated corresponding to the narrowed-down new case is also added and output.
- the degree of pattern difference may be added and output.
- an output format may be used in which all new cases, including those excluded by the narrowing process, are output together with a flag indicating whether each new case was excluded.
- the new case generation device generates a new case context that is different from the case context based on the input case, and calculates the similarity between the case context and the generated new case context. New cases are then narrowed down based on the similarity. With this configuration, it is possible to accurately generate a new case that is of the same type as the case and has a context different from the case context.
- the new case generation device generates a new case context that is different from the case context based on the input case, and calculates the similarity between the case context and the generated new case context. Further, the new case generation device calculates the degree of pattern difference between the partial data in the case context and the partial data in the new case context. By doing so, new cases are narrowed down based on the similarity and the degree of pattern difference. Since it is configured as described above, it is possible to accurately generate a new case having the same kind of case and a context different from the case context.
- the configuration of the new case generation device is the same as the configuration shown in the first embodiment.
- This embodiment is different from the first embodiment in that the CPU also functions as the extraction rule application unit 15 by operating a computer as a new case generation device according to program control.
- the CPU functions as the data input unit 11A and accepts as an input an information extraction rule for extracting specific information.
- the information extraction rule may be configured by a dictionary including information to be extracted, a known pattern matching rule combining a plurality of features such as a character string, a morpheme string, and a syntax subtree.
- the CPU prepares and inputs these information as information extraction rules in advance.
- the CPU functions as the extraction rule application unit 15 and extracts information by applying the information extraction rule input by the data input unit 11A to the document stored in the document storage unit. Further, the CPU treats the extracted information as a case, extracts a document including the information (case) as a case context, and stores it in the case storage unit.
- the CPU stores the extracted case context in the same format as the case storage format shown in FIG. 8.
- the information extraction rules are not limited to those shown in this embodiment.
- the information extraction rule may be prepared as extraction model data obtained as a result of learning information to be extracted in advance by various known machine learning techniques.
- the extraction rule application unit 15 realized by the CPU may obtain the extraction result by applying the extraction model data, as an information extraction rule, to the extraction target document.
- the operations of the CPU functioning as the new case generation unit 12, the similarity calculation unit 13, and the new case narrowing unit 14 are the same as those operations described in the first embodiment.
- the new case generation apparatus extracts cases and case contexts by applying the information extraction rule to the documents. Further, the new case generation device generates a new case context different from the case context based on the case, and calculates a topic similarity between the case context and the new case context. It then narrows the new cases down to those with high similarity. With this configuration, it is possible to accurately generate a new case that is of the same type as the information extracted according to the input information extraction rule and that has a context different from the case context.
- the configuration of the new case generation apparatus is the same as the configuration shown in the second embodiment.
- This embodiment is different from the second embodiment in that the CPU also functions as the extraction rule generation unit 16 by operating a computer as a new case generation apparatus according to program control.
- when the CPU functions as the new case narrowing unit 14A, the CPU uses the RAM or the like as a buffer to store the narrowed-down new cases as a narrowing result.
- when the CPU functions as the extraction rule generation unit 16, the CPU reads the narrowing result from the buffer. Note that the CPU may use a method in which the result of narrowing down new cases is output once to an external storage device and then read.
- the CPU functions as the extraction rule generation unit 16 and generates a new information extraction rule using the extraction result that is the result of the new case narrowing unit 14 narrowing down.
- as a method for generating the information extraction rule, for example, in the case of a pattern matching rule, the CPU can obtain the corresponding text, the new case, and its type from the new case context data of the narrowing result, and generate an information extraction rule by a known method.
- when the new case narrowing unit 14 narrows down the new cases, the CPU may also output the new cases that were not adopted (excluded by the narrowing) to the extraction rule generation unit 16.
- generation part 16 can also produce
- the extraction rule generation unit 16 generates a new information extraction rule using the extraction result of the new case narrowing unit 14A. With this configuration, it is possible to obtain not only new information of the same type as the information extracted by the initially input information extraction rule, but also a new information extraction rule that can extract information of that same type.
- FIG. 11 is a configuration diagram illustrating a minimum configuration example of the new case generation apparatus.
- the new case generation apparatus includes a new case generation unit 12, a similarity calculation unit 13, and a new case narrowing unit 14 as the minimum components.
- the new case generation apparatus shown in FIG. 11 generates a new case of the same type as the case as a new case based on the case of information to be extracted.
- the new case generation unit 12 receives as input a case and a case context that is surrounding text data including the case and, based on the input case and case context, generates, using the document data, a new case that is a new case of the same type as the case and a new case context that is surrounding text data including the new case and differs from the case context.
- the similarity calculation unit 13 has a function of calculating the similarity between the case context and the new case context.
- the new case narrowing unit 14 has a function of narrowing and outputting the new cases generated by the new case generation unit 12 based on the similarity calculated by the similarity calculation unit 13.
- the new case generation apparatus with the minimum configuration shown in FIG. 11 can accurately generate a new case of the same type as the case of information to be extracted.
- the new case generation apparatus is characterized by including: new case generation means (for example, realized by the new case generation unit 12) that accepts as input a case of information to be extracted and a case context that is surrounding text data including the case and, based on the input case and case context, generates, using document data, a new case that is a new case of the same type as the case and a new case context that is surrounding text data including the new case and differs from the case context; similarity calculation means (for example, realized by the similarity calculation unit 13) that calculates the similarity between the case context and the new case context; and new case narrowing means (for example, realized by the new case narrowing unit 14) that narrows down and outputs the new cases generated by the new case generation means based on the similarity.
- the new case generation apparatus may include extraction rule application means (for example, realized by the extraction rule application unit 15) that accepts an information extraction rule for extracting specific information as an input and extracts a predetermined extraction result from document data using the input information extraction rule, and the new case generation means may be configured to generate, using document data and based on a case of information to be extracted that consists of the extraction result extracted by the extraction rule application means, a new case of the same type as the case and a new case context that is surrounding text data including the new case and differs from the case context.
- the new case generation means may be configured to generate, using the document data, a new case having the same character string as the character string corresponding to the case, with text data different from the case context of the case as its new case context.
- the new case generation means may be configured to generate, using document data, a new case that has a morpheme sequence pattern identical to a predetermined pattern of the morpheme sequence corresponding to the case, with text data different from the case context of the case as its new case context.
- the new case generation means may be configured to generate, as the new case context, text data including at least one of a predetermined number of character strings, morphemes, sentences, or paragraphs existing around the new case.
- the similarity calculation means may be configured to calculate, as the similarity between the case context and the new case context, the similarity between a case context vector corresponding to the case context and a new case context vector corresponding to the new case context in a vector space generated based on the case context and the new case context.
- the similarity calculation means may be configured to use, as the vector space, a vector space generated based on the case context of a case and the set of all new case contexts generated based on that case, and to calculate in it the similarity between the case context vector corresponding to the case context and the new case context vector corresponding to the new case context.
- the similarity calculation means is based on a set of case contexts of cases of a certain case type and a set of all new case contexts generated based on any case as a vector space. In the generated vector space, the similarity between the case context vector corresponding to the case context and the new case context vector corresponding to the new case context may be calculated.
- the new case generation apparatus may be configured to further include: extraction rule application means (for example, realized by the extraction rule application unit 15) that accepts an information extraction rule for extracting specific information as an input and extracts a predetermined extraction result from document data using the input information extraction rule, the new case generation means taking as input a case of information to be extracted and surrounding text data including the case, both constituted by the extraction result extracted by the extraction rule application means; and information extraction rule generation means (for example, realized by the extraction rule generation unit 16) that generates a new information extraction rule based on the new cases output by the new case narrowing means.
- the extraction rule application unit accepts the information extraction rule generated by the information extraction rule generation unit as a new input, and performs predetermined extraction from the document data using the newly input information extraction rule It may be configured to extract results.
- the similarity calculation means may be configured to also calculate the pattern dissimilarity between data that is part of the case context and data that is part of the new case context, and the new case narrowing means (for example, realized by the new case narrowing unit 14) may be configured to narrow down and output the new cases generated by the new case generation means based on the similarity and the pattern dissimilarity calculated by the similarity calculation means.
- the new case generation apparatus includes: a new case generation unit (for example, realized by the new case generation unit 12) that accepts as input a case of information to be extracted and a case context that is surrounding text data including the case and, based on the input case and case context, generates, using document data, a new case that is a new case of the same type as the case and a new case context that is surrounding text data including the new case and differs from the case context; a similarity calculation unit (for example, realized by the similarity calculation unit 13) that calculates the similarity between the case context and the new case context; and a new case narrowing unit (for example, realized by the new case narrowing unit 14) that narrows down and outputs the new cases generated by the new case generation unit based on the similarity calculated by the similarity calculation unit.
- the new case generation device may include an extraction rule application unit (for example, realized by the extraction rule application unit 15) that accepts an information extraction rule for extracting specific information as an input and extracts a predetermined extraction result from document data using the input information extraction rule, and the new case generation unit may be configured to generate, using document data and based on a case of information to be extracted that consists of the extraction result extracted by the extraction rule application unit, a new case of the same type as the case and a new case context that is surrounding text data including the new case and differs from the case context.
- in the new case generation device, the new case generation unit may be configured to generate, using the document data, a new case having the same character string as the character string corresponding to the case, with text data different from the case context of the case as its new case context.
- the new case generation unit may be configured to generate, using document data, a new case that has a morpheme sequence pattern identical to a predetermined pattern of the morpheme sequence corresponding to the case, with text data different from the case context of the case as its new case context.
- the new case generation unit uses, as a new case context, at least one of a predetermined number of character strings, morphemes, sentences, or paragraphs existing around the new case. You may be comprised so that the text data containing may be produced
- The similarity calculation unit may be configured to calculate the similarity between the case context and the new case context by calculating, in a vector space generated based on the case context and the new case context, the similarity between a case context vector corresponding to the case context and a new case context vector corresponding to the new case context.
- The similarity calculation unit may use, as the vector space, a vector space generated based on the case context of one case and the set of all new case contexts generated from that case, and may calculate in that space the similarity between the case context vector corresponding to the case context and the new case context vector corresponding to the new case context.
- The similarity calculation unit may use, as the vector space, a vector space generated based on the set of case contexts of the cases of a certain case type and the set of all new case contexts generated from any of those cases, and may calculate in that space the similarity between the case context vector corresponding to the case context and the new case context vector corresponding to the new case context.
- The new case generation apparatus may include an extraction rule application unit (for example, realized by the extraction rule application unit 15) that accepts as input an information extraction rule for extracting specific information and extracts a predetermined extraction result from document data using the input information extraction rule. The new case generation unit may accept as input a case of the information to be extracted, composed of the extraction results extracted by the extraction rule application unit, together with the case context that is the surrounding text data containing the case, and may use document data to generate a new case and a new case context. The apparatus may further include an information extraction rule generation unit (for example, realized by the extraction rule generation unit 16) that generates a new information extraction rule based on the new cases output by the new case narrowing unit.
- The extraction rule application unit may be configured to accept the information extraction rule generated by the information extraction rule generation unit as a new input and to extract a predetermined extraction result from the document data using the newly input information extraction rule.
- The similarity calculation unit may calculate a pattern dissimilarity between data that is a part of the case context and data that is a part of the new case context, and the apparatus may include a new case narrowing unit (for example, realized by the new case narrowing unit 14) that narrows down and outputs the new cases generated by the new case generation unit based on both the similarity and the pattern dissimilarity calculated by the similarity calculation unit.
- The present invention can be applied to an information extraction rule generation device that generates new cases of the same type as an input case, and to a program for realizing such a device on a computer. It can also be applied to an information retrieval device that performs keyword search and to a question answering retrieval device that searches for answers matching a question posed in natural language. In these uses, the new case generation method according to the present invention can serve applications such as query expansion, in which keywords or questions are expanded. The present invention can likewise be applied to programs that cause a computer to implement an information retrieval device or a question answering retrieval device.
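The vector-space similarity described in the items above can be illustrated with a minimal sketch. The text only states that a vector space is generated from the case context and the new case contexts and that a similarity between the corresponding vectors is calculated; the whitespace tokenization, the term-frequency weighting, the cosine measure, and all function names below are assumptions made for illustration (Japanese text would need a morphological analyzer instead of `split()`).

```python
import math
from collections import Counter


def build_vocabulary(contexts):
    """Collect the distinct tokens of all contexts; each token is one axis."""
    return sorted({tok for ctx in contexts for tok in ctx.split()})


def to_vector(context, vocabulary):
    """Term-frequency vector of one context over the shared vocabulary."""
    counts = Counter(context.split())
    return [counts[tok] for tok in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def context_similarities(case_context, new_case_contexts):
    """Similarity of the case context to every new case context.

    The vector space is built from the case context of one case plus all
    new case contexts generated from it.
    """
    vocab = build_vocabulary([case_context, *new_case_contexts])
    case_vec = to_vector(case_context, vocab)
    return [cosine(case_vec, to_vector(c, vocab)) for c in new_case_contexts]
```

Building the vocabulary from a set of case contexts of one case type, rather than from a single case context, would correspond to the alternative vector space mentioned above; only the argument passed to `build_vocabulary` would change.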
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
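12 New case generation unit
13 Similarity calculation unit
14, 14A New case narrowing unit
15 Extraction rule application unit
16 Extraction rule generation unit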
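A first embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing an example of the configuration of the new case generation apparatus according to the present invention. As shown in FIG. 1, the new case generation apparatus includes a data input unit 11, a new case generation unit 12, a similarity calculation unit 13, and a new case narrowing unit 14.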
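As a rough sketch of how the units of the first embodiment could fit together, the following generates candidates by exact string match, cuts out a fixed character window as the new case context, and keeps the candidates whose context is similar enough to the original one. The window size, the threshold, and the use of `difflib.SequenceMatcher` as a stand-in similarity are assumptions for illustration, not details taken from the description.

```python
from difflib import SequenceMatcher


def generate_new_cases(case, case_context, documents, window=20):
    """Find occurrences of the case string and cut out their surroundings.

    Occurrences whose surrounding text equals the original case context
    are skipped, because a new case context must differ from it.
    """
    candidates = []
    for doc in documents:
        start = doc.find(case)
        while start != -1:
            lo = max(0, start - window)
            hi = min(len(doc), start + len(case) + window)
            context = doc[lo:hi]
            if context != case_context:
                candidates.append((case, context))
            start = doc.find(case, start + 1)
    return candidates


def similarity(a, b):
    """Stand-in similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a, b).ratio()


def narrow(case_context, candidates, threshold=0.5):
    """Keep only candidates whose context reaches the similarity threshold."""
    return [
        (case, ctx)
        for case, ctx in candidates
        if similarity(case_context, ctx) >= threshold
    ]
```

A caller would pass the input case and its context to `generate_new_cases` and then hand the result to `narrow`; swapping `similarity` for a vector-space measure does not change the surrounding flow.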
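Next, a second embodiment of the present invention is described with reference to the drawings. FIG. 3 is a block diagram showing a configuration example of the new case generation apparatus in the second embodiment. As shown in FIG. 3, the new case generation apparatus includes a data input unit 11A, an extraction rule application unit 15, a new case generation unit 12, a similarity calculation unit 13, and a new case narrowing unit 14.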
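The role of the extraction rule application unit 15 can be pictured with a small sketch in which an information extraction rule is modeled as a regular expression with a named group `case`. That representation, the function name, and the character window used for the case context are assumptions made here; the description does not specify the rule format.

```python
import re


def apply_extraction_rule(rule, documents, window=15):
    """Apply one rule to every document and return (case, case_context) pairs.

    `rule` is a regular expression containing a named group `case`; the
    case context is the matched case plus `window` characters on each side.
    """
    pattern = re.compile(rule)
    results = []
    for doc in documents:
        for match in pattern.finditer(doc):
            start, end = match.span("case")
            lo = max(0, start - window)
            hi = min(len(doc), end + window)
            results.append((match.group("case"), doc[lo:hi]))
    return results
```

The returned pairs have the same shape as the case and case context that the new case generation unit 12 takes as input, which is what lets the second embodiment start from rules instead of hand-supplied cases.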
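Next, a third embodiment of the present invention is described with reference to the drawings. FIG. 5 is a block diagram showing a configuration example of the new case generation apparatus in the third embodiment. As shown in FIG. 5, this embodiment differs from the second embodiment in that the new case generation apparatus includes an extraction rule generation unit 16 in addition to the components shown in FIG. 3. In this embodiment, the function of the new case narrowing unit 14A also differs from that of the new case narrowing unit 14 described in the second embodiment.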
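Because the third embodiment feeds generated rules back into the extraction rule application unit, its control flow can be sketched as a loop. The three stage functions are passed in as hypothetical callables, and the stopping conditions (a fixed number of rounds, or no new rule produced) are assumptions for illustration; the description does not state how the iteration ends.

```python
def iterate_rule_generation(seed_rules, documents, apply_rules,
                            generate_and_narrow, generate_rules, rounds=3):
    """Repeat: apply rules, derive narrowed new cases, derive new rules.

    apply_rules(rules, documents)         -> cases      (role of unit 15)
    generate_and_narrow(cases, documents) -> new cases  (roles of units 12, 13, 14A)
    generate_rules(new_cases)             -> rules      (role of unit 16)
    """
    rules = set(seed_rules)
    for _ in range(rounds):
        cases = apply_rules(rules, documents)
        new_cases = generate_and_narrow(cases, documents)
        new_rules = set(generate_rules(new_cases)) - rules
        if not new_rules:
            break  # nothing new was learned, so further rounds would repeat
        rules |= new_rules
    return rules
```

Keeping the stages as parameters makes it possible to test the loop with trivial stand-ins before plugging in real rule application, narrowing, and rule generation.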
Claims (33)
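- A new case generation apparatus comprising:
new case generation means for accepting as input a case of information to be extracted and a case context, which is surrounding text data containing the case, and for generating, using document data and based on the input case and case context, a new case that is a new case of the same type as the case and a new case context that is surrounding text data containing the new case and differs from the case context;
similarity calculation means for calculating a similarity between the case context and the new case context; and
new case narrowing means for narrowing down and outputting the new cases generated by the new case generation means based on the similarity calculated by the similarity calculation means.
- The new case generation apparatus according to claim 1, further comprising extraction rule application means for accepting as input an information extraction rule for extracting specific information and extracting a predetermined extraction result from document data using the input information extraction rule,
wherein the new case generation means generates, using document data and based on a case of the information to be extracted that is composed of the extraction result extracted by the extraction rule application means, a new case that is a new case of the same type as the case and a new case context that is surrounding text data containing the new case and differs from the case context.
- The new case generation apparatus according to claim 1 or claim 2, wherein the new case generation means generates, using document data, a new case that has the same character string as the character string corresponding to the case and whose new case context is text data different from the case context of the case.
- The new case generation apparatus according to claim 1 or claim 2, wherein the new case generation means generates, using document data, a new case that has the same morpheme sequence pattern as a predetermined pattern of the morpheme sequence corresponding to the case and whose new case context is text data different from the case context of the case.
- The new case generation apparatus according to any one of claims 1 to 4, wherein the new case generation means generates, as the new case context, text data containing at least one of a predetermined number of character strings, morphemes, sentences, or paragraphs existing around the new case.
- The new case generation apparatus according to any one of claims 1 to 5, wherein the similarity calculation means calculates the similarity between the case context and the new case context by calculating, in a vector space generated based on the case context and the new case context, a similarity between a case context vector corresponding to the case context and a new case context vector corresponding to the new case context.
- The new case generation apparatus according to claim 6, wherein the similarity calculation means calculates, in a vector space generated as the vector space based on the case context of a certain case and the set of all new case contexts generated based on that case, the similarity between the case context vector corresponding to the case context and the new case context vector corresponding to the new case context.
- The new case generation apparatus according to claim 6, wherein the similarity calculation means calculates, in a vector space generated as the vector space based on the set of case contexts of the cases of a certain case type and the set of all new case contexts generated based on any of those cases, the similarity between the case context vector corresponding to the case context and the new case context vector corresponding to the new case context.
- The new case generation apparatus according to claim 1, further comprising extraction rule application means for accepting as input an information extraction rule for extracting specific information and extracting a predetermined extraction result from document data using the input information extraction rule,
wherein the new case generation means accepts as input a case of the information to be extracted that is composed of the extraction result extracted by the extraction rule application means and a case context that is surrounding text data containing the case, and generates, using document data, a new case that is a new case of the same type as the case and a new case context that is surrounding text data containing the new case and differs from the case context,
the apparatus further comprising information extraction rule generation means for generating a new information extraction rule based on the new cases output by the new case narrowing means.
- The new case generation apparatus according to claim 9, wherein the extraction rule application means accepts the information extraction rule generated by the information extraction rule generation means as a new input and extracts a predetermined extraction result from document data using the newly input information extraction rule.
- The new case generation apparatus according to claim 1, wherein the similarity calculation means calculates a pattern dissimilarity between data that is a part of the case context and data that is a part of the new case context,
the apparatus comprising new case narrowing means for narrowing down and outputting the new cases generated by the new case generation means based on the similarity and the pattern dissimilarity calculated by the similarity calculation means.
- A new case generation method comprising:
accepting as input a case of information to be extracted and a case context, which is surrounding text data containing the case, and generating, using document data and based on the input case and case context, a new case that is a new case of the same type as the case and a new case context that is surrounding text data containing the new case and differs from the case context;
calculating a similarity between the case context and the new case context; and
narrowing down and outputting the generated new cases based on the calculated similarity.
- The new case generation method according to claim 12, comprising: accepting as input an information extraction rule for extracting specific information and extracting a predetermined extraction result from document data using the input information extraction rule; and
generating, using document data and based on a case of the information to be extracted that is composed of the extracted extraction result, a new case that is a new case of the same type as the case and a new case context that is surrounding text data containing the new case and differs from the case context.
- The new case generation method according to claim 12 or claim 13, comprising generating, using document data, a new case that has the same character string as the character string corresponding to the case and whose new case context is text data different from the case context of the case.
- The new case generation method according to claim 12 or claim 13, comprising generating, using document data, a new case that has the same morpheme sequence pattern as a predetermined pattern of the morpheme sequence corresponding to the case and whose new case context is text data different from the case context of the case.
- The new case generation method according to any one of claims 12 to 15, comprising generating, as the new case context, text data containing at least one of a predetermined number of character strings, morphemes, sentences, or paragraphs existing around the new case.
- The new case generation method according to any one of claims 12 to 16, comprising calculating the similarity between the case context and the new case context by calculating, in a vector space generated based on the case context and the new case context, a similarity between a case context vector corresponding to the case context and a new case context vector corresponding to the new case context.
- The new case generation method according to claim 17, comprising calculating, in a vector space generated as the vector space based on the case context of a certain case and the set of all new case contexts generated based on that case, the similarity between the case context vector corresponding to the case context and the new case context vector corresponding to the new case context.
- The new case generation method according to claim 17, comprising calculating, in a vector space generated as the vector space based on the set of case contexts of the cases of a certain case type and the set of all new case contexts generated based on any of those cases, the similarity between the case context vector corresponding to the case context and the new case context vector corresponding to the new case context.
- The new case generation method according to claim 12, comprising: accepting as input an information extraction rule for extracting specific information and extracting a predetermined extraction result from document data using the input information extraction rule;
accepting as input a case of the information to be extracted that is composed of the extracted extraction result and a case context that is surrounding text data containing the case, and generating, using document data, a new case that is a new case of the same type as the case and a new case context that is surrounding text data containing the new case and differs from the case context; and
generating a new information extraction rule based on the new cases output as the result of narrowing down the new cases.
- The new case generation method according to claim 20, comprising accepting the generated information extraction rule as a new input and extracting a predetermined extraction result from document data using the newly input information extraction rule.
- The new case generation method according to claim 12, comprising: calculating a pattern dissimilarity between data that is a part of the case context and data that is a part of the new case context; and
narrowing down and outputting the generated new cases based on the calculated similarity and pattern dissimilarity.
- A new case generation program for causing a computer to execute:
a new case generation process of accepting as input a case of information to be extracted and a case context, which is surrounding text data containing the case, and generating, using document data and based on the input case and case context, a new case that is a new case of the same type as the case and a new case context that is surrounding text data containing the new case and differs from the case context;
a similarity calculation process of calculating a similarity between the case context and the new case context; and
a new case narrowing process of narrowing down and outputting the generated new cases based on the calculated similarity.
- The new case generation program according to claim 23, causing the computer to:
execute an extraction rule application process of accepting as input an information extraction rule for extracting specific information and extracting a predetermined extraction result from document data using the input information extraction rule; and
execute, in the new case generation process, a process of generating, using document data and based on a case of the information to be extracted that is composed of the extracted extraction result, a new case that is a new case of the same type as the case and a new case context that is surrounding text data containing the new case and differs from the case context.
- The new case generation program according to claim 23 or claim 24, causing the computer to
execute, in the new case generation process, a process of generating, using document data, a new case that has the same character string as the character string corresponding to the case and whose new case context is text data different from the case context of the case.
- The new case generation program according to claim 23 or claim 24, causing the computer to
execute, in the new case generation process, a process of generating, using document data, a new case that has the same morpheme sequence pattern as a predetermined pattern of the morpheme sequence corresponding to the case and whose new case context is text data different from the case context of the case.
- The new case generation program according to any one of claims 23 to 26, causing the computer to
execute, in the new case generation process, a process of generating, as the new case context, text data containing at least one of a predetermined number of character strings, morphemes, sentences, or paragraphs existing around the new case.
- The new case generation program according to any one of claims 23 to 27, causing the computer to
execute, in the similarity calculation process, a process of calculating the similarity between the case context and the new case context by calculating, in a vector space generated based on the case context and the new case context, a similarity between a case context vector corresponding to the case context and a new case context vector corresponding to the new case context.
- The new case generation program according to claim 28, causing the computer to
execute, in the similarity calculation process, a process of calculating, in a vector space generated as the vector space based on the case context of a certain case and the set of all new case contexts generated based on that case, the similarity between the case context vector corresponding to the case context and the new case context vector corresponding to the new case context.
- The new case generation program according to claim 28, causing the computer to
execute, in the similarity calculation process, a process of calculating, in a vector space generated as the vector space based on the set of case contexts of the cases of a certain case type and the set of all new case contexts generated based on any of those cases, the similarity between the case context vector corresponding to the case context and the new case context vector corresponding to the new case context.
- The new case generation program according to claim 23, causing the computer to:
execute an extraction rule application process of accepting as input an information extraction rule for extracting specific information and extracting a predetermined extraction result from document data using the input information extraction rule;
execute, in the new case generation process, a process of accepting as input a case of the information to be extracted that is composed of the extracted extraction result and a case context that is surrounding text data containing the case, and generating, using document data, a new case that is a new case of the same type as the case and a new case context that is surrounding text data containing the new case and differs from the case context; and
further execute an information extraction rule generation process of generating a new information extraction rule based on the new cases output as the result of narrowing down the new cases.
- The new case generation program according to claim 31, causing the computer to
execute, in the extraction rule application process, a process of accepting the generated information extraction rule as a new input and extracting a predetermined extraction result from document data using the newly input information extraction rule.
- The new case generation program according to claim 23, causing the computer to:
execute, in the new case generation process, a process of calculating a pattern dissimilarity between data that is a part of the case context and data that is a part of the new case context; and
execute, in the new case narrowing process, a process of narrowing down and outputting the generated new cases based on the calculated similarity and pattern dissimilarity.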
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US12/922,396 US20110106849A1 (en) | 2008-03-12 | 2009-03-09 | New case generation device, new case generation method, and new case generation program |
| JP2010502718A JP5447368B2 (ja) | 2008-03-12 | 2009-03-09 | 新規事例生成装置、新規事例生成方法及び新規事例生成用プログラム |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2008-062610 | 2008-03-12 | ||
| JP2008062610 | 2008-03-12 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2009113289A1 (ja) | 2009-09-17 |
Family
ID=41064963
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2009/001046 Ceased WO2009113289A1 (ja) | 2008-03-12 | 2009-03-09 | 新規事例生成装置、新規事例生成方法及び新規事例生成用プログラム |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US20110106849A1 (ja) |
| JP (1) | JP5447368B2 (ja) |
| WO (1) | WO2009113289A1 (ja) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2011099355A1 (ja) * | 2010-02-12 | 2011-08-18 | 日本電気株式会社 | 文書分析装置、文書分析方法、およびコンピュータ読み取り可能な記録媒体 |
| JP2014199475A (ja) * | 2013-03-29 | 2014-10-23 | 株式会社エヌ・ティ・ティ・データ | 言語表現抽出装置、言語表現抽出方法およびプログラム |
| WO2023175954A1 (ja) * | 2022-03-18 | 2023-09-21 | 日本電気株式会社 | 情報処理装置、情報処理方法、及びコンピュータ読み取り可能な記録媒体 |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP6433468B2 (ja) * | 2016-09-28 | 2018-12-05 | 本田技研工業株式会社 | プログラム作成支援方法 |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2001034630A (ja) * | 1999-07-22 | 2001-02-09 | Fujitsu Ltd | 文書ベース検索システム、およびその方法 |
| JP2002132812A (ja) * | 2000-10-19 | 2002-05-10 | Nippon Telegr & Teleph Corp <Ntt> | 質問応答方法、質問応答システム及び質問応答プログラムを記録した記録媒体 |
| JP2003271669A (ja) * | 2002-03-15 | 2003-09-26 | Fujitsu Ltd | 話題抽出装置 |
| WO2006085661A1 (ja) * | 2005-02-08 | 2006-08-17 | Nec Corporation | 質問応答データ編集装置、質問応答データ編集方法、質問応答データ編集プログラム |
Family Cites Families (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20020133380A1 (en) * | 1998-02-19 | 2002-09-19 | Masataka Okayama | Portable information terminal surrounding formulation of an optimum plan |
| WO2002063493A1 (en) * | 2001-02-08 | 2002-08-15 | 2028, Inc. | Methods and systems for automated semantic knowledge leveraging graph theoretic analysis and the inherent structure of communication |
| JP4162223B2 (ja) * | 2003-05-30 | 2008-10-08 | 日本電信電話株式会社 | 自然文検索装置、その方法及びプログラム |
| US20050278623A1 (en) * | 2004-05-17 | 2005-12-15 | Dehlinger Peter J | Code, system, and method for generating documents |
| TW200807263A (en) * | 2006-07-19 | 2008-02-01 | Benq Corp | Document editing systems and methods |
| JP4997966B2 (ja) * | 2006-12-28 | 2012-08-15 | 富士通株式会社 | 対訳例文検索プログラム、対訳例文検索装置、および対訳例文検索方法 |
| US7937389B2 (en) * | 2007-11-01 | 2011-05-03 | Ut-Battelle, Llc | Dynamic reduction of dimensions of a document vector in a document search and retrieval system |
| JP2009169536A (ja) * | 2008-01-11 | 2009-07-30 | Ricoh Co Ltd | 情報処理装置、画像形成装置、ドキュメント生成方法、ドキュメント生成プログラム |
2009
- 2009-03-09 JP JP2010502718A patent/JP5447368B2/ja not_active Expired - Fee Related
- 2009-03-09 WO PCT/JP2009/001046 patent/WO2009113289A1/ja not_active Ceased
- 2009-03-09 US US12/922,396 patent/US20110106849A1/en not_active Abandoned
Non-Patent Citations (2)
| Title |
|---|
| KENJI KITA ET AL.: "Joho Kensaku Algorithm", 1 January 2002, KYORITSU SHUPPAN CO., LTD. * |
| MADOKA SATO ET AL.: "Netnews Kijigun no Jido Package-ka", TRANSACTIONS OF INFORMATION PROCESSING SOCIETY OF JAPAN, vol. 38, no. 6, 15 June 1997 (1997-06-15), pages 1225 - 1234 * |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2011099355A1 (ja) * | 2010-02-12 | 2011-08-18 | 日本電気株式会社 | 文書分析装置、文書分析方法、およびコンピュータ読み取り可能な記録媒体 |
| US9311392B2 (en) | 2010-02-12 | 2016-04-12 | Nec Corporation | Document analysis apparatus, document analysis method, and computer-readable recording medium |
| JP2014199475A (ja) * | 2013-03-29 | 2014-10-23 | 株式会社エヌ・ティ・ティ・データ | 言語表現抽出装置、言語表現抽出方法およびプログラム |
| WO2023175954A1 (ja) * | 2022-03-18 | 2023-09-21 | 日本電気株式会社 | 情報処理装置、情報処理方法、及びコンピュータ読み取り可能な記録媒体 |
| JPWO2023175954A1 (ja) * | 2022-03-18 | 2023-09-21 |
Also Published As
| Publication number | Publication date |
|---|---|
| US20110106849A1 (en) | 2011-05-05 |
| JP5447368B2 (ja) | 2014-03-19 |
| JPWO2009113289A1 (ja) | 2011-07-21 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP4656868B2 (ja) | | 構造化文書作成装置 |
| JP2007323671A (ja) | | 中国語テキストにおける単語分割 |
| JP2011118689A (ja) | | 検索方法及びシステム |
| JP2019082931A (ja) | | 検索装置、類似度算出方法、およびプログラム |
| JP2006065387A (ja) | | テキスト文検索装置、テキスト文検索方法、及びテキスト文検索プログラム |
| CN118395205B (zh) | | 一种多模态跨语言检测方法及装置 |
| JP5447368B2 (ja) | | 新規事例生成装置、新規事例生成方法及び新規事例生成用プログラム |
| JP4856573B2 (ja) | | 要約文生成装置及び要約文生成プログラム |
| US11842152B2 (en) | | Sentence structure vectorization device, sentence structure vectorization method, and storage medium storing sentence structure vectorization program |
| JP5169456B2 (ja) | | 文書検索システム、文書検索方法および文書検索プログラム |
| JP2004334382A (ja) | | 構造化文書要約装置、プログラムおよび記録媒体 |
| KR20170107808A (ko) | | 원문문장을 번역 소단위들로 분할하고 소번역단위들의 번역어순을 결정하는 번역어순패턴 데이터 구조, 이를 생성하기 위한 명령어들을 저장한 컴퓨터 판독가능한 저장매체 및 이를 가지고 번역을 수행하는 컴퓨터 판독가능한 저장매체에 저장된 번역 프로그램 |
| US12333245B2 (en) | | Methods and apparatus to improve disambiguation and interpretation in automated text analysis using structured language space and transducers applied on automatons |
| KR101835994B1 (ko) | | 키워드 맵을 이용한 전자책 검색 서비스 제공 방법 및 장치 |
| JP4985096B2 (ja) | | 文書解析システム、および文書解析方法、並びにコンピュータ・プログラム |
| JP7131130B2 (ja) | | 分類方法、装置、及びプログラム |
| JP4478042B2 (ja) | | 頻度情報付き単語集合生成方法、プログラムおよびプログラム記憶媒体、ならびに、頻度情報付き単語集合生成装置、テキスト索引語作成装置、全文検索装置およびテキスト分類装置 |
| JP4341077B2 (ja) | | 文書処理装置、文書処理方法、および、文書処理プログラム |
| JP2009176148A (ja) | | 未知語判定システム、方法及びプログラム |
| JP4148247B2 (ja) | | 語彙獲得方法及び装置及びプログラム及びコンピュータ読み取り可能な記録媒体 |
| JP2007164635A (ja) | | 同義語彙獲得方法及び装置及びプログラム |
| JP2000339342A (ja) | | 文書検索方法および文書検索装置 |
| JP2010122823A (ja) | | テキスト処理システム、情報処理装置、テキストおよび情報の処理方法ならびに処理プログラム |
| JP3939264B2 (ja) | | 形態素解析装置 |
| Bhowmik et al. | | Development of A Word Based Spell Checker for Bangla Language |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 09720295; Country of ref document: EP; Kind code of ref document: A1 |
| | DPE2 | Request for preliminary examination filed before expiration of 19th month from priority date (pct application filed from 20040101) | |
| | WWE | Wipo information: entry into national phase | Ref document number: 2010502718; Country of ref document: JP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWE | Wipo information: entry into national phase | Ref document number: 12922396; Country of ref document: US |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 09720295; Country of ref document: EP; Kind code of ref document: A1 |