CN120975082A - A method for identifying focus words in natural language questions for intelligent question answering systems - Google Patents
A method for identifying focus words in natural language questions for intelligent question answering systems
- Publication number
- CN120975082A (Application CN202510958085.2A)
- Authority
- CN
- China
- Prior art keywords
- focus
- question
- delta
- item
- natural language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Machine Translation (AREA)
Abstract
The invention discloses a method for identifying focus words in natural language questions, oriented to intelligent question-answering systems, and relates to the technical field of natural language question answering. The method enables a question-answering system to understand the user's point of attention more accurately. It provides a prefix tree structure dominated by decision items and, based on that prefix tree, introduces an algorithm for mining strong focus association rules to identify focus words, which is more efficient than the classical association rule mining algorithm Apriori. It further provides an inverted index over the strong focus association rules and an inverted-index-based focus word identification algorithm that is more efficient than sequential search. Finally, it defines the focus item set, frequent focus item set, focus association rule and strong focus association rule to better express information related to focus words.
Description
Technical Field
The invention relates to the technical field of natural language question answering, and in particular to a method, oriented to intelligent question-answering systems, for identifying focus words in natural language questions.
Background
In natural language question answering, the focus word is the core element for understanding the user's intent and pinpointing the answer. The focus word is the key point of the user's question, namely the word that makes it possible to locate the answer. For example, for the questions "Q1: Who created Goofy?", "Q2: Which cities does the Weser flow through?", "Q3: What is the longest river?" and "Q4: Give me a list of all lakes in Denmark", the focus words are "Who", "cities", "river" and "lakes", respectively. Accurate recognition of the focus word not only avoids ambiguity and misunderstanding, but also remarkably improves the recall and accuracy of answers.
In the early stage of research, existing natural language question-answering methods mostly answered questions by connecting keywords while ignoring focus word recognition. From the perspective of these methods, the keywords in a question are the most valuable words the system needs to search for an answer, so the methods concentrate on the recognition, mapping and combination of keywords. They mainly fall into two types: 1) understanding the question through the dependency relationships of the words in it, decomposing the question into a dependency parse tree in which each word corresponds to a node, so as to expose the logical relationships between words; 2) understanding the question by mapping entities and relations, jointly mapping an entity and a relation in the question, where the entity is one endpoint of the relation, and then retrieving the other endpoint of the relation as the answer.
As research on natural language question answering deepens, the focus words in questions have become a bottleneck for further improving question-answering performance. If a question-answering system is to answer more questions or understand the focus of user questions more accurately, focus word recognition becomes an unavoidable obstacle: when the system obtains a large amount of information related to a question but cannot accurately grasp what the questioner is asking about, all previous efforts are in vain. For example, for the question "Q3: What is the longest river?", existing methods may be confused between "What" and "river" and cannot be certain which is the user's point of attention. Likewise, for the question "Q4: Give me a list of all lakes in Denmark", existing methods also recognize, map and combine the two words "Give" and "list", which further confuses the system about the user's focus and makes it difficult to locate the answer "lakes" that the user really cares about.
In summary, to improve the understanding of natural language questions, a method capable of identifying the focus word in a natural language question needs to be designed to eliminate the question-answering system's confusion about the user's point of attention.
Disclosure of Invention
The invention aims to provide a method for identifying focus words in natural language questions, oriented to intelligent question-answering systems. It addresses the problem that existing natural language question-answering methods cannot accurately identify the focus word, and improves the system's ability to understand user intent by defining related concepts, proposing a prefix tree structure and an association rule mining algorithm, and designing an inverted index together with an efficient focus word identification algorithm.
In order to achieve this purpose, the invention provides the following technical scheme: a method for identifying the focus word in a natural language question, oriented to an intelligent question-answering system, comprising at least the following steps:
S1, a question decision information table is provided, laying a foundation for the subsequent steps;
S2, related concepts are defined, at least including the focus item set, the frequent focus item set, the focus association rule and the strong focus association rule;
S3, the question decision information table is converted into transaction data taking items as the basic unit;
S4, an algorithm for mining strong focus association rules, namely the MSFAR algorithm, is constructed;
S5, an inverted index is constructed for the strong focus association rules;
S6, an inverted-index-based focus word identification algorithm is constructed to complete the identification of specific focus words.
Further, the question decision information table is set as:
T = <U, A = C∪D, V, f>
wherein U is a finite universe composed of questions; C is the conditional attribute set, comprising the question type, the question structure words, the parts of speech of the question structure words, and the dependency structures related to the question structure words; D is the decision attribute set, comprising the dependency structures related to the focus word; V is the set of attribute values; and f: U×A→V is an information function assigning a value to each attribute of each question, i.e., if x∈U and a∈A, then f(x,a)∈V.
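For illustration only, the following minimal sketch shows one way such a decision information table might be represented; the attribute names and cell values are hypothetical, since the patent does not fix a concrete schema.

```python
# A minimal sketch of the question decision information table T = <U, A = C ∪ D, V, f>.
# All attribute names and values below are hypothetical illustrations.
from typing import Dict, List, Optional

# Conditional attributes C and decision attributes D (hypothetical names).
C_ATTRS = ["question_type", "structure_word", "structure_word_pos", "structure_word_dep"]
D_ATTRS = ["focus_word_dep"]

# U: each row describes one question x; a missing value is represented by None.
table: List[Dict[str, Optional[str]]] = [
    {"question_type": "which", "structure_word": "cities",
     "structure_word_pos": "NNS", "structure_word_dep": "nsubj",
     "focus_word_dep": "nsubj"},
    {"question_type": "what", "structure_word": "river",
     "structure_word_pos": "NN", "structure_word_dep": "attr",
     "focus_word_dep": "attr"},
]

def f(x: Dict[str, Optional[str]], a: str) -> Optional[str]:
    """Information function f: U × A → V, returning the value of attribute a for question x."""
    return x.get(a)
```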
Further, the step S2 at least includes the following steps:
S2.1, for a question decision information table T = <U, A, V, f>, let a∈A and v∈V, where a denotes a conditional attribute c_i or a decision attribute d_i; then (a, v) is an item, and φ is an item set formed by combining one or more items in the conjunctive form '∧';
S2.2, for a question decision information table T = <U, A = C∪D, V, f>, let c∈C, d∈D and v∈V; then (c, v) is a condition item and (d, v) is a decision item;
S2.3, when an item set φ contains only condition items, φ is a condition item set;
S2.4, when an item set φ contains one or more condition items and one decision item, φ is a focus item set;
S2.5, when the occurrence frequency of a focus item set φ reaches the minimum support threshold minsup specified by an expert, i.e., sup(φ) ≥ minsup, φ is a frequent focus item set;
S2.6, for an association rule φ1 → φ2 whose antecedent φ1 is a condition item set and whose consequent φ2 is a decision item, φ1 → φ2 is a focus association rule;
S2.7, for a focus association rule φ1 → φ2, when the corresponding focus item set is frequent and its confidence reaches the expert-specified minimum confidence threshold minconf, i.e., sup(φ1 ∧ φ2) ≥ minsup and conf(φ1 → φ2) = sup(φ1 ∧ φ2)/sup(φ1) ≥ minconf, φ1 → φ2 is a strong focus association rule.
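As a concrete reading of S2.5–S2.7, the sketch below computes support and confidence over transactions (sets of (attribute, value) items, matching the conversion in S3); the threshold values shown are illustrative assumptions, not values fixed by the patent.

```python
# Sketch of sup(·) and conf(·) from S2.5-S2.7; minsup and minconf are illustrative.
from typing import FrozenSet, List, Tuple

Item = Tuple[str, str]  # (attribute, value)

def sup(itemset: FrozenSet[Item], transactions: List[FrozenSet[Item]]) -> int:
    """Occurrence count of an item set across all transactions."""
    return sum(1 for t in transactions if itemset <= t)

def is_strong_rule(antecedent: FrozenSet[Item], consequent: Item,
                   transactions: List[FrozenSet[Item]],
                   minsup: int = 2, minconf: float = 0.8) -> bool:
    """S2.7: the focus item set must be frequent and the confidence high enough."""
    joint = sup(antecedent | {consequent}, transactions)
    base = sup(antecedent, transactions)
    return joint >= minsup and base > 0 and joint / base >= minconf
```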
Further, the step S3 at least includes the following steps:
S3.1, each value in the question decision information table is combined with the column name of the column it belongs to, forming an item; cells without a value are ignored;
S3.2, each row in the question decision information table is converted into one transaction.
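Under the same hypothetical schema as above, S3.1–S3.2 amount to pairing each non-empty cell with its column name:

```python
# S3.1/S3.2 sketch: pair each value with its column name, skip empty cells,
# and turn each table row into one transaction.
def row_to_transaction(row):
    return frozenset((col, val) for col, val in row.items() if val is not None)

# Reuses `table` from the decision information table sketch above.
transactions = [row_to_transaction(row) for row in table]
```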
Further, the step S4 at least includes the following steps:
S4.1, the question decision information table T is converted into a transaction data set δ_t;
S4.2, an empty prefix tree set δ_tree is initialized;
S4.3, each transaction t_i in the transaction data set δ_t is traversed in turn and the prefix tree branches are constructed;
S4.4, a set δ_itemsets is initialized as an empty set; this set will store triples <itemset, δ_tids, treeid>;
S4.5, each prefix tree in the prefix tree set δ_tree is traversed in turn to generate item sets into δ_itemsets;
S4.6, infrequent item sets are deleted from δ_itemsets;
S4.7, each prefix tree in the prefix tree set δ_tree is traversed to generate the strong focus association rule set δ_R.
Further, the step S4.3 at least includes the following steps:
A decision item set δ_d is obtained from the transaction t_i;
each decision item d_j in δ_d is traversed in turn, and for each decision item d_j the following steps are performed:
the prefix tree tree_j corresponding to the decision item d_j is selected from the prefix tree set δ_tree;
each condition item set φ_c in the transaction t_i is traversed in turn, and for each condition item set φ_c the following steps are performed:
when a branch corresponding to φ_c exists in tree_j, t_i is added to the transaction set δ_tid_of_branch of that branch; otherwise, a new branch corresponding to φ_c is added to tree_j, and a transaction set δ_tid_of_branch containing only t_i is created for it.
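The bookkeeping of S4.3 can be sketched as follows; to stay short, each prefix tree is flattened into a mapping from a branch's condition item set to its transaction id set δ_tid_of_branch, rather than a node-by-node tree, and the decision attribute name is the hypothetical one used above.

```python
# Sketch of S4.3: one prefix tree per decision item, each branch keyed by its
# condition item set and holding the ids of the transactions that reach it.
from collections import defaultdict

DECISION_ATTRS = {"focus_word_dep"}  # hypothetical decision attribute name

delta_tree = defaultdict(lambda: defaultdict(set))  # δ_tree[d][branch] = δ_tid_of_branch

def insert_transaction(tid, transaction):
    decision_items = {it for it in transaction if it[0] in DECISION_ATTRS}
    condition_items = frozenset(it for it in transaction if it[0] not in DECISION_ATTRS)
    for d in decision_items:                      # select (or create) the tree for d
        delta_tree[d][condition_items].add(tid)   # extend the branch or create it

for tid, t in enumerate(transactions):
    insert_transaction(tid, t)
```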
Further, the step S4.5 at least includes the following steps:
Each branch of the prefix tree tree_j is traversed in turn, and for each branch the following steps are performed:
the condition item set itemset of the current branch, the transaction sequence number set δ_tids, and the sequence number treeid of the current prefix tree are generated;
if the current condition item set itemset already exists in δ_itemsets, the δ_tids and treeid of that item set are updated; otherwise, <itemset, δ_tids, treeid> is added to the set δ_itemsets;
when the number of elements of the transaction sequence number set δ_tids is less than the minimum support minsup, the current branch is deleted.
Further, the step S4.7 at least includes the following steps:
The item sets related to the prefix tree tree_j are selected from δ_itemsets and stored in the related item set δ_itemsets_x;
each branch of the prefix tree tree_j is traversed in turn, and for each branch the following steps are performed:
the occurrence counts n_c of the condition item set and n_d of the decision item in the current branch are obtained from δ_itemsets_x;
if the confidence conf (conf = n_d/n_c) is greater than the minimum confidence minconf, a strong focus association rule r is generated from the current branch;
the generated strong focus association rule r and its confidence conf are added to the strong focus association rule set δ_R.
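Condensing S4.5–S4.7 under the flattened-tree representation above (and assuming each transaction carries exactly one decision item): count how often each condition item set occurs, prune infrequent branches, and emit a rule whenever conf = n_d/n_c reaches minconf. The staged triple bookkeeping <itemset, δ_tids, treeid> is folded into plain counters here.

```python
# Condensed sketch of S4.5-S4.7 over the flattened trees built above.
from collections import defaultdict

def mine_strong_rules(delta_tree, minsup=2, minconf=0.8):
    # n_c: occurrences of each condition item set across all trees (S4.5).
    n_c = defaultdict(int)
    for branches in delta_tree.values():
        for itemset, tids in branches.items():
            n_c[itemset] += len(tids)

    delta_R = []
    for d, branches in delta_tree.items():
        for itemset, tids in branches.items():
            n_d = len(tids)            # occurrences of (itemset ∧ d) in this tree
            if n_d < minsup:           # S4.6: discard infrequent item sets
                continue
            conf = n_d / n_c[itemset]
            if conf >= minconf:        # S4.7: keep only strong rules
                delta_R.append((itemset, d, conf))
    return delta_R

delta_R = mine_strong_rules(delta_tree)
```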
Further, the step S5 at least includes the following steps:
S5.1, a sequence number is generated for each strong focus association rule, together with the number of items in its antecedent;
S5.2, the inverted index is formed from index items and index records, where an index item is a condition item and an index record consists of several triples; each triple corresponds to the statistics of one strong focus association rule, comprising the rule's sequence number, the number of items in the rule's antecedent, and the rule's confidence.
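The structure of S5.1–S5.2 is then a straightforward postings map; the sketch below builds it from the δ_R produced by the mining sketch above.

```python
# Sketch of S5: inverted index from each condition item to triples
# (rule sequence number, number of items in the antecedent, confidence).
from collections import defaultdict

delta_index = defaultdict(list)
for rule_id, (antecedent, decision, conf) in enumerate(delta_R):   # S5.1
    for cond_item in antecedent:                                   # S5.2
        delta_index[cond_item].append((rule_id, len(antecedent), conf))
```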
Further, the step S6 at least includes the following steps:
S6.1, information such as the question type, the question structure words, the parts of speech of the question structure words and the various dependency structures is generated from a natural language question Q and placed in the question information set δ_information;
S6.2, the values of all attribute columns of the natural language question Q in the decision information table are generated from the question information set δ_information;
S6.3, the values of the natural language question Q in the decision information table are converted into an item set and placed in δ_item;
S6.4, the index records related to the current natural language question Q are selected based on the item set δ_item and the inverted index set δ_index, and placed in the index record set δ_statistics;
S6.5, the triples whose condition item occurrence count equals the number of items in the antecedent are selected from the index record set δ_statistics;
S6.6, the triples are sorted in descending order of confidence, the sequence numbers of the first k triples with the highest confidence are selected, the rules corresponding to those sequence numbers are then taken from the strong focus association rule set δ_R and placed in the rule set δ_rules;
S6.7, the final focus word is generated based on the question information set δ_information and the rule set δ_rules.
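The lookup core of S6.4–S6.6 can be sketched as below (k = 2 following the later experiments); generating δ_item from the question (S6.1–S6.3) depends on the parser, and extracting the final word from the matched rules (S6.7) is omitted.

```python
# Sketch of S6.4-S6.6: match rules via the inverted index, keep fully matched
# antecedents, and return the top-k rules by confidence.
from collections import Counter

def select_rules(delta_item, delta_index, delta_R, k=2):
    hits = Counter()   # rule id -> number of the question's items found in its antecedent
    meta = {}          # rule id -> (antecedent length, confidence)
    for item in delta_item:                                  # S6.4: gather postings
        for rule_id, ante_len, conf in delta_index.get(item, []):
            hits[rule_id] += 1
            meta[rule_id] = (ante_len, conf)
    # S6.5: keep rules whose whole antecedent occurs among the question's items.
    matched = [rid for rid, n in hits.items() if n == meta[rid][0]]
    # S6.6: sort by confidence, descending, and keep the top-k rules.
    matched.sort(key=lambda rid: meta[rid][1], reverse=True)
    return [delta_R[rid] for rid in matched[:k]]
```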
Compared with the prior art, the invention has the following beneficial effects:
1. The invention provides a method for identifying focus words in natural language questions, so that a question-answering system can understand the user's points of attention more accurately;
2. The invention provides a prefix tree structure dominated by decision items and, based on this prefix tree, introduces an algorithm for mining strong focus association rules to identify focus words; the algorithm is more efficient than the classical association rule mining algorithm Apriori;
3. The invention provides an inverted index for the strong focus association rules and introduces an inverted-index-based focus word identification algorithm that is more efficient than sequential search;
4. The invention defines the focus item set, frequent focus item set, focus association rule and strong focus association rule to better express information related to focus words.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 shows comparison results of the present invention on different training datasets;
FIG. 2 is a comprehensive comparison chart of the present invention;
FIG. 3 is a graphical representation of the run time versus minimum confidence threshold (minconf) for MSFAR and Apriori of the present invention;
FIG. 4 is a graphical representation of the run time versus minimum support threshold (minsup) for MSFAR and Apriori of the present invention;
FIG. 5 is a graphical representation of the run time versus minimum confidence threshold (minconf) for the sequential index and inverted index of the present invention;
FIG. 6 is a graphical representation of the run time versus minimum support threshold (minsup) for the sequential index and inverted index of the present invention;
FIG. 7 is a graph showing the impact of the first k rules of the present invention;
FIG. 8 is a graph showing the effect of minimum support (minsup) and minimum confidence (minconf) on results for comparison in accordance with the present invention;
FIG. 9 is a graph showing the comparison of the effect of training data set size on results in accordance with the present invention;
FIG. 10 is a comparison chart of the present invention on whether question words are case-sensitive and whether the first verb of a statement is reduced to its prototype;
FIG. 11 is a three-stage schematic diagram of the mining strong focus association rule algorithm (MSFAR) of the present invention;
FIG. 12 is a schematic diagram of the stage of constructing a prefix tree dominated by decision terms by the algorithm MSFAR of the present invention;
FIG. 13 is a flow chart of a condition item set generation stage of the MSFAR algorithm of the present invention;
FIG. 14 is a flow chart illustrating a stage of generating strong focus association rules by the MSFAR algorithm of the present invention;
FIG. 15 is a flowchart of an inverted index based focus word recognition algorithm (IFW) according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments.
Referring to FIG. 11 to FIG. 15, a method for identifying focus words in natural language questions, oriented to an intelligent question-answering system, comprises at least the following steps:
S1, a question decision information table is provided, laying a foundation for the subsequent steps;
S2, related concepts are defined, at least including the focus item set, the frequent focus item set, the focus association rule and the strong focus association rule;
S3, the question decision information table is converted into transaction data taking items as the basic unit;
S4, an algorithm for mining strong focus association rules, namely the MSFAR algorithm, is constructed;
S5, an inverted index is constructed for the strong focus association rules;
S6, an inverted-index-based focus word identification algorithm is constructed to complete the identification of specific focus words.
The question decision information table is set as:
T = <U, A = C∪D, V, f>
wherein U is a finite universe composed of questions; C is the conditional attribute set, comprising the question type, the question structure words, the parts of speech of the question structure words, and the dependency structures related to the question structure words; D is the decision attribute set, comprising the dependency structures related to the focus word; V is the set of attribute values; and f: U×A→V is an information function assigning a value to each attribute of each question, i.e., if x∈U and a∈A, then f(x,a)∈V.
S2 at least comprises the following steps:
S2.1, for a question decision information table T = <U, A, V, f>, let a∈A and v∈V, where a denotes a conditional attribute c_i or a decision attribute d_i; then (a, v) is an item, and φ is an item set formed by combining one or more items in the conjunctive form '∧';
S2.2, for a question decision information table T = <U, A = C∪D, V, f>, let c∈C, d∈D and v∈V; then (c, v) is a condition item and (d, v) is a decision item;
S2.3, when an item set φ contains only condition items, φ is a condition item set;
S2.4, when an item set φ contains one or more condition items and one decision item, φ is a focus item set;
S2.5, when the occurrence frequency of a focus item set φ reaches the minimum support threshold minsup specified by an expert, i.e., sup(φ) ≥ minsup, φ is a frequent focus item set;
S2.6, for an association rule φ1 → φ2 whose antecedent φ1 is a condition item set and whose consequent φ2 is a decision item, φ1 → φ2 is a focus association rule;
S2.7, for a focus association rule φ1 → φ2, when the corresponding focus item set is frequent and its confidence reaches the expert-specified minimum confidence threshold minconf, i.e., sup(φ1 ∧ φ2) ≥ minsup and conf(φ1 → φ2) = sup(φ1 ∧ φ2)/sup(φ1) ≥ minconf, φ1 → φ2 is a strong focus association rule.
S3 at least comprises the following steps:
S3.1, each value in the question decision information table is combined with the column name of the column it belongs to, forming an item; cells without a value are ignored;
S3.2, each row in the question decision information table is converted into one transaction.
S4 at least comprises the following steps:
S4.1, the question decision information table T is converted into a transaction data set δ_t;
S4.2, an empty prefix tree set δ_tree is initialized;
S4.3, each transaction t_i in the transaction data set δ_t is traversed in turn and the prefix tree branches are constructed;
S4.4, a set δ_itemsets is initialized as an empty set; this set will store triples <itemset, δ_tids, treeid>;
S4.5, each prefix tree in the prefix tree set δ_tree is traversed in turn to generate item sets into δ_itemsets;
S4.6, infrequent item sets are deleted from δ_itemsets;
S4.7, each prefix tree in the prefix tree set δ_tree is traversed to generate the strong focus association rule set δ_R.
S4.3 at least comprises the following steps:
A decision item set δ_d is obtained from the transaction t_i;
each decision item d_j in δ_d is traversed in turn, and for each decision item d_j the following steps are performed:
the prefix tree tree_j corresponding to the decision item d_j is selected from the prefix tree set δ_tree;
each condition item set φ_c in the transaction t_i is traversed in turn, and for each condition item set φ_c the following steps are performed:
when a branch corresponding to φ_c exists in tree_j, t_i is added to the transaction set δ_tid_of_branch of that branch; otherwise, a new branch corresponding to φ_c is added to tree_j, and a transaction set δ_tid_of_branch containing only t_i is created for it.
S4.5 at least comprises the following steps:
Each branch of the prefix tree tree_j is traversed in turn, and for each branch the following steps are performed:
the condition item set itemset of the current branch, the transaction sequence number set δ_tids, and the sequence number treeid of the current prefix tree are generated;
if the current condition item set itemset already exists in δ_itemsets, the δ_tids and treeid of that item set are updated; otherwise, <itemset, δ_tids, treeid> is added to the set δ_itemsets;
when the number of elements of the transaction sequence number set δ_tids is less than the minimum support minsup, the current branch is deleted.
S4.7 comprises at least the following steps:
The item sets related to the prefix tree tree_j are selected from δ_itemsets and stored in the related item set δ_itemsets_x;
each branch of the prefix tree tree_j is traversed in turn, and for each branch the following steps are performed:
the occurrence counts n_c of the condition item set and n_d of the decision item in the current branch are obtained from δ_itemsets_x;
if the confidence conf (conf = n_d/n_c) is greater than the minimum confidence minconf, a strong focus association rule r is generated from the current branch;
the generated strong focus association rule r and its confidence conf are added to the strong focus association rule set δ_R.
S5 at least comprises the following steps:
s5.1, generating a serial number for each strong focus association rule and the number of items in the front piece;
S5.2, forming an inverted index by the index item and the index record, wherein the index item is a condition item, the index record is formed by a plurality of triples, each triplet corresponds to the statistical information of a strong focus association rule, and the statistical information comprises the sequence number of the rule, the number of items in the rule front part and the confidence of the rule.
S6, at least comprising the following steps:
S6.1, information such as the question type, the question structure words, the parts of speech of the question structure words and the various dependency structures is generated from a natural language question Q and placed in the question information set δ_information;
S6.2, the values of all attribute columns of the natural language question Q in the decision information table are generated from the question information set δ_information;
S6.3, the values of the natural language question Q in the decision information table are converted into an item set and placed in δ_item;
S6.4, the index records related to the current natural language question Q are selected based on the item set δ_item and the inverted index set δ_index, and placed in the index record set δ_statistics;
S6.5, the triples whose condition item occurrence count equals the number of items in the antecedent are selected from the index record set δ_statistics;
S6.6, the triples are sorted in descending order of confidence, the sequence numbers of the first k triples with the highest confidence are selected, the rules corresponding to those sequence numbers are then taken from the strong focus association rule set δ_R and placed in the rule set δ_rules;
S6.7, the final focus word is generated based on the question information set δ_information and the rule set δ_rules.
Based on the above examples, the following remarks are made:
The experimental data are derived from two publicly available natural language question datasets: LC-QuAD (containing 4,625 questions) and QALD (containing 755 questions). The 100 questions in the test set were randomly selected from LC-QuAD and QALD. The focus words of all questions were annotated manually.
1. Validity of strong focus association rules
For the focus word identification problem, FIG. 1 and FIG. 2 compare the proposed method with existing machine learning algorithms. The horizontal axis is the minimum confidence threshold minconf and the vertical axis is the recognition rate over the 100 test questions. As can be seen from FIG. 1 and FIG. 2, on the LC-QuAD dataset, the QALD dataset, and the combined LC-QuAD and QALD dataset alike, our method is superior in accuracy to existing machine learning algorithms such as K-nearest neighbor (KNN), Bayes, Decision Tree and Support Vector Machine (SVM). Furthermore, as shown in the right-hand sub-graph of FIG. 2, the results obtained on the combined dataset have better stability and a higher overall recognition rate (over 90%) than the results on either dataset alone.
2. Comparison of MSFAR algorithm with Apriori algorithm
The MSFAR algorithm is proposed for mining strong focus association rules. FIG. 3 and FIG. 4 compare the MSFAR algorithm with the classical association rule mining algorithm Apriori. In FIG. 3 the horizontal axis is the minimum confidence threshold minconf (in FIG. 4 it is the minimum support threshold minsup), and the vertical axis is the runtime of mining strong focus association rules. As can be seen from FIG. 3, with the minimum support threshold minsup fixed, the mining time of both algorithms remains unchanged however the minimum confidence threshold minconf varies. The root cause is that for every association rule derived from a frequent pattern, whether or not it turns out to be a strong rule, its confidence must be calculated, so the confidence threshold does not affect the computational cost. As can be seen from FIG. 4, with the minimum confidence threshold minconf fixed, the runtime of mining strong focus association rules gradually decreases as the minimum support threshold minsup increases, because a larger minsup reduces the number of frequent patterns. Furthermore, FIG. 3 and FIG. 4 show that the MSFAR algorithm requires far less time than the Apriori algorithm in both cases.
3. Validity of inverted index
To quickly find the strong focus association rules used for identifying the focus word, we construct an inverted index over the strong focus association rules. FIG. 5 and FIG. 6 compare the inverted index with ordinary sequential search. In FIG. 5 the horizontal axis is the minimum confidence threshold minconf (in FIG. 6 it is the minimum support threshold minsup), and the vertical axis is the runtime of looking up strong focus association rules. As FIG. 5 and FIG. 6 show, however the minimum confidence threshold minconf and the minimum support threshold minsup vary, the lookup time required by the inverted index is far less than that of sequential search.
4. Influence of the first k rules
For each question there are several candidate strong focus association rules, which are ordered by their confidence. We evaluate the impact of using different numbers of rules on focus word recognition. As can be seen from FIG. 7, using the two rules ranked highest by confidence (top-2) significantly improves focus word recognition compared with using only the highest-ranked rule (top-1), while further increasing the number of rules yields little additional improvement. Therefore, we use the top-2 rules by confidence by default.
5. Influence of minimum support (minsup) and minimum confidence (minconf) on results
Mining strong focus association rules involves two parameters, the minimum support (minsup) and the minimum confidence (minconf), so we examine their impact on focus word recognition accuracy. As can be seen from FIG. 8, as minsup or minconf gradually increases, the accuracy of focus word recognition decreases slightly and gradually. Moreover, minconf has less effect on accuracy than minsup: the former causes a degradation within 0.05, the latter within 0.1. In general, although these two parameters do affect the results, their effect is relatively small, especially that of minconf.
6. Influence of training data set size on results
The size of the training dataset is an essential consideration in machine learning, so we examine the impact of training dataset size on accuracy, as shown in FIG. 9. The combined dataset of LC-QuAD and QALD contains 5,380 questions, 100 of which are used as the test set; 5,000 questions are selected from the remainder as the training pool. These 5,000 questions are divided into 10 equal parts; starting from one part, a new part is added each time until all 10 parts are combined, forming 10 training sets whose sizes range from 500 to 5,000. As can be seen from FIG. 9, regardless of the parameter settings, once the number of questions reaches 3,000 the accuracy stabilizes and hardly changes, which suggests that the quality of the questions in the training set matters more than their number.
7. Influence of query words and first verbs on results
The question dataset contains both questions and imperative statements. On the one hand, question words (such as "who", "which", "what") differ in case, for example, "Which companies are in the computer software industry?"; whether case should be distinguished is therefore a question. On the other hand, the forms of the first verb in statements are diverse, for example, "Give me the homepage of Forbes" and "List all board games by GMT"; whether such verbs need to be converted to their prototypes is therefore also a question. We therefore experimentally evaluated two candidate solutions for each problem, as shown in FIG. 10, where the Y-axis represents the number of questions whose focus word is accurately recognized. We used 50 questions and 50 statements as two separate test sets. As can be seen from FIG. 10, when the threshold is small, there is no difference between the two candidate solutions for either problem; when the threshold is larger, the case-sensitive results are slightly more accurate than the case-insensitive ones, and the results using verb prototypes are slightly more accurate than those using the original verb forms. Overall, these two issues have little impact on the results, since strong focus association rules are based mainly on dependency structures rather than the words themselves. Furthermore, focus words in statements are accurately recognized; all recognition failures are concentrated in questions.
To sum up:
1. The prior art ignores focus word recognition, so question-answering systems find it difficult to accurately understand user intent. With focus word recognition, the system can accurately grasp the user's point of attention, improving its understanding of natural language questions;
2. Existing association rule mining algorithms do not consider the specificity of focus association rules and mine them inefficiently. The decision-item-dominated prefix tree structure and mining algorithm of this patent mine the rules faster;
3. When selecting a suitable strong focus association rule for a natural language question, sequential search must compare rules one by one and is inefficient. The inverted index for strong focus association rules and its identification algorithm proposed in this patent are more efficient than sequential search;
4. The prior art lacks technical definitions of the concepts related to the focus word, which easily causes confusion in communication and research. The related terms defined in this patent make the rules and information related to focus words clearer and more precise.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Claims (10)
1. A method for identifying focus words in natural language questions, oriented to an intelligent question-answering system, characterized by comprising at least the following steps:
S1, a question decision information table is provided, laying a foundation for the subsequent steps;
S2, related concepts are defined, at least including the focus item set, the frequent focus item set, the focus association rule and the strong focus association rule;
S3, the question decision information table is converted into transaction data taking items as the basic unit;
S4, an algorithm for mining strong focus association rules, namely the MSFAR algorithm, is constructed;
S5, an inverted index is constructed for the strong focus association rules;
S6, an inverted-index-based focus word identification algorithm is constructed to complete the identification of specific focus words.
2. The method for identifying focus words in natural language questions oriented to an intelligent question-answering system according to claim 1, wherein the question decision information table is set as:
T = <U, A = C∪D, V, f>
wherein U is a finite universe composed of questions; C is the conditional attribute set, comprising the question type, the question structure words, the parts of speech of the question structure words, and the dependency structures related to the question structure words; D is the decision attribute set, comprising the dependency structures related to the focus word; V is the set of attribute values; and f: U×A→V is an information function assigning a value to each attribute of each question, i.e., if x∈U and a∈A, then f(x,a)∈V.
3. The method for identifying focus words in natural language questions oriented to an intelligent question-answering system according to claim 2, wherein S2 comprises the following steps:
S2.1, for a question decision information table T = <U, A, V, f>, let a∈A and v∈V, where a denotes a conditional attribute c_i or a decision attribute d_i; then (a, v) is an item, and φ is an item set formed by combining one or more items in the conjunctive form '∧';
S2.2, for a question decision information table T = <U, A = C∪D, V, f>, let c∈C, d∈D and v∈V; then (c, v) is a condition item and (d, v) is a decision item;
S2.3, when an item set φ contains only condition items, φ is a condition item set;
S2.4, when an item set φ contains one or more condition items and one decision item, φ is a focus item set;
S2.5, when the occurrence frequency of a focus item set φ reaches the minimum support threshold minsup specified by an expert, i.e., sup(φ) ≥ minsup, φ is a frequent focus item set;
S2.6, for an association rule φ1 → φ2 whose antecedent φ1 is a condition item set and whose consequent φ2 is a decision item, φ1 → φ2 is a focus association rule;
S2.7, for a focus association rule φ1 → φ2, when the corresponding focus item set is frequent and its confidence reaches the expert-specified minimum confidence threshold minconf, i.e., sup(φ1 ∧ φ2) ≥ minsup and conf(φ1 → φ2) = sup(φ1 ∧ φ2)/sup(φ1) ≥ minconf, φ1 → φ2 is a strong focus association rule.
4. The method for identifying focus words in natural language questions oriented to an intelligent question-answering system according to claim 3, wherein S3 comprises the following steps:
S3.1, each value in the question decision information table is combined with the column name of the column it belongs to, forming an item; cells without a value are ignored;
S3.2, each row in the question decision information table is converted into one transaction.
5. The method for identifying focus words in natural language questions oriented to an intelligent question-answering system according to claim 4, wherein S4 comprises the following steps:
S4.1, the question decision information table T is converted into a transaction data set δ_t;
S4.2, an empty prefix tree set δ_tree is initialized;
S4.3, each transaction t_i in the transaction data set δ_t is traversed in turn and the prefix tree branches are constructed;
S4.4, a set δ_itemsets is initialized as an empty set; this set will store triples <itemset, δ_tids, treeid>;
S4.5, each prefix tree in the prefix tree set δ_tree is traversed in turn to generate item sets into δ_itemsets;
S4.6, infrequent item sets are deleted from δ_itemsets;
S4.7, each prefix tree in the prefix tree set δ_tree is traversed to generate the strong focus association rule set δ_R.
6. The method for identifying focus words in natural language questions oriented to an intelligent question-answering system according to claim 5, wherein S4.3 comprises the following steps:
a decision item set δ_d is obtained from the transaction t_i;
each decision item d_j in δ_d is traversed in turn, and for each decision item d_j the following steps are performed:
the prefix tree tree_j corresponding to the decision item d_j is selected from the prefix tree set δ_tree;
each condition item set φ_c in the transaction t_i is traversed in turn, and for each condition item set φ_c the following steps are performed: when a branch corresponding to φ_c exists in tree_j, t_i is added to the transaction set δ_tid_of_branch of that branch; otherwise, a new branch corresponding to φ_c is added to tree_j, and a transaction set δ_tid_of_branch containing only t_i is created for it.
7. The method for identifying focus words in natural language questions oriented to an intelligent question-answering system according to claim 5, wherein S4.5 comprises the following steps:
each branch of the prefix tree tree_j is traversed in turn, and for each branch the following steps are performed:
the condition item set itemset of the current branch, the transaction sequence number set δ_tids, and the sequence number treeid of the current prefix tree are generated;
if the current condition item set itemset already exists in δ_itemsets, the δ_tids and treeid of that item set are updated; otherwise, <itemset, δ_tids, treeid> is added to the set δ_itemsets;
when the number of elements of the transaction sequence number set δ_tids is less than the minimum support minsup, the current branch is deleted.
8. The method for identifying focus words in natural language questions oriented to an intelligent question-answering system according to claim 5, wherein S4.7 comprises the following steps:
the item sets related to the prefix tree tree_j are selected from δ_itemsets and stored in the related item set δ_itemsets_x;
each branch of the prefix tree tree_j is traversed in turn, and for each branch the following steps are performed:
the occurrence counts n_c of the condition item set and n_d of the decision item in the current branch are obtained from δ_itemsets_x;
if the confidence conf (conf = n_d/n_c) is greater than the minimum confidence minconf, a strong focus association rule r is generated from the current branch;
the generated strong focus association rule r and its confidence conf are added to the strong focus association rule set δ_R.
9. The method for identifying focus words in natural language questions oriented to an intelligent question-answering system according to claim 5, wherein S5 comprises the following steps:
S5.1, a sequence number is generated for each strong focus association rule, together with the number of items in its antecedent;
S5.2, the inverted index is formed from index items and index records, where an index item is a condition item and an index record consists of several triples; each triple corresponds to the statistics of one strong focus association rule, comprising the rule's sequence number, the number of items in the rule's antecedent, and the rule's confidence.
10. The method for identifying focus words in natural language questions oriented to an intelligent question-answering system according to claim 5, wherein S6 comprises at least the following steps:
S6.1, information such as the question type, the question structure words, the parts of speech of the question structure words and the various dependency structures is generated from a natural language question Q and placed in the question information set δ_information;
S6.2, the values of all attribute columns of the natural language question Q in the decision information table are generated from the question information set δ_information;
S6.3, the values of the natural language question Q in the decision information table are converted into an item set and placed in δ_item;
S6.4, the index records related to the current natural language question Q are selected based on the item set δ_item and the inverted index set δ_index, and placed in the index record set δ_statistics;
S6.5, the triples whose condition item occurrence count equals the number of items in the antecedent are selected from the index record set δ_statistics;
S6.6, the triples are sorted in descending order of confidence, the sequence numbers of the first k triples with the highest confidence are selected, the rules corresponding to those sequence numbers are then taken from the strong focus association rule set δ_R and placed in the rule set δ_rules;
S6.7, the final focus word is generated based on the question information set δ_information and the rule set δ_rules.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510958085.2A CN120975082A (en) | 2025-07-11 | 2025-07-11 | A method for identifying focus words in natural language questions for intelligent question answering systems |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510958085.2A CN120975082A (en) | 2025-07-11 | 2025-07-11 | A method for identifying focus words in natural language questions for intelligent question answering systems |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN120975082A true CN120975082A (en) | 2025-11-18 |
Family
ID=97638349
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202510958085.2A Pending CN120975082A (en) | 2025-07-11 | 2025-07-11 | A method for identifying focus words in natural language questions for intelligent question answering systems |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN120975082A (en) |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |