
WO2012130145A1 - Method and apparatus for acquiring and searching for related knowledge information - Google Patents

Method and apparatus for acquiring and searching for related knowledge information

Info

Publication number
WO2012130145A1
WO2012130145A1 PCT/CN2012/073234 CN2012073234W WO2012130145A1 WO 2012130145 A1 WO2012130145 A1 WO 2012130145A1 CN 2012073234 W CN2012073234 W CN 2012073234W WO 2012130145 A1 WO2012130145 A1 WO 2012130145A1
Authority
WO
WIPO (PCT)
Prior art keywords
query
question
hotspot
knowledge
page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2012/073234
Other languages
English (en)
French (fr)
Inventor
杨明
王源
唐曼华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Original Assignee
Baidu Online Network Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baidu Online Network Technology Beijing Co Ltd
Priority to JP2014501426A (patent JP5780617B2, ja)
Publication of WO2012130145A1 (zh)
Legal status: Ceased

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3349Reuse of stored results of previous queries
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/957Browsing optimisation, e.g. caching or content distillation
    • G06F16/9574Browsing optimisation, e.g. caching or content distillation of access to content, e.g. by caching

Definitions

  • the present invention relates to the field of Internet communication technologies, and in particular, to a method and apparatus for acquiring and searching for related knowledge information.
  • A knowledge question answering system is a system that uses communication functions to obtain information. Users can submit questions to the system through a web page, check the status of the questions they have submitted, and decide which answer to adopt based on the answers received. Other users can view a question by visiting its page and answer it according to their own interests and knowledge.
  • the invention provides a method and device for acquiring and searching related knowledge information, so as to provide related knowledge information quickly and accurately.
  • a method for obtaining relevant knowledge information comprising:
  • step B: using the query mined in step A to form a question and publishing it on a page of the knowledge question answering platform;
  • the step A specifically includes:
  • a query having a question demand is identified in the search log, a hotspot query is determined in the search log, and the intersection of the identified queries having a question demand and the determined hotspot queries is taken.
  • the query identifying the query request specifically includes:
  • the words obtained after the word segmentation are respectively matched in the question attribute database to determine the question propensity score of each word;
  • the query attribute database stores each word obtained by the data mining method or the manual configuration method and the question propensity score corresponding to each word.
  • the question propensity score corresponding to the word is determined by the following factors:
  • the words are interrogative words, or the relationship between words and interrogative words.
  • the determining the hotspot query may include:
  • the search frequency of each query in each query group is added to determine the search frequency of each query group
  • the query formation excavated by using step A in step B specifically includes:
  • the excavated query is subjected to semantic-based word segmentation, and the words are tagged;
  • the word-processed word is compared with the pre-set question-sentence grammar, and the missing word is added to the word-processed word, and assembled into a question satisfying the question-sentence grammar.
  • the posting of the question on the page of the knowledge question answering platform specifically includes:
  • the step C specifically includes:
  • the method further includes:
  • if no related knowledge information for the question has appeared, or no high-quality answer to the question has appeared, the page where the question is located on the knowledge question answering platform is closed.
  • a method for searching related knowledge information is based on the foregoing method for acquiring related knowledge information, and the method for searching related knowledge information includes:
  • searching for pages matching the keywords of the query, wherein if a page on the knowledge question answering platform matching the keywords of the query is found, that page is included in the search results of the query and returned to the user.
  • An apparatus for acquiring related knowledge information comprising: a search request query mining unit, a question forming unit, a question issuing unit, and a knowledge acquiring unit;
  • the query mining unit is configured to analyze a search log and mine a hotspot query having a question demand
  • the question forming unit is configured to use the query excavated by the query mining unit to form a question
  • the question issuing unit is configured to post the question on a page of the knowledge question answering platform
  • the knowledge acquisition unit is configured to acquire related knowledge information of the question through a page of the knowledge question and answer platform.
  • the query mining unit specifically includes: a requirement identification subunit and a hotspot determination subunit;
  • the requirement identification subunit is configured to identify and output a query having a question request from the input query
  • the hotspot determining subunit is configured to determine and output a hotspot query from the input query
  • the input of the requirement identification subunit is a query in a search log
  • the input of the hotspot determination subunit is an output of the requirement identification subunit
  • the output of the hotspot determination subunit is the hotspot query having a question demand
  • the input of the hotspot determining subunit is a query of the search log
  • the input of the demand identifying subunit is an output of the hotspot determining subunit
  • the output of the demand identifying subunit is the hotspot query having the question demand
  • the input of the requirement identification subunit is a query in the search log
  • the input of the hotspot determination subunit is also a query in the search log
  • the device further includes: an intersection processing subunit, configured to take the intersection of the outputs of the hotspot determining subunit and the requirement identification subunit and to output a hotspot query having a question demand.
  • the requirement identification subunit specifically includes: a word segmentation processing module, a word scoring module, a query scoring module, and a requirement judging module;
  • the word segmentation processing module is configured to perform semantic-based word segmentation processing on the input query
  • the word scoring module is configured to match each word obtained after the word segmentation processing in a question attribute database, and determine a question propensity score of each word;
  • the query scoring module is configured to add the question propensity scores of the words to obtain a question propensity score of the input query;
  • the requirement judging module is configured to determine whether the question propensity score of the input query exceeds a preset question demand threshold; if so, the input query is determined to have a question demand; otherwise, the input query is determined not to have a question demand;
  • the query attribute database stores each word obtained by the data mining method or the manual configuration method and the question propensity score corresponding to each word.
  • the question propensity score corresponding to the word is determined by the following factors:
  • the words are interrogative words, or the relationship between words and interrogative words.
  • the hotspot determining subunit specifically includes: a clustering processing module, a frequency statistics module, a hotspot group determining module, and a hotspot query determining module;
  • the clustering processing module is configured to perform correlation-based clustering on the query to obtain each query group;
  • the frequency statistics module is configured to add search frequencies of each query in each query group to determine a search frequency of each query group;
  • the hotspot group determining module is configured to determine a query group whose search frequency exceeds a preset hotspot frequency as a hotspot query group;
  • the hotspot query determining module is configured to select a query from each hotspot query group as a hotspot query.
  • the question forming unit may include: a part-of-speech identifier sub-unit and a sentence assembly sub-unit;
  • the part-of-speech identifier sub-unit is configured to perform a semantic-based word segmentation process on the query excavated by the query mining unit, and put a part-of-speech tag;
  • the sentence assembly sub-unit is configured to compare the words obtained by word segmentation with a preset question-sentence grammar according to the part-of-speech tags, add the missing words, and assemble a question that satisfies the question-sentence grammar.
  • the question issuing unit specifically selects an ID from a preset set of simulated question IDs, and uses the selected ID to simulate a user posting the question formed by the question forming unit on a page of the knowledge question answering platform;
  • each ID in the set of simulated question IDs is treated by the knowledge question answering platform as the ID of a registered user.
  • the knowledge acquisition unit specifically obtains, from the page of the knowledge question answering platform, the related knowledge information that users provide in answer to the question, and determines a high-quality answer from the related knowledge information.
  • the device further includes:
  • a page maintenance unit configured to: when the posting duration of the question on the page of the knowledge question answering platform reaches a preset closing duration, close the page where the question is located on the knowledge question answering platform if no related knowledge information for the question has appeared or no high-quality answer to the question has appeared.
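The closing rule of the page maintenance unit can be sketched as a single predicate. This is only an illustration: the function name, its parameters, and the seven-day closing duration used in the usage note are assumptions, not values from the source.

```python
from datetime import datetime, timedelta


def should_close_page(posted_at, now, closing_duration,
                      answers, has_high_quality_answer):
    """Return True when a question page should be closed.

    The page is closed once the posting duration has reached the preset
    closing duration and either nobody has provided related knowledge
    information or no high-quality answer has been determined.
    """
    if now - posted_at < closing_duration:
        return False  # still within the allowed posting duration
    return len(answers) == 0 or not has_high_quality_answer
```

For instance, with a hypothetical closing duration of `timedelta(days=7)`, a page posted eight days ago with no answers would be closed, while a page of the same age that has a high-quality answer would stay open.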
  • An apparatus for searching related knowledge information comprising: the foregoing apparatus for acquiring related knowledge information, a user interaction unit, and a page search unit;
  • the user interaction unit is configured to receive a query input by a user
  • the page search unit is configured to search for pages matching the keywords of the query, and if a page posted on the knowledge question answering platform by the foregoing apparatus for acquiring related knowledge information matches the keywords of the query, to include that page in the search results of the query and return it to the user.
  • the present invention mines hotspot queries having a question demand by analyzing the search log, forms questions from the mined queries, and publishes them on pages of the knowledge question answering platform, so that when a user searches for a related question through the search engine, the page on which the question is located on the knowledge question answering platform can be returned to the user, and the user can obtain the related knowledge information of the question from that page. That is, with the invention, the related knowledge information available on the knowledge question answering platform can be provided quickly and accurately through the search engine; the user does not have to log in to the knowledge question answering platform, post a question, and wait for it to be answered before obtaining the related knowledge information.
  • FIG. 1 is a flowchart of a method for acquiring related knowledge information according to Embodiment 1 of the present invention
  • FIG. 2 is a flowchart of a method for determining a hotspot query according to Embodiment 2 of the present invention
  • FIG. 3 is a flowchart of a method for searching for related knowledge information according to Embodiment 3 of the present invention.
  • FIG. 4 is a structural diagram of an apparatus for acquiring related knowledge information according to Embodiment 4 of the present invention.
  • Figure 5 (a), (b) and (c) are three structural diagrams of the query mining unit provided in the fourth embodiment of the present invention.
  • FIG. 6 is a structural diagram of a requirement identification subunit provided by Embodiment 4 of the present invention.
  • FIG. 7 is a structural diagram of a hotspot determining subunit provided by Embodiment 4 of the present invention.
  • FIG. 8 is a structural diagram of an apparatus for searching for related knowledge information according to Embodiment 5 of the present invention.
  • FIG. 1 is a flowchart of a method for acquiring related knowledge information according to Embodiment 1 of the present invention. As shown in FIG. 1 , the method may include the following steps:
  • Step 101: Analyze the search log and mine hotspot queries having a question demand.
  • the search log can be analyzed periodically: the search log of the current period is captured, and hotspot queries having a question demand are then mined from it.
  • the period for analyzing the search log can be set flexibly. For example, with a daily period, hotspot queries having a question demand are mined from the search log of each day.
  • This step consists of two parts: one identifies whether a query in the search log has a question demand; the other determines the hotspot queries.
  • the two parts can be performed in either order or in parallel, and together they yield the hotspot queries having a question demand. That is, queries having a question demand can first be identified in the search log, and hotspot queries then determined among them; or hotspot queries can be determined first, and the queries having a question demand then identified among them; or the queries having a question demand and the hotspot queries can be determined in parallel, and the intersection of the two then taken.
  • identifying whether a query has a question demand may include: performing semantic-based word segmentation on the query; matching each word obtained by the segmentation in the question attribute database to determine the question propensity score of each word; adding the question propensity scores of the words to obtain the question propensity score of the query; and, if the question propensity score of the query exceeds the preset question demand threshold, determining that the query has a question demand, and otherwise determining that the query has no question demand.
  • the question attribute database stores the words obtained by data mining or manual configuration together with their corresponding question propensity scores.
  • the question propensity score of each word in the question attribute database may be determined by, but is not limited to, whether the word is an interrogative word and how strongly the word is associated with interrogative words. For example, interrogative words such as "what", "how", and "why" can be given the highest question propensity score; words that often appear in the context of interrogative words, such as "practice" and "method", can be considered strongly associated with interrogative words and given a relatively high question propensity score; other words, which are weakly associated with interrogative words, can be given a low question propensity score.
  • For example, suppose a query yields the words "fish-flavored shredded pork" and "practice" after semantic-based word segmentation. When the two words are matched in the question attribute database, "fish-flavored shredded pork" has no matching entry, so its question propensity score is 0, while "practice" is matched with a question propensity score of 70. Adding the two gives a question propensity score of 70 for the query. If the question demand threshold is set to 60, the query is considered to have a question demand.
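The scoring procedure in this example can be sketched as follows. This is a minimal sketch: the dictionary standing in for the question attribute database and all function names are assumptions; only the score of 70 for "practice" and the threshold of 60 come from the example in the text, and word segmentation is assumed to have been done already.

```python
# Hypothetical question attribute database: word -> question propensity score.
# Only "practice" = 70 is taken from the text; the other entries are
# illustrative assumptions.
QUESTION_ATTRIBUTES = {
    "what": 100,
    "how": 100,
    "why": 100,
    "practice": 70,
    "method": 70,
}

# Threshold used in the example in the text.
QUESTION_DEMAND_THRESHOLD = 60


def question_propensity(words, attributes=QUESTION_ATTRIBUTES):
    """Sum the per-word scores; words missing from the database score 0."""
    return sum(attributes.get(word, 0) for word in words)


def has_question_demand(words, threshold=QUESTION_DEMAND_THRESHOLD):
    """A query has a question demand when its score exceeds the threshold."""
    return question_propensity(words) > threshold


# Words obtained from the example query by (assumed) word segmentation.
segmented_query = ["fish-flavored shredded pork", "practice"]
query_score = question_propensity(segmented_query)
```

With the example words, `query_score` is 70, which exceeds the threshold of 60, so `has_question_demand(segmented_query)` is true.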
  • the hotspot queries having a question demand that are finally mined can be stored as a file in the database.
  • Step 102: Form a question from the mined query and post it on a page of the knowledge question answering platform.
  • each mined query can be analyzed and assembled on a semantic basis to form a question.
  • the mined query is subjected to semantic-based word segmentation, and the resulting words are given part-of-speech tags.
  • the words are then compared with the preset question-sentence grammar, and the missing words are added to form a question that satisfies that grammar.
  • the question sentence grammar can be flexibly set, as long as the requirements of the commonly used question syntax are met.
  • for example, suppose the question-sentence grammar is set as: [adjective/noun + function word] + noun + verb + interrogative auxiliary word + question mark, where [] denotes an optional part. If the words obtained from a query by word segmentation are nouns and verbs, the appropriate interrogative auxiliary word and question mark can be added to assemble the question.
  • for the query in the example above, the word "fish-flavored shredded pork" is tagged as a noun and "practice" is also tagged as a noun; the two are then compared with the preset question-sentence grammar. After the missing words, the interrogative auxiliary word, and the question mark are added, the resulting question can be "How is the practice of fish-flavored shredded pork?".
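The slot-filling idea behind the question-sentence grammar can be sketched as below. This is an assumption-laden illustration: the function name is hypothetical, the English fillers ("of", "is", "what") merely stand in for the words the grammar would add in the original language, the output keeps the slot order of the grammar rather than natural English word order, and the sketch handles at most two nouns.

```python
# English glosses standing in for the words the grammar may need to add.
# These fillers are illustrative assumptions, not taken from the source.
FUNCTION_WORD = "of"
DEFAULT_VERB = "is"
INTERROGATIVE = "what"
QUESTION_MARK = "?"


def assemble_question(tagged_words):
    """Fill the slots [noun + function word] + noun + verb + interrogative + '?'.

    tagged_words is a list of (word, part_of_speech) pairs produced by an
    upstream word segmentation and tagging step (assumed, not shown here).
    Returns the filled slots in grammar order as (slot, word) pairs.
    """
    nouns = [word for word, tag in tagged_words if tag == "noun"]
    verbs = [word for word, tag in tagged_words if tag == "verb"]
    if not nouns:
        raise ValueError("the grammar requires at least one noun")

    slots = []
    if len(nouns) > 1:
        # Optional part: a modifier noun followed by a function word.
        slots.append(("noun", nouns[0]))
        slots.append(("function word", FUNCTION_WORD))
    slots.append(("noun", nouns[-1]))
    # Missing words are filled in with defaults.
    slots.append(("verb", verbs[0] if verbs else DEFAULT_VERB))
    slots.append(("interrogative", INTERROGATIVE))
    slots.append(("symbol", QUESTION_MARK))
    return slots


example_slots = assemble_question(
    [("fish-flavored shredded pork", "noun"), ("practice", "noun")]
)
```

For the two nouns of the running example, the function word, verb, interrogative word, and question mark are all missing and are therefore supplied from the defaults.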
  • the knowledge question answering platform manages registered users by ID.
  • a set of simulated question IDs can be configured in advance.
  • the IDs in this set are all treated by the knowledge question answering platform as IDs of registered users.
  • when a question is posted, an ID can be selected from the preset set of simulated question IDs, for example an ID that has not yet been used, and the question is posted under that ID, simulating a registered user asking a question on the knowledge question answering platform.
  • the questions involved in the present invention are not limited to ordinary questions and may take other forms. For example, a question may be a request for a document, in which case the related knowledge information of the question may be a document uploaded by another user.
  • Step 103 Obtain relevant knowledge information of the question through a page on the knowledge question answering platform.
  • registered users on the knowledge question answering platform answer the question on its page, thereby providing related knowledge information.
  • a high-quality answer can be determined among the related knowledge information provided on the page; it can be determined by an administrator of the knowledge question answering platform, or automatically by the platform according to a preset high-quality answer selection strategy.
  • the high-quality answer selection strategy may be determined by one or any combination of the following factors: the level of the user who answers the question, the adoption rate of that user's answers, and the length of the related knowledge information.
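One way to combine the three factors is a simple averaged score. The source does not say how the factors are weighted, so the equal weighting, the normalization constants, the minimum score, and all names below are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    text: str             # the related knowledge information
    user_level: int       # level of the answering user
    adoption_rate: float  # share of the user's past answers adopted, 0..1


def answer_score(answer, max_level=10, target_length=200):
    """Average of three normalized factors (equal weights are an assumption)."""
    level_part = min(answer.user_level, max_level) / max_level
    length_part = min(len(answer.text), target_length) / target_length
    return (level_part + answer.adoption_rate + length_part) / 3


def pick_high_quality(answers, min_score=0.5):
    """Return the best-scoring answer, or None if none reaches min_score."""
    best = max(answers, key=answer_score, default=None)
    if best is None or answer_score(best) < min_score:
        return None
    return best
```

Returning `None` when no answer reaches the minimum score matches the case in which a page has answers but no high-quality answer yet.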
  • FIG. 2 is a flowchart of determining a hotspot query according to Embodiment 2 of the present invention. As shown in FIG. 2, the process may include the following steps:
  • Step 201 Perform correlation-based clustering on the query to obtain each query group.
  • if the hotspot queries are determined first, the clustering objects of this step are the queries in the captured search log.
  • if the queries having a question demand are identified first, the clustering objects of this step are the queries having a question demand identified in the search log.
  • the queries contained in each query group are highly correlated with one another. For example, different wordings of the same topic, such as "World Expo" and "Expo", are highly correlated and satisfy the clustering requirement, so they are clustered into one query group.
  • Step 202 Add the search frequency of each query in the query group to determine the search frequency of the entire query group.
  • the search frequency of each query can be counted, and the search frequency of each query in each query group is added, which can be used as the search frequency of the entire query group, reflecting the heat of the entire query group.
  • Step 203: Determine whether the search frequency of the query group exceeds the preset hotspot frequency. If so, execute step 204; otherwise, determine that the query group is not a hotspot query group.
  • for example, if the three queries in the World Expo query group were searched 10,000, 20,000, and 30,000 times respectively within the set time, the search frequency of the whole query group within the set time is 60,000. If the preset hotspot frequency is 50,000, the query group is determined to be a hotspot query group.
  • Step 204 Determine that the query group is a hotspot query group, and select a query from the hotspot query group as a hotspot query.
  • the strategy of selecting a hotspot query from the hot query group may include, but is not limited to, the following strategies: selecting the query with the highest search frequency, selecting any query, selecting the query with the best semantic integrity, and the like.
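Steps 202 to 204 can be sketched as follows, using the numbers from the example (10,000 + 20,000 + 30,000 against a hotspot frequency of 50,000). The correlation-based clustering of step 201 is assumed to have been done upstream; the function name and the placeholder query strings are assumptions, and the selection strategy shown is the "highest search frequency" option.

```python
def select_hotspot_queries(query_groups, frequencies, hotspot_frequency):
    """Pick one hotspot query from every query group that is hot enough.

    query_groups: list of lists of query strings (clustering done upstream).
    frequencies: mapping from query string to its search frequency.
    """
    hotspot_queries = []
    for group in query_groups:
        # Step 202: the group frequency is the sum of its queries' frequencies.
        group_frequency = sum(frequencies.get(query, 0) for query in group)
        # Step 203: compare with the preset hotspot frequency.
        if group_frequency > hotspot_frequency:
            # Step 204: choose the query with the highest search frequency.
            hotspot_queries.append(
                max(group, key=lambda query: frequencies.get(query, 0))
            )
    return hotspot_queries


# Placeholder names for three correlated queries and one unrelated query.
frequencies = {"expo-1": 10_000, "expo-2": 20_000, "expo-3": 30_000, "other": 100}
groups = [["expo-1", "expo-2", "expo-3"], ["other"]]
hot = select_hotspot_queries(groups, frequencies, hotspot_frequency=50_000)
```

The first group totals 60,000 and is therefore a hotspot query group, from which the most frequently searched query is selected; the second group stays below the hotspot frequency and contributes nothing.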
  • FIG. 3 is a flowchart of a method for searching for related knowledge information according to Embodiment 3 of the present invention. As shown in FIG. 3, the method for searching related knowledge information may include the following steps:
  • Step 301 Receive a query input by a user.
  • Step 302: Search for pages matching the keywords of the query; if a page on the knowledge question answering platform matching the keywords of the query is found, that page is included in the search results of the query and returned to the user.
  • when the search engine receives the user's query from the browser and searches for pages according to it, the background has already simulated a user asking the question and posted it on the knowledge question answering platform in advance, through the process shown in FIG. 1. Therefore, when the search engine searches the captured pages for pages matching the keywords of the query, it can find the matching page on the knowledge question answering platform, which already contains the relevant question and the related knowledge information answering it. The search engine can thus quickly and accurately return, in the search results, the related knowledge information already available on the knowledge question answering platform.
  • the pages of the knowledge question answering platform can also be handled specially: the search engine is allowed to capture only those pages on the knowledge question answering platform that already have a high-quality answer. That is, if a question page on the knowledge question answering platform has no high-quality answer, that page is not included in the search results returned to the user.
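The search step with the special handling of knowledge-platform pages can be sketched as below. The `Page` structure, the substring-based keyword matching, and the example URLs are simplifying assumptions; a real search engine would match against an index rather than scan page text.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str
    on_knowledge_platform: bool = False
    has_high_quality_answer: bool = False


def search(keywords, pages):
    """Return the pages matching all keywords.

    Question pages of the knowledge question answering platform are only
    returned when they already have a high-quality answer.
    """
    results = []
    for page in pages:
        text = page.text.lower()
        if not all(keyword.lower() in text for keyword in keywords):
            continue  # the page does not match the query keywords
        if page.on_knowledge_platform and not page.has_high_quality_answer:
            continue  # unanswered question pages are withheld
        results.append(page)
    return results


pages = [
    Page("https://example.com/a", "practice of shredded pork", True, True),
    Page("https://example.com/b", "practice of shredded pork", True, False),
    Page("https://example.com/c", "shredded pork practice notes"),
    Page("https://example.com/d", "unrelated page"),
]
found = search(["shredded pork", "practice"], pages)
```

Here the answered question page and the ordinary web page are returned, while the question page without a high-quality answer and the unrelated page are left out.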
  • the apparatus may include: a query mining unit 400, a question forming unit 410, a question issuing unit 420, and a knowledge acquiring unit 430.
  • the query mining unit 400 is configured to analyze the search log and mine hotspot queries having a question demand.
  • the search log analyzed by the query mining unit 400 may be a periodically captured search log.
  • the question forming unit 410 is configured to form a question from the hotspot query mined by the query mining unit 400.
  • the question issuing unit 420 is configured to post the question on the page of the knowledge question answering platform.
  • the knowledge acquisition unit 430 is configured to obtain related knowledge information of the question through the page of the knowledge question answering platform.
  • the question issuing unit 420 and the knowledge obtaining unit 430 may be units independent of the knowledge question answering platform, or may be units arranged in the knowledge question answering platform.
  • the structure of the query mining unit 400 may be as shown in FIG. 5, and specifically includes a requirement identification subunit 401 and a hotspot determination subunit 402.
  • the requirement identification sub-unit 401 is configured to identify and output a query with a question request from the input query.
  • the hotspot determination subunit 402 is configured to determine and output a hotspot query from the input query.
  • the input of the demand identification subunit 401 may be the queries in the captured search log, and the input of the hotspot determination subunit 402 is the output of the demand identification subunit 401. In this case, the output of the hotspot determination subunit 402 is the hotspot query having a question demand.
  • the connection between the demand identification sub-unit 401 and the hotspot determination sub-unit 402 in this case is shown in FIG. 5(a).
  • alternatively, the input of the hotspot determination subunit 402 is the queries in the search log, and the input of the demand identification subunit 401 is the output of the hotspot determination subunit 402. In this case, the output of the demand identification subunit 401 is the hotspot query having a question demand.
  • the connection between the demand identification sub-unit 401 and the hotspot determination sub-unit 402 in this case is shown in FIG. 5(b).
  • alternatively, the input of the demand identification sub-unit 401 is the queries in the captured search log, and the input of the hotspot determination sub-unit 402 is also the queries in the captured search log. In this case, the connection between the two sub-units is shown in FIG. 5(c), and the apparatus may further include a sub-unit that takes the intersection of the outputs of the hotspot determination sub-unit 402 and the demand identification sub-unit 401, namely the intersection processing sub-unit 403 shown in FIG. 5(c), whose output is the hotspot query having a question demand.
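The three connection schemes of FIG. 5 can be sketched as two small compositions. The stage functions below are stand-ins with made-up rules and frequencies, used only to make the sketch runnable; they are not the identification or hotspot logic of the source.

```python
def mine_serial(queries, first_stage, second_stage):
    """FIG. 5(a)/(b): the second stage only sees the first stage's output."""
    return second_stage(first_stage(queries))


def mine_parallel(queries, identify_demand, determine_hotspot):
    """FIG. 5(c): both stages see all queries; the intersection is output."""
    hotspot = set(determine_hotspot(queries))
    return [query for query in identify_demand(queries) if query in hotspot]


# Stand-in stages with hypothetical rules and data.
FREQUENCY = {"how to cook rice": 5000, "weather": 9000, "how to knit": 10}


def identify_demand(queries):
    return [query for query in queries if "how" in query.split()]


def determine_hotspot(queries):
    return [query for query in queries if FREQUENCY.get(query, 0) > 1000]


log_queries = ["how to cook rice", "weather", "how to knit"]
scheme_a = mine_serial(log_queries, identify_demand, determine_hotspot)
scheme_b = mine_serial(log_queries, determine_hotspot, identify_demand)
scheme_c = mine_parallel(log_queries, identify_demand, determine_hotspot)
```

With these per-query stand-in rules the three schemes give the same result. With group-based hotspot determination that need not hold, because filtering the queries first changes which queries are clustered and summed.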
  • the structure of the requirement identification sub-unit 401 can be as shown in FIG. 6, and specifically includes: a word segmentation processing module 601, a word scoring module 602, a query scoring module 603, and a requirement judging module 604.
  • the word segmentation processing module 601 is configured to perform semantic-based word segmentation processing on the input query.
  • the word scoring module 602 is configured to match each word after the word segmentation in the question attribute database to determine the question propensity score of each word.
  • the question attribute database stores each word obtained by the data mining method or the manual configuration method and the question propensity score corresponding to each word.
  • the query scoring module 603 is configured to add the question propensity scores of the words to obtain the question propensity score of the input query.
  • the requirement judging module 604 is configured to determine whether the question propensity score of the input query exceeds a preset question demand threshold; if so, the input query is determined to have a question demand; otherwise, the input query is determined to have no question demand.
  • the question propensity score corresponding to the above words may be determined by, but not limited to, the following factors: whether the words are interrogative words, or the relationship between the words and the interrogative words.
  • the structure of the hotspot determining sub-unit 402 may be as shown in FIG. 7, and specifically includes: a clustering processing module 701, a frequency statistics module 702, a hotspot group determining module 703, and a hotspot query determining module 704.
  • the clustering processing module 701 is configured to perform correlation-based clustering on the query to obtain each query group.
  • the frequency statistics module 702 is configured to add the search frequency of each query in each query group to determine the search frequency of each query group.
  • the hotspot group determining module 703 is configured to determine the query group whose search frequency exceeds the preset hotspot frequency as the hotspot query group.
  • the hotspot query determining module 704 is configured to select a query from each hotspot query group as a hotspot query.
  • the strategy for selecting a hotspot query from the hotspot query group may include, but is not limited to, selecting the query with the highest search frequency, selecting any query, or selecting the query with the best semantic integrity.
  • the question forming unit 410 may specifically include a part-of-speech identifier sub-unit 411 and a sentence assembly sub-unit 412.
  • the part-of-speech tag sub-unit 411 is configured to perform semantic-based word segmentation on the hotspot query mined by the query mining unit 400 and to attach part-of-speech tags to the resulting words.
  • the part-of-speech tag sub-unit 411 may itself perform word segmentation; that is, it first performs semantic-based word segmentation on the hotspot query mined by the query mining unit 400 and then attaches part-of-speech tags to the words obtained.
  • alternatively, the part-of-speech tag sub-unit 411 may not perform word segmentation itself, and instead directly uses the segmentation result produced for the hotspot query by the word segmentation processing module 601 in the requirement identification sub-unit 401, attaching part-of-speech tags to the words obtained.
  • the sentence assembly sub-unit 412 is configured to compare the words obtained by word segmentation with the preset question-sentence grammar according to their part-of-speech tags, add the missing words, and assemble a question that satisfies the question-sentence grammar.
  • The knowledge question answering platform manages its registered users by ID.
  • A set of simulated question IDs can be set in advance.
  • The knowledge question answering platform treats the IDs in this set as IDs of registered users by default.
  • In this case, the question issuing unit 420 may select one ID from the preset set of simulated question IDs and use the selected ID to simulate a user posting the question formed by the question forming unit 410 on a page of the knowledge question answering platform.
  • The knowledge obtaining unit 430 can obtain, from the page of the knowledge question answering platform, the related knowledge information that answering users provide in response to the question, and further determine a high-quality answer from that information.
  • The high-quality answer may be determined with an administrator's participation, or by the knowledge question answering platform according to one or a combination of: the level of the user answering the question, the adoption rate of that user's answers, and the length of the related knowledge information.
  • The device further includes a page maintenance unit 440, configured to close the page on which the question is posted on the knowledge question answering platform if, when the posting duration reaches the preset closing duration, no related knowledge information for the question has appeared or no high-quality answer to the question has appeared.
  • FIG. 8 is a structural diagram of an apparatus for searching for related knowledge information according to Embodiment 5 of the present invention.
  • The apparatus includes the apparatus shown in FIG. 4, a user interaction unit 801, and a page search unit 802.
  • The user interaction unit 801 is configured to receive a query input by a user.
  • The page search unit 802 is configured to search for pages that match the keywords of the query. If a matching page is found among the pages on which the apparatus shown in FIG. 4 has posted questions on the knowledge question answering platform, that page is included in the search results of the query and returned to the user.
  • In other words, the pages crawled by the search engine also include the question pages on the knowledge question answering platform.
  • The pages of the knowledge question answering platform can also be specially processed: the page search unit 802 is allowed to find only those pages on the knowledge question answering platform that already have a high-quality answer. If a question page on the platform does not yet have a high-quality answer, the search results returned to the user will not include that page; that is, the search engine is set not to crawl pages on the knowledge question answering platform whose questions have not yet received a high-quality answer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

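The present invention provides a method and apparatus for acquiring and searching for related knowledge information. The method includes: analyzing search logs to mine hotspot search requests (queries) that have a question need; forming questions from the mined queries and posting them on pages of a knowledge question answering platform; and obtaining related knowledge information for the questions through the pages of the knowledge question answering platform. When a query input by a user is received, pages matching the keywords of the query are searched for; if a page matching the keywords of the query is found on the knowledge question answering platform, that page is included in the search results of the query and returned to the user. The invention makes it possible to provide related knowledge information to users quickly and accurately, without requiring the user to log in to the knowledge question answering platform, post a question, and wait for the question to be answered.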

Description

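Method and Apparatus for Acquiring and Searching for Related Knowledge Information
This application claims priority to Chinese patent application No. 201110081274.4, filed on March 31, 2011 and entitled "Method and Apparatus for Acquiring and Searching for Related Knowledge Information".
【Technical Field】
The present invention relates to the field of Internet communication technologies, and in particular to a method and apparatus for acquiring and searching for related knowledge information.
【Background】
With the rapid development of Internet technology, obtaining information and communicating over the Internet has become part of daily life. A knowledge question answering system is a system that uses communication functions to obtain information: a user can submit questions to the system through a web page, check the status of submitted questions, and decide which answer to adopt based on the answers received. Other users can view the questions by visiting the web page and answer them according to their own interests and knowledge.
However, after asking a question on a knowledge question answering system, a user must wait for other users to answer it before obtaining the needed knowledge information. As a result, when a user has an urgent question, the related knowledge information cannot be provided quickly and accurately.
【Summary of the Invention】
The present invention provides a method and apparatus for acquiring and searching for related knowledge information, so that related knowledge information can be provided quickly and accurately.
The specific technical solutions are as follows: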
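A method for acquiring related knowledge information, the method comprising:
A. analyzing search logs to mine hotspot search requests (queries) that have a question need;
B. forming a question from each query mined in step A and posting it on a page of a knowledge question answering platform;
C. obtaining related knowledge information for the question through the page of the knowledge question answering platform.
Step A specifically includes:
identifying queries that have a question need in the search logs, and determining hotspot queries among the queries that have a question need; or,
determining hotspot queries in the search logs, and identifying queries that have a question need among the determined hotspot queries; or,
identifying queries that have a question need in the search logs, determining hotspot queries in the search logs, and taking the intersection of the identified queries that have a question need and the determined hotspot queries.
In addition, identifying queries that have a question need specifically includes:
performing semantic-based word segmentation on a query;
matching each word obtained by the word segmentation against a question attribute database to determine the question-tendency score of each word;
summing the question-tendency scores of the words to obtain the question-tendency score of the query;
determining whether the question-tendency score of the query exceeds a preset question-need threshold; if so, determining that the query has a question need; otherwise, determining that the query does not have a question need;
wherein the question attribute database stores words obtained by data mining or manual configuration, together with the question-tendency score corresponding to each word.
The question-tendency score corresponding to a word is determined by the following factors:
whether the word is an interrogative word, or the association between the word and interrogative words.
Specifically, determining hotspot queries may include:
performing relevance-based clustering on the queries to obtain query groups;
summing the search frequencies of the queries in each query group to determine the search frequency of each query group;
determining each query group whose search frequency exceeds a preset hotspot frequency to be a hotspot query group;
selecting one query from the hotspot query group as a hotspot query.
In step B, forming a question from a query mined in step A specifically includes:
attaching part-of-speech tags to the words obtained by semantic-based word segmentation of the mined query;
comparing, according to the attached part-of-speech tags, the words obtained by the word segmentation with a preset question sentence grammar, adding the missing words, and assembling a question that satisfies the question sentence grammar.
Posting the question on a page of the knowledge question answering platform specifically includes:
selecting one ID from a preset simulated question ID set, and using that ID to simulate a user posting the question on a page of the knowledge question answering platform; the IDs in the simulated question ID set are treated by the knowledge question answering platform as IDs of registered users by default.
Preferably, step C specifically includes:
obtaining, from the page of the knowledge question answering platform, the related knowledge information that answering users provide in response to the question, and determining a high-quality answer from the related knowledge information.
Further, the method also includes:
if, when the posting duration of the question on the page of the knowledge question answering platform reaches a preset closing duration, no related knowledge information for the question has appeared or no high-quality answer to the question has appeared, closing the page on which the question is posted on the knowledge question answering platform.
A method for searching for related knowledge information, based on the above method for acquiring related knowledge information, the method for searching for related knowledge information comprising:
receiving a query input by a user;
searching for pages that match the keywords of the query; wherein, if a page matching the keywords of the query is found on the knowledge question answering platform, that page is included in the search results of the query and returned to the user.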
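An apparatus for acquiring related knowledge information, the apparatus comprising: a search request (query) mining unit, a question forming unit, a question issuing unit, and a knowledge obtaining unit;
the query mining unit is configured to analyze search logs and mine hotspot queries that have a question need;
the question forming unit is configured to form a question from each query mined by the query mining unit;
the question issuing unit is configured to post the question on a page of a knowledge question answering platform;
the knowledge obtaining unit is configured to obtain related knowledge information for the question through the page of the knowledge question answering platform.
The query mining unit specifically includes a requirement identification sub-unit and a hotspot determining sub-unit;
the requirement identification sub-unit is configured to identify, among its input queries, the queries that have a question need and to output them;
the hotspot determining sub-unit is configured to determine, among its input queries, the hotspot queries and to output them;
wherein the input of the requirement identification sub-unit is the queries in the search logs, the input of the hotspot determining sub-unit is the output of the requirement identification sub-unit, and the output of the hotspot determining sub-unit is the hotspot queries that have a question need;
or, the input of the hotspot determining sub-unit is the queries in the search logs, the input of the requirement identification sub-unit is the output of the hotspot determining sub-unit, and the output of the requirement identification sub-unit is the hotspot queries that have a question need; or,
the input of the requirement identification sub-unit is the queries in the search logs and the input of the hotspot determining sub-unit is also the queries in the search logs, in which case the apparatus further includes an intersection processing sub-unit, configured to take the intersection of the outputs of the hotspot determining sub-unit and the requirement identification sub-unit and to output the hotspot queries that have a question need.
The requirement identification sub-unit specifically includes: a word segmentation processing module, a word scoring module, a query scoring module, and a requirement judging module;
the word segmentation processing module is configured to perform semantic-based word segmentation on an input query;
the word scoring module is configured to match each word obtained by the word segmentation against a question attribute database to determine the question-tendency score of each word;
the query scoring module is configured to sum the question-tendency scores of the words to obtain the question-tendency score of the input query;
the requirement judging module is configured to determine whether the question-tendency score of the input query exceeds a preset question-need threshold; if so, the input query is determined to have a question need; otherwise, the input query is determined not to have a question need;
wherein the question attribute database stores words obtained by data mining or manual configuration, together with the question-tendency score corresponding to each word.
Specifically, the question-tendency score corresponding to a word is determined by the following factors:
whether the word is an interrogative word, or the association between the word and interrogative words.
In addition, the hotspot determining sub-unit specifically includes: a clustering processing module, a frequency statistics module, a hotspot group determining module, and a hotspot query determining module;
the clustering processing module is configured to perform relevance-based clustering on the queries to obtain query groups;
the frequency statistics module is configured to sum the search frequencies of the queries in each query group to determine the search frequency of each query group;
the hotspot group determining module is configured to determine each query group whose search frequency exceeds a preset hotspot frequency to be a hotspot query group;
the hotspot query determining module is configured to select one query from each hotspot query group as a hotspot query.
Specifically, the question forming unit may include a part-of-speech tagging sub-unit and a sentence assembly sub-unit;
the part-of-speech tagging sub-unit is configured to attach part-of-speech tags to the words obtained by semantic-based word segmentation of the query mined by the query mining unit;
the sentence assembly sub-unit is configured to compare, according to the attached part-of-speech tags, the words obtained by the word segmentation with a preset question sentence grammar, add the missing words, and assemble a question that satisfies the question sentence grammar.
The question issuing unit specifically selects one ID from a preset simulated question ID set and uses the selected ID to simulate a user posting the question formed by the question forming unit on a page of the knowledge question answering platform;
the IDs in the simulated question ID set are treated by the knowledge question answering platform as IDs of registered users by default.
The knowledge obtaining unit specifically obtains, from the page of the knowledge question answering platform, the related knowledge information that answering users provide in response to the question, and determines a high-quality answer from the related knowledge information.
Further, the apparatus also includes:
a page maintenance unit, configured to close the page on which the question is posted on the knowledge question answering platform if, when the posting duration of the question on the page reaches a preset closing duration, no related knowledge information for the question has appeared or no high-quality answer to the question has appeared.
An apparatus for searching for related knowledge information, the apparatus comprising: the above apparatus for acquiring related knowledge information, a user interaction unit, and a page search unit;
the user interaction unit is configured to receive a query input by a user;
the page search unit is configured to search for pages that match the keywords of the query; if a page matching the keywords of the query is found among the pages on which the above apparatus for acquiring related knowledge information has posted questions on the knowledge question answering platform, the page found is included in the search results of the query and returned to the user.
As can be seen from the above technical solutions, the present invention analyzes search logs to mine hotspot queries that have a question need, forms questions from the mined queries, and posts them on pages of the knowledge question answering platform. When a user has a related question, the search engine can therefore return the page on which that question appears on the knowledge question answering platform, and the user can obtain the related knowledge information for the question from that page. In other words, the invention allows the search engine to quickly and accurately provide users with related knowledge information that already exists on the knowledge question answering platform; the user need not log in to the platform, post a question, and wait for it to be answered in order to obtain that information.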
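【Brief Description of the Drawings】
FIG. 1 is a flowchart of the method for acquiring related knowledge information provided by Embodiment 1 of the present invention;
FIG. 2 is a flowchart of the method for determining hotspot queries provided by Embodiment 2 of the present invention;
FIG. 3 is a flowchart of the method for searching for related knowledge information provided by Embodiment 3 of the present invention;
FIG. 4 is a structural diagram of the apparatus for acquiring related knowledge information provided by Embodiment 4 of the present invention;
FIG. 5 (a), (b), and (c) are three structural diagrams of the query mining unit provided by Embodiment 4 of the present invention;
FIG. 6 is a structural diagram of the requirement identification sub-unit provided by Embodiment 4 of the present invention;
FIG. 7 is a structural diagram of the hotspot determining sub-unit provided by Embodiment 4 of the present invention; and
FIG. 8 is a structural diagram of the apparatus for searching for related knowledge information provided by Embodiment 5 of the present invention.
【Detailed Description】
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in detail below with reference to the drawings and specific embodiments.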
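Embodiment 1
FIG. 1 is a flowchart of the method for acquiring related knowledge information provided by Embodiment 1 of the present invention. As shown in FIG. 1, the method may include the following steps:
Step 101: Analyze search logs and mine hotspot queries that have a question need.
The search logs can be analyzed periodically: the search logs for the current period are collected, and hotspot queries that have a question need are then mined from them. The analysis period can be set flexibly; for example, with a period of one day, hotspot queries that have a question need are mined from that day's search logs.
This step actually consists of two parts: one identifies whether the queries in the search logs have a question need, and the other determines hotspot queries. The two parts can be performed in either order or in parallel, and together they yield the hotspot queries that have a question need. That is, the queries that have a question need can be identified in the search logs first and the hotspot queries then determined among them; or the hotspot queries can be determined first and the queries that have a question need then identified among them; or the queries that have a question need and the hotspot queries can be determined separately in parallel and their intersection taken.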
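The three orderings described above can be sketched as follows. This is only an illustration under stated assumptions: the search log is modeled as a dictionary from query text to search frequency, and `toy_identify` and `toy_hot` are hypothetical stand-ins for the two parts of the step, not the algorithms of this document. With a real clustering-based hotspot stage, the orderings need not produce identical results, because the stage sees a different input in each case.

```python
from typing import Callable, Dict

QueryLog = Dict[str, int]  # query text -> search frequency in the current period
Stage = Callable[[QueryLog], QueryLog]


def mine_sequential(log: QueryLog, first: Stage, second: Stage) -> QueryLog:
    """Run one stage on the log and feed its output to the other stage."""
    return second(first(log))


def mine_by_intersection(log: QueryLog, identify: Stage, hot: Stage) -> QueryLog:
    """Run both stages on the full log and keep the queries returned by both."""
    with_need = identify(log)
    hot_queries = hot(log)
    return {q: f for q, f in with_need.items() if q in hot_queries}


# Toy stand-ins (assumptions for illustration only).
def toy_identify(log: QueryLog) -> QueryLog:
    return {q: f for q, f in log.items() if "how" in q.split()}


def toy_hot(log: QueryLog) -> QueryLog:
    return {q: f for q, f in log.items() if f > 100}


log = {"how to cook rice": 500, "weather": 900, "how to knit": 3}
a = mine_sequential(log, toy_identify, toy_hot)       # question need first, then hotspot
b = mine_sequential(log, toy_hot, toy_identify)       # hotspot first, then question need
c = mine_by_intersection(log, toy_identify, toy_hot)  # both on the log, then intersect
print(a, b, c)
```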
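The process of identifying whether a query has a question need may include: performing semantic-based word segmentation on the query; matching each word obtained by the word segmentation against a question attribute database to determine the question-tendency score of each word; summing the question-tendency scores of the words to obtain the question-tendency score of the query; and, if the question-tendency score of the query exceeds a preset question-need threshold, determining that the query has a question need, and otherwise determining that it does not.
The question attribute database stores words obtained by data mining or manual configuration, together with their corresponding question-tendency scores.
The question-tendency score corresponding to each word in the question attribute database may be determined by, but is not limited to, the following factors: whether the word is an interrogative word, and the association between the word and interrogative words. For example, interrogative words such as "哪些" (which), "什么" (what), "怎么" (how), "如何" (how), and "为何" (why) can be given the highest question-tendency score; words that frequently appear in the context of interrogative words, such as "做法" (way of making), "方法" (method), and "方式" (manner), can be regarded as strongly associated with interrogative words and given a relatively high question-tendency score; other words with little association with interrogative words can be given a low question-tendency score.
As an example, take the query "鱼香肉丝 做法" (yu-shiang shredded pork, way of making) input by a user. Semantic-based word segmentation yields the two words "鱼香肉丝" and "做法". When these are matched against the question attribute database, "鱼香肉丝" has no matching entry, so its question-tendency score is taken to be 0, while "做法" matches an entry with a question-tendency score of 70. Summing the two gives a question-tendency score of 70 for the query; if the question-need threshold is set to 60, the query can be regarded as having a question need.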
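The scoring procedure and the worked example above can be reproduced with a minimal sketch. Several things here are assumptions: whitespace splitting stands in for semantic-based word segmentation, the dictionary `QUESTION_ATTRIBUTES` is a hypothetical miniature of the question attribute database, and every score in it other than the 70 for "做法" is invented for illustration. The threshold of 60 is the one used in the example.

```python
from typing import Dict, List

# Hypothetical question attribute database: word -> question-tendency score.
# Only the score of 70 for "做法" comes from the example; the rest are made up.
QUESTION_ATTRIBUTES: Dict[str, int] = {
    "什么": 100,
    "怎么": 100,
    "如何": 100,
    "做法": 70,
    "方法": 70,
}


def segment(query: str) -> List[str]:
    """Stand-in for semantic-based word segmentation: split on whitespace."""
    return query.split()


def question_tendency_score(query: str, db: Dict[str, int] = QUESTION_ATTRIBUTES) -> int:
    """Sum the scores of the query's words; words missing from the database score 0."""
    return sum(db.get(word, 0) for word in segment(query))


def has_question_need(query: str, threshold: int = 60,
                      db: Dict[str, int] = QUESTION_ATTRIBUTES) -> bool:
    """A query has a question need when its score exceeds the threshold."""
    return question_tendency_score(query, db) > threshold


print(question_tendency_score("鱼香肉丝 做法"))  # 0 + 70 = 70
print(has_question_need("鱼香肉丝 做法"))        # True, since 70 > 60
```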
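The process of determining hotspot queries is described in detail in Embodiment 2.
The hotspot queries with a question need that are finally mined can be stored as a file in a database.
Step 102: Form questions from the mined queries and post them on pages of the knowledge question answering platform.
In this step, each mined query can be analyzed semantically and assembled into a question.
Specifically, this includes the following process:
First, part-of-speech tags are attached to the words obtained by semantic-based word segmentation of the mined query.
Then, these words are compared with a preset question sentence grammar and the missing words are added, forming a question that satisfies the question sentence grammar.
The question sentence grammar can be set flexibly, as long as it meets the requirements of common question syntax. For example, the question sentence grammar may be set to: [adjective/noun + function word] + noun + verb + interrogative particle + question mark, where [] denotes an optional part. If the words obtained by segmenting a query are a noun and a verb, a suitable interrogative particle and question mark can be added to assemble the final question.
Taking "鱼香肉丝 做法" again as an example: after word segmentation, "鱼香肉丝" is tagged as a noun and "做法" is tagged as a noun. They are then compared with the predefined question sentence grammar, and the missing function word, verb, interrogative particle, and question mark are added; the resulting question may be "鱼香肉丝的做法是怎样的?" (What is the way of making yu-shiang shredded pork?).
As another example, if the question sentence grammar is set to: noun + verb + interrogative particle + noun + question mark, the question finally formed from the query "鱼香肉丝 做法" may be "鱼香肉丝具有哪些做法?" (What ways of making yu-shiang shredded pork are there?).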
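One way to picture the assembly step is as filling slots in a template. The sketch below is an assumption-laden illustration, not the document's implementation: a grammar is a list of slots, where a `("tag", pos)` slot consumes the next query word with that part of speech and an `("add", text)` slot supplies a missing word. The tag name `"n"` and the two concrete grammars are hypothetical encodings of the two examples above, with the optional prefix already expanded; part-of-speech tagging itself is assumed to have been done.

```python
from typing import List, Optional, Tuple

TaggedWord = Tuple[str, str]  # (word, part-of-speech tag)
Slot = Tuple[str, str]        # ("tag", pos) consumes a query word; ("add", text) inserts a word


def assemble_question(tagged: List[TaggedWord], grammar: List[Slot]) -> Optional[str]:
    """Fill the grammar's slots in order; return None if the query does not fit."""
    remaining = list(tagged)
    parts: List[str] = []
    for kind, value in grammar:
        if kind == "add":
            parts.append(value)        # word missing from the query, supplied by the grammar
        elif remaining and remaining[0][1] == value:
            parts.append(remaining.pop(0)[0])
        else:
            return None                # next query word has the wrong part of speech
    return "".join(parts) if not remaining else None


# Hypothetical encodings of the two example grammars.
GRAMMAR_1: List[Slot] = [("tag", "n"), ("add", "的"), ("tag", "n"),
                         ("add", "是"), ("add", "怎样的"), ("add", "?")]
GRAMMAR_2: List[Slot] = [("tag", "n"), ("add", "具有"), ("add", "哪些"),
                         ("tag", "n"), ("add", "?")]

tagged_query = [("鱼香肉丝", "n"), ("做法", "n")]
print(assemble_question(tagged_query, GRAMMAR_1))  # 鱼香肉丝的做法是怎样的?
print(assemble_question(tagged_query, GRAMMAR_2))  # 鱼香肉丝具有哪些做法?
```

A query whose tags do not fit a grammar yields `None`, so a caller could try several grammars in turn and keep the first that matches.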
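In addition, some existing knowledge question answering platforms allow only their registered users to ask questions, and manage registered users by ID. To accommodate this, a simulated question ID set can be preset, in which every ID is treated by the knowledge question answering platform as the ID of a registered user by default. When a question formed by the method of this embodiment is posted on a page of the knowledge question answering platform, an unused ID can be selected from the preset simulated question ID set for the posting, so as to simulate a registered user of the platform asking the question.
The questions involved in the present invention are not limited to ordinary questions; other forms of question are also applicable. For example, a question may be a request for a certain document, in which case the related knowledge information for the question may be a document uploaded by another user.
Step 103: Obtain the related knowledge information for the question through the page of the knowledge question answering platform.
After a question is posted on the knowledge question answering platform, registered users of the platform provide related knowledge information by answering on the question's page.
Preferably, a high-quality answer can be determined from the related knowledge information given on the page. The high-quality answer can be determined with the participation of an administrator of the knowledge question answering platform, or automatically by the platform according to a preset high-quality answer selection strategy. The selection strategy can be determined by one or any combination of the following factors: the level of the user answering the question, the adoption rate of that user's answers, the length of the related knowledge information, and so on.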
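An automatic selection strategy combining the three listed factors could, for instance, be a weighted sum. The text does not specify how the factors are combined, so everything in this sketch is an assumption: the `Answer` fields, the weights, the length cap, and the minimum score are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Answer:
    text: str
    user_level: int       # level of the answering user on the platform
    adoption_rate: float  # fraction of this user's past answers that were adopted, 0..1


def quality_score(answer: Answer, w_level: float = 1.0, w_adoption: float = 10.0,
                  w_length: float = 0.01, max_length: int = 500) -> float:
    """Weighted sum of the three factors; answer length is capped to limit padding."""
    capped_length = min(len(answer.text), max_length)
    return (w_level * answer.user_level
            + w_adoption * answer.adoption_rate
            + w_length * capped_length)


def pick_high_quality(answers: List[Answer], min_score: float) -> Optional[Answer]:
    """Return the best-scoring answer if it reaches min_score, else None."""
    best = max(answers, key=quality_score, default=None)
    if best is not None and quality_score(best) >= min_score:
        return best
    return None


answers = [Answer("short", 1, 0.1), Answer("x" * 200, 5, 0.8)]
best = pick_high_quality(answers, min_score=10.0)
print(best.user_level if best else None)
```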
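In addition, after a question has been posted on a page of the knowledge question answering platform, if the posting duration reaches the preset closing duration and no related knowledge information for the question has appeared, or no high-quality answer to the question has appeared, the page on which the question is posted can be closed on the knowledge question answering platform.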
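The closing rule can be sketched as a small predicate. The function name, its parameters, and the `require_quality` switch (which selects between the two alternative conditions named in the text) are hypothetical; the closing duration itself is whatever the platform presets.

```python
from datetime import datetime, timedelta


def should_close(posted_at: datetime, now: datetime, close_after: timedelta,
                 answer_count: int, has_quality_answer: bool,
                 require_quality: bool = True) -> bool:
    """Decide whether a question page should be closed.

    The page is considered only once it has been posted for at least close_after.
    It is then closed if it has no answers at all, or, when require_quality is
    set, if none of its answers is a high-quality answer.
    """
    if now - posted_at < close_after:
        return False
    if answer_count == 0:
        return True
    return require_quality and not has_quality_answer


posted = datetime(2012, 3, 1)
week = timedelta(days=7)
print(should_close(posted, datetime(2012, 3, 9), week, answer_count=0, has_quality_answer=False))
```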
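This concludes the flow of Embodiment 1. The process of determining hotspot queries is described below with reference to Embodiment 2.
Embodiment 2
FIG. 2 is a flowchart of determining hotspot queries provided by Embodiment 2 of the present invention. As shown in FIG. 2, the flow may include the following steps:
Step 201: Perform relevance-based clustering on the queries to obtain query groups.
If identifying queries that have a question need in the search logs is performed in parallel with determining hotspot queries, or if hotspot queries are determined first and queries that have a question need are then identified among them, the objects clustered in this step are the queries in the collected search logs.
If queries that have a question need are first identified in the search logs and hotspot queries are then determined among the queries that have a question need, the objects clustered in this step are the queries identified in the search logs as having a question need.
After clustering, the queries in each query group are highly relevant to one another. For example, the queries "世界博览会" (World Exposition), "世博会" (World Expo), and "世博" (Expo) are highly relevant to one another and meet the clustering requirement, so they are clustered into one query group.
For each query group, the following steps 202 to 204 are performed.
Step 202: Sum the search frequencies of the queries in the query group to determine the search frequency of the whole query group.
The search frequency of each query can be counted from the search logs. The sum of the search frequencies of the queries in a query group can serve as the search frequency of the whole group and reflects how hot the group is.
Step 203: Determine whether the search frequency of the query group exceeds the preset hotspot frequency; if so, perform step 204; otherwise, determine that the query group is not a hotspot query group.
For example, for the query group formed by "世界博览会", "世博会", and "世博", suppose that within the set time the search frequency of "世界博览会" is 10,000, that of "世博会" is 20,000, and that of "世博" is 30,000; the search frequency of the whole query group within the set time is then 60,000. If the preset hotspot frequency is 50,000, the query group can be determined to be a hotspot query group.
Step 204: Determine that the query group is a hotspot query group, and select one query from the hotspot query group as a hotspot query.
The strategy for selecting a hotspot query from a hotspot query group may include, but is not limited to, the following: selecting the query with the highest search frequency, selecting any query, selecting the query with the best semantic integrity, and so on.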
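Steps 201 to 204 can be sketched as below, using the frequencies from the example above. The clustering itself is not specified in the text, so it is abstracted here as a `cluster_of` function that maps a query to a group key; the synonym table backing it is a hypothetical stand-in for relevance-based clustering. The selection strategy shown is "highest search frequency", one of the listed options.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def hot_queries(freqs: Dict[str, int], cluster_of: Callable[[str], str],
                hot_threshold: int) -> List[str]:
    """Group queries, sum group frequencies, and pick one query per hot group."""
    groups: Dict[str, Dict[str, int]] = defaultdict(dict)
    for query, freq in freqs.items():
        groups[cluster_of(query)][query] = freq            # step 201: clustering

    selected: List[str] = []
    for members in groups.values():
        group_freq = sum(members.values())                 # step 202: group frequency
        if group_freq > hot_threshold:                     # step 203: hotspot test
            selected.append(max(members, key=members.get)) # step 204: most searched query
    return selected


# Hypothetical stand-in for relevance-based clustering: a synonym table.
SYNONYMS = {"世界博览会": "expo", "世博会": "expo", "世博": "expo"}


def toy_cluster(query: str) -> str:
    return SYNONYMS.get(query, query)


freqs = {"世界博览会": 10_000, "世博会": 20_000, "世博": 30_000, "天气": 40_000}
print(hot_queries(freqs, toy_cluster, hot_threshold=50_000))  # group sum 60,000 > 50,000
```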
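This concludes the flow shown in Embodiment 2. A method for searching for related knowledge information that builds on the method shown in FIG. 1 is described below with reference to Embodiment 3.
Embodiment 3
FIG. 3 is a flowchart of the method for searching for related knowledge information provided by Embodiment 3 of the present invention. As shown in FIG. 3, the method for searching for related knowledge information may include the following steps:
Step 301: Receive a query input by a user.
Step 302: Search for pages that match the keywords of the query; if a page matching the keywords of the query is found on the knowledge question answering platform, include that page in the search results of the query returned to the user.
Based on the flow shown in FIG. 1, when the search engine receives a user's query sent by the browser and searches for pages according to that query, the back end has already simulated users' questions and posted them on pages of the knowledge question answering platform in accordance with the flow shown in FIG. 1. Therefore, when the search engine searches the crawled pages for pages matching the keywords of the query, it can match a page on the knowledge question answering platform that already contains the relevant question and the related knowledge information given in answer to it.
In other words, because the back end has already mined the hotspot queries that have a question need, formed questions from them, and obtained related knowledge information on pages of the knowledge question answering platform, when a user enters a query into the search engine, the search engine can quickly and accurately return, in the search results, related knowledge information that already exists on the knowledge question answering platform.
In addition, the pages of the knowledge question answering platform can be specially processed so that the search engine is allowed to crawl only pages on the platform that already have a high-quality answer; that is, if a question page on the knowledge question answering platform does not yet have a high-quality answer, the search results returned to the user will not include that question page.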
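The search step, including the optional restriction to question pages that already have a high-quality answer, can be sketched as follows. This is a toy model under stated assumptions: pages are held in memory, matching means that every whitespace-separated keyword of the query occurs in the page text, and the `Page` fields and example URLs are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    url: str
    text: str
    is_qa_page: bool = False          # page of the knowledge question answering platform
    has_quality_answer: bool = False  # only meaningful for question pages


def search(query: str, pages: List[Page], require_quality_answer: bool = True) -> List[Page]:
    """Return pages containing every keyword, optionally hiding unanswered question pages."""
    keywords = query.split()
    results: List[Page] = []
    for page in pages:
        if not all(keyword in page.text for keyword in keywords):
            continue
        if page.is_qa_page and require_quality_answer and not page.has_quality_answer:
            continue  # question page without a high-quality answer is left out
        results.append(page)
    return results


pages = [
    Page("https://qa.example/q1", "鱼香肉丝的做法是怎样的?", True, True),
    Page("https://qa.example/q2", "鱼香肉丝具有哪些做法?", True, False),
    Page("https://food.example/a", "鱼香肉丝 做法 大全"),
]
print([p.url for p in search("鱼香肉丝 做法", pages)])
```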
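The above is a detailed description of the method provided by the present invention. The apparatus for acquiring related knowledge information provided by the present invention is described in detail below through Embodiment 4.
Embodiment 4
FIG. 4 is a structural diagram of the apparatus for acquiring related knowledge information provided by Embodiment 4 of the present invention. As shown in FIG. 4, the apparatus may include a query mining unit 400, a question forming unit 410, a question issuing unit 420, and a knowledge obtaining unit 430.
The query mining unit 400 is configured to analyze search logs and mine hotspot queries that have a question need.
The search logs analyzed by the query mining unit 400 may be search logs collected periodically.
The question forming unit 410 is configured to form questions from the hotspot queries mined by the query mining unit 400.
The question issuing unit 420 is configured to post the questions on pages of the knowledge question answering platform.
The knowledge obtaining unit 430 is configured to obtain related knowledge information for the questions through the pages of the knowledge question answering platform.
The question issuing unit 420 and the knowledge obtaining unit 430 may be units independent of the knowledge question answering platform, or units provided within the knowledge question answering platform.
The structure of the query mining unit 400 may be as shown in FIG. 5; it specifically includes a requirement identification sub-unit 401 and a hotspot determining sub-unit 402.
The requirement identification sub-unit 401 is configured to identify, among its input queries, the queries that have a question need and to output them.
The hotspot determining sub-unit 402 is configured to determine, among its input queries, the hotspot queries and to output them.
The input of the requirement identification sub-unit 401 may be the queries in the collected search logs and the input of the hotspot determining sub-unit 402 may be the output of the requirement identification sub-unit 401; in this case, the output of the hotspot determining sub-unit 402 is the hotspot queries that have a question need. The connection between the requirement identification sub-unit 401 and the hotspot determining sub-unit 402 in this case is shown in FIG. 5 (a).
Alternatively, the input of the hotspot determining sub-unit 402 is the queries in the search logs and the input of the requirement identification sub-unit 401 is the output of the hotspot determining sub-unit 402; in this case, the output of the requirement identification sub-unit 401 is the hotspot queries that have a question need. The connection between the two sub-units in this case is shown in FIG. 5 (b).
As a further alternative, the input of the requirement identification sub-unit 401 is the queries in the collected search logs and the input of the hotspot determining sub-unit 402 is also the queries in the collected search logs; the connection between the two sub-units in this case is shown in FIG. 5 (c). Here the apparatus may further include a sub-unit that takes the intersection of the outputs of the hotspot determining sub-unit 402 and the requirement identification sub-unit 401, namely the intersection processing sub-unit 403 shown in FIG. 5 (c), whose output is the hotspot queries that have a question need.
The structure of the requirement identification sub-unit 401 may be as shown in FIG. 6; it specifically includes a word segmentation processing module 601, a word scoring module 602, a query scoring module 603, and a requirement judging module 604.
The word segmentation processing module 601 is configured to perform semantic-based word segmentation on an input query.
The word scoring module 602 is configured to match each word obtained by the word segmentation against the question attribute database to determine the question-tendency score of each word. The question attribute database stores words obtained by data mining or manual configuration, together with the question-tendency score corresponding to each word.
The query scoring module 603 is configured to sum the question-tendency scores of the words to obtain the question-tendency score of the input query.
The requirement judging module 604 is configured to determine whether the question-tendency score of the input query exceeds the preset question-need threshold; if so, the input query is determined to have a question need; otherwise, the input query is determined not to have a question need.
The question-tendency score corresponding to a word may be determined by, but is not limited to, the following factors: whether the word is an interrogative word, or the association between the word and interrogative words.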
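In addition, the structure of the hotspot determining sub-unit 402 may be as shown in FIG. 7; it specifically includes a clustering processing module 701, a frequency statistics module 702, a hotspot group determining module 703, and a hotspot query determining module 704.
The clustering processing module 701 is configured to perform relevance-based clustering on the queries to obtain query groups.
The frequency statistics module 702 is configured to sum the search frequencies of the queries in each query group to determine the search frequency of each query group.
The search frequency of each query group in effect reflects how hot that group is; accordingly, the hotspot group determining module 703 is configured to determine each query group whose search frequency exceeds the preset hotspot frequency to be a hotspot query group.
The hotspot query determining module 704 is configured to select one query from each hotspot query group as a hotspot query. The strategy for selecting a hotspot query from a hotspot query group may include, but is not limited to: selecting the query with the highest search frequency, selecting any query, or selecting the query with the best semantic integrity.
As shown in FIG. 4, the question forming unit 410 may specifically include a part-of-speech tagging sub-unit 411 and a sentence assembly sub-unit 412.
The part-of-speech tagging sub-unit 411 is configured to attach part-of-speech tags to the words obtained by semantic-based word segmentation of the hotspot query mined by the query mining unit 400.
Here, the part-of-speech tagging sub-unit 411 may itself perform word segmentation; that is, it first performs semantic-based word segmentation on the hotspot query mined by the query mining unit 400 and then attaches part-of-speech tags to the resulting words. Alternatively, the part-of-speech tagging sub-unit 411 may lack a word segmentation function and directly use the segmentation result produced for the hotspot query by the word segmentation processing module 601 in the requirement identification sub-unit 401, attaching part-of-speech tags to the words so obtained.
The sentence assembly sub-unit 412 is configured to compare, according to the attached part-of-speech tags, the words obtained by the word segmentation with the preset question sentence grammar, add the missing words, and assemble a question that satisfies the question sentence grammar.
Some existing knowledge question answering platforms allow only their registered users to ask questions, and manage registered users by ID. To accommodate this, a simulated question ID set can be preset, in which the IDs are treated by the knowledge question answering platform as IDs of registered users by default. In this case, the question issuing unit 420 may select one ID from the preset simulated question ID set and use the selected ID to simulate a user posting the question formed by the question forming unit 410 on a page of the knowledge question answering platform.
Preferably, the knowledge obtaining unit 430 can obtain, from the page of the knowledge question answering platform, the related knowledge information that answering users provide in response to the question, and further determine a high-quality answer from the related knowledge information. The high-quality answer may be determined with an administrator's participation, or by the knowledge question answering platform according to one or a combination of: the level of the user answering the question, the adoption rate of that user's answers, and the length of the related knowledge information.
In addition, to prevent questions that go unanswered for a long time, or that do not receive a high-quality answer for a long time, from lingering as invalid pages, the apparatus further includes a page maintenance unit 440, configured to close the page on which the question is posted on the knowledge question answering platform if, when the posting duration of the question on the page reaches the preset closing duration, no related knowledge information for the question has appeared or no high-quality answer to the question has appeared.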
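Embodiment 5
FIG. 8 is a structural diagram of the apparatus for searching for related knowledge information provided by Embodiment 5 of the present invention. The apparatus includes the apparatus shown in FIG. 4, a user interaction unit 801, and a page search unit 802.
The user interaction unit 801 is configured to receive a query input by a user.
The page search unit 802 is configured to search for pages that match the keywords of the query; if a page matching the keywords of the query is found among the pages on which the apparatus shown in FIG. 4 has posted questions on the knowledge question answering platform, the page found is included in the search results of the query and returned to the user.
In other words, the pages crawled by the search engine also include the question pages on the knowledge question answering platform.
In addition, the pages of the knowledge question answering platform can be specially processed so that the page search unit 802 is allowed to find only pages on the platform that already have a high-quality answer. If a question page on the knowledge question answering platform does not yet have a high-quality answer, the search results returned to the user will not include that page; that is, the search engine is set not to crawl pages on the knowledge question answering platform whose questions have not yet received a high-quality answer.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.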

Claims (20)

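  1. A method for acquiring related knowledge information, characterized in that the method comprises:
    A. analyzing search logs to mine hotspot search requests (queries) that have a question need;
    B. forming a question from each query mined in step A and posting it on a page of a knowledge question answering platform;
    C. obtaining related knowledge information for the question through the page of the knowledge question answering platform.
  2. The method according to claim 1, characterized in that step A specifically includes:
    identifying queries that have a question need in the search logs, and determining hotspot queries among the queries that have a question need; or,
    determining hotspot queries in the search logs, and identifying queries that have a question need among the determined hotspot queries; or,
    identifying queries that have a question need in the search logs, determining hotspot queries in the search logs, and taking the intersection of the identified queries that have a question need and the determined hotspot queries.
  3. The method according to claim 1, characterized in that identifying queries that have a question need specifically includes:
    performing semantic-based word segmentation on a query;
    matching each word obtained by the word segmentation against a question attribute database to determine the question-tendency score of each word;
    summing the question-tendency scores of the words to obtain the question-tendency score of the query;
    determining whether the question-tendency score of the query exceeds a preset question-need threshold; if so, determining that the query has a question need; otherwise, determining that the query does not have a question need;
    wherein the question attribute database stores words obtained by data mining or manual configuration, together with the question-tendency score corresponding to each word.
  4. The method according to claim 3, characterized in that the question-tendency score corresponding to a word is determined by the following factors:
    whether the word is an interrogative word, or the association between the word and interrogative words.
  5. The method according to claim 2, characterized in that determining hotspot queries specifically includes:
    performing relevance-based clustering on the queries to obtain query groups;
    summing the search frequencies of the queries in each query group to determine the search frequency of each query group;
    determining each query group whose search frequency exceeds a preset hotspot frequency to be a hotspot query group;
    selecting one query from the hotspot query group as a hotspot query.
  6. The method according to claim 1, characterized in that, in step B, forming a question from a query mined in step A specifically includes:
    attaching part-of-speech tags to the words obtained by semantic-based word segmentation of the mined query;
    comparing, according to the attached part-of-speech tags, the words obtained by the word segmentation with a preset question sentence grammar, adding the missing words, and assembling a question that satisfies the question sentence grammar.
  7. The method according to claim 1, characterized in that posting the question on a page of the knowledge question answering platform specifically includes:
    selecting one ID from a preset simulated question ID set, and using that ID to simulate a user posting the question on a page of the knowledge question answering platform; the IDs in the simulated question ID set are treated by the knowledge question answering platform as IDs of registered users by default.
  8. The method according to claim 1, characterized in that step C specifically includes:
    obtaining, from the page of the knowledge question answering platform, the related knowledge information that answering users provide in response to the question, and determining a high-quality answer from the related knowledge information.
  9. The method according to claim 8, characterized in that the method further includes:
    if, when the posting duration of the question on the page of the knowledge question answering platform reaches a preset closing duration, no related knowledge information for the question has appeared or no high-quality answer to the question has appeared, closing the page on which the question is posted on the knowledge question answering platform.
  10. A method for searching for related knowledge information, characterized in that the method is based on the method for acquiring related knowledge information according to claim 1, the method for searching for related knowledge information comprising:
    receiving a query input by a user;
    searching for pages that match the keywords of the query; wherein, if a page matching the keywords of the query is found on the knowledge question answering platform, that page is included in the search results of the query and returned to the user.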
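  11. An apparatus for acquiring related knowledge information, characterized in that the apparatus comprises: a search request (query) mining unit, a question forming unit, a question issuing unit, and a knowledge obtaining unit;
    the query mining unit is configured to analyze search logs and mine hotspot queries that have a question need;
    the question forming unit is configured to form a question from each hotspot query mined by the query mining unit;
    the question issuing unit is configured to post the question on a page of a knowledge question answering platform;
    the knowledge obtaining unit is configured to obtain related knowledge information for the question through the page of the knowledge question answering platform.
  12. The apparatus according to claim 11, characterized in that the query mining unit specifically includes a requirement identification sub-unit and a hotspot determining sub-unit;
    the requirement identification sub-unit is configured to identify, among its input queries, the queries that have a question need and to output them;
    the hotspot determining sub-unit is configured to determine, among its input queries, the hotspot queries and to output them;
    wherein the input of the requirement identification sub-unit is the queries in the search logs, the input of the hotspot determining sub-unit is the output of the requirement identification sub-unit, and the output of the hotspot determining sub-unit is the hotspot queries that have a question need;
    or, the input of the hotspot determining sub-unit is the queries in the search logs, the input of the requirement identification sub-unit is the output of the hotspot determining sub-unit, and the output of the requirement identification sub-unit is the hotspot queries that have a question need; or,
    the input of the requirement identification sub-unit is the queries in the search logs and the input of the hotspot determining sub-unit is also the queries in the search logs, in which case the apparatus further includes an intersection processing sub-unit, configured to take the intersection of the outputs of the hotspot determining sub-unit and the requirement identification sub-unit and to output the hotspot queries that have a question need.
  13. The apparatus according to claim 12, characterized in that the requirement identification sub-unit specifically includes: a word segmentation processing module, a word scoring module, a query scoring module, and a requirement judging module;
    the word segmentation processing module is configured to perform semantic-based word segmentation on an input query;
    the word scoring module is configured to match each word obtained by the word segmentation against a question attribute database to determine the question-tendency score of each word;
    the query scoring module is configured to sum the question-tendency scores of the words to obtain the question-tendency score of the input query;
    the requirement judging module is configured to determine whether the question-tendency score of the input query exceeds a preset question-need threshold; if so, the input query is determined to have a question need; otherwise, the input query is determined not to have a question need;
    wherein the question attribute database stores words obtained by data mining or manual configuration, together with the question-tendency score corresponding to each word.
  14. The apparatus according to claim 13, characterized in that the question-tendency score corresponding to a word is determined by the following factors:
    whether the word is an interrogative word, or the association between the word and interrogative words.
  15. The apparatus according to claim 12, characterized in that the hotspot determining sub-unit specifically includes: a clustering processing module, a frequency statistics module, a hotspot group determining module, and a hotspot query determining module;
    the clustering processing module is configured to perform relevance-based clustering on the queries to obtain query groups;
    the frequency statistics module is configured to sum the search frequencies of the queries in each query group to determine the search frequency of each query group;
    the hotspot group determining module is configured to determine each query group whose search frequency exceeds a preset hotspot frequency to be a hotspot query group;
    the hotspot query determining module is configured to select one query from each hotspot query group as a hotspot query.
  16. The apparatus according to claim 11, characterized in that the question forming unit specifically includes a part-of-speech tagging sub-unit and a sentence assembly sub-unit;
    the part-of-speech tagging sub-unit is configured to attach part-of-speech tags to the words obtained by semantic-based word segmentation of the query mined by the query mining unit;
    the sentence assembly sub-unit is configured to compare, according to the attached part-of-speech tags, the words obtained by the word segmentation with a preset question sentence grammar, add the missing words, and assemble a question that satisfies the question sentence grammar.
  17. The apparatus according to claim 11, characterized in that the question issuing unit specifically selects one ID from a preset simulated question ID set and uses the selected ID to simulate a user posting the question formed by the question forming unit on a page of the knowledge question answering platform;
    the IDs in the simulated question ID set are treated by the knowledge question answering platform as IDs of registered users by default.
  18. The apparatus according to claim 11, characterized in that the knowledge obtaining unit specifically obtains, from the page of the knowledge question answering platform, the related knowledge information that answering users provide in response to the question, and determines a high-quality answer from the related knowledge information.
  19. The apparatus according to claim 18, characterized in that the apparatus further includes:
    a page maintenance unit, configured to close the page on which the question is posted on the knowledge question answering platform if, when the posting duration of the question on the page reaches a preset closing duration, no related knowledge information for the question has appeared or no high-quality answer to the question has appeared.
  20. An apparatus for searching for related knowledge information, characterized in that the apparatus comprises: the apparatus according to claim 11, a user interaction unit, and a page search unit;
    the user interaction unit is configured to receive a query input by a user;
    the page search unit is configured to search for pages that match the keywords of the query; if a page matching the keywords of the query is found among the pages on which the apparatus according to claim 11 has posted questions on the knowledge question answering platform, the page found is included in the search results of the query and returned to the user.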

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2014501426A JP5780617B2 (ja) 2011-03-31 2012-03-29 関連知識情報を獲得・検索する方法及び装置

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201110081274.4A CN102737022B (zh) 2011-03-31 2011-03-31 获取和搜索相关知识信息的方法及装置
CN201110081274.4 2011-03-31

Publications (1)

Publication Number Publication Date
WO2012130145A1 true WO2012130145A1 (zh) 2012-10-04

Family

ID=46929469

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2012/073234 Ceased WO2012130145A1 (zh) 2011-03-31 2012-03-29 获取和搜索相关知识信息的方法及装置

Country Status (3)

Country Link
JP (1) JP5780617B2 (zh)
CN (1) CN102737022B (zh)
WO (1) WO2012130145A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103870457A (zh) * 2012-12-07 2014-06-18 北京百度网讯科技有限公司 一种确定问答平台中的未回答问题优先级的方法及装置
EP3063664A4 (en) * 2013-10-31 2017-07-05 Longsand Limited Topic-wise collaboration integration
CN105991399A (zh) * 2015-02-05 2016-10-05 天脉聚源(北京)科技有限公司 一种实现网络提问的方法和系统
CN104899322B (zh) * 2015-06-18 2021-09-17 百度在线网络技术(北京)有限公司 搜索引擎及其实现方法
JP6566810B2 (ja) * 2015-09-18 2019-08-28 株式会社ユニバーサルエンターテインメント 商業用情報提供システムおよび商業用情報提供方法
CN107688641B (zh) * 2017-08-28 2021-12-28 江西博瑞彤芸科技有限公司 一种提问管理方法及系统
CN109886733A (zh) * 2019-01-25 2019-06-14 平安科技(深圳)有限公司 信息推荐方法、存储介质及计算机设备
CN117235242B (zh) * 2023-11-15 2024-02-06 浙江力石科技股份有限公司 一种基于智能问答数据库的热点信息筛选方法及系统

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101093509A (zh) * 2007-07-18 2007-12-26 中国科学院计算技术研究所 一种查询交互系统和方法
CN101261690A (zh) * 2008-04-18 2008-09-10 北京百问百答网络技术有限公司 一种问题自动生成的系统及其方法
CN101751454A (zh) * 2009-12-12 2010-06-23 浙江大学 一种基于概率潜在语义分析的网络答案选择方法

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06301577A (ja) * 1993-04-12 1994-10-28 Fujitsu Ltd データベース装置
JP3908634B2 (ja) * 2002-09-11 2007-04-25 株式会社東芝 検索支援方法および検索支援装置
JP4512826B2 (ja) * 2005-03-03 2010-07-28 国立大学法人 筑波大学 質問応答システム
US8983977B2 (en) * 2006-03-01 2015-03-17 Nec Corporation Question answering device, question answering method, and question answering program
US20080104065A1 (en) * 2006-10-26 2008-05-01 Microsoft Corporation Automatic generator and updater of faqs
JP4860439B2 (ja) * 2006-11-08 2012-01-25 ヤフー株式会社 質問文の自動生成システム
JP2010282403A (ja) * 2009-06-04 2010-12-16 Kansai Electric Power Co Inc:The 文書検索方法

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112182193A (zh) * 2020-10-19 2021-01-05 山东旗帜信息有限公司 一种交通行业中日志获取方法、设备及介质
CN112182193B (zh) * 2020-10-19 2023-01-13 山东旗帜信息有限公司 一种交通行业中日志获取方法、设备及介质

Also Published As

Publication number Publication date
CN102737022A (zh) 2012-10-17
JP5780617B2 (ja) 2015-09-16
JP2014512600A (ja) 2014-05-22
CN102737022B (zh) 2015-01-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12765085

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2014501426

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12765085

Country of ref document: EP

Kind code of ref document: A1