CN121071128A - A method, apparatus, electronic device, and storage medium for extracting search keywords. - Google Patents

A method, apparatus, electronic device, and storage medium for extracting search keywords.

Info

Publication number
CN121071128A
Authority
CN
China
Prior art keywords
document
search
keywords
keyword
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410702983.7A
Other languages
Chinese (zh)
Inventor
金元浩
蔡雄
褚宏鑫
李悦
齐佳斌
张�杰
彭跃
龙凯
张杨
魏淑平
刘东阳
杜啸楠
李明海
周楠
裴亚琳
马勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202410702983.7A
Publication of CN121071128A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application relates to the field of data processing technology, and in particular to a method, apparatus, electronic device, and storage medium for extracting search keywords. The method first obtains a group of basic keywords that are extracted from the search text and indicate one search requirement, and then obtains a core keyword set constructed from the candidate documents in the corresponding search database. Next, candidate keywords are selected from the core keywords by calculating the similarity between the overall text feature of the group of basic keywords and the feature of each core keyword. Finally, at least one basic keyword and each selected candidate keyword are used as the search keywords. The search keywords determined for one search requirement therefore include both at least one basic keyword extracted directly from the search text and candidate keywords selected from the core keyword set, which effectively expands the search keywords beyond those contained in the search text.

Description

Retrieval keyword extraction method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and apparatus for extracting a search keyword, an electronic device, and a storage medium.
Background
In the prior art, when text data is searched, a search keyword is generally extracted from a search text, and then a matched target document is determined from candidate documents according to the search keyword.
For example, a Term Frequency-Inverse Document Frequency (TF-IDF) algorithm can be used to calculate the TF-IDF value of each search keyword in the candidate documents, and the target document matching the search text is determined from those values.
However, search keywords are currently extracted using only a general-purpose word segmentation mode. Such segmentation works well on common words but performs poorly on rare, specialized terms in a specific field. As a result, search keywords cannot be accurately obtained from the search text, effective text search cannot be performed, and text search efficiency drops significantly.
Disclosure of Invention
The embodiment of the application provides a method, a device, electronic equipment and a storage medium for extracting search keywords, which are used for extracting accurate search keywords from a search text.
In a first aspect, a method for extracting a search keyword is provided, including:
acquiring a group of basic keywords extracted based on a search text, wherein the group of basic keywords comprises at least one basic keyword for indicating a search requirement;
acquiring a core keyword set constructed from the candidate documents in a corresponding search database, wherein the core keyword set is obtained by extracting, from each candidate document, document keywords whose importance meets a set condition, merging document keywords with continuous content by counting the coexistence and word distance of every two document keywords in a single candidate document, and constructing the core keyword set based on the processed document keywords, and wherein the importance is determined according to the word frequency of the corresponding document keyword in the candidate document from which it is extracted and its occurrence across the candidate documents;
determining, among the core keywords, candidate keywords that meet a preset screening condition according to the similarity between the overall text feature extracted from the group of basic keywords and the word feature extracted from each core keyword;
and determining the at least one basic keyword and the candidate keywords as search keywords.
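The screening step of the first aspect can be illustrated with a minimal sketch. The application does not specify how the overall text feature or the word features are extracted, so the character-bigram vectors, the cosine similarity, the threshold of 0.3, and all function names below are assumptions chosen only to keep the example self-contained.

```python
import math
from collections import Counter


def char_bigram_vector(text):
    """Toy feature extractor (assumption): bag of character bigrams."""
    compact = text.lower().replace(" ", "")
    return Counter(compact[i:i + 2] for i in range(len(compact) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_search_keywords(basic_keywords, core_keywords, threshold=0.3):
    # One overall feature for the whole group of basic keywords.
    overall = char_bigram_vector(" ".join(basic_keywords))
    # Screening condition (assumption): similarity at or above a fixed threshold.
    candidates = [kw for kw in core_keywords
                  if cosine(overall, char_bigram_vector(kw)) >= threshold]
    # Search keywords = basic keywords plus screened candidates, deduplicated in order.
    return list(dict.fromkeys(list(basic_keywords) + candidates))


result = select_search_keywords(
    ["redis", "cluster"],
    ["redis cluster failover", "payment gateway"],
)
```

In practice the toy extractor would presumably be replaced by a learned text encoder; only the overall flow (one group-level feature compared against every core keyword) is taken from the text above.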
In a second aspect, an extracting device for a search keyword is provided, including:
A first obtaining unit, configured to obtain a set of basic keywords extracted based on a search text, where the set of basic keywords includes at least one basic keyword for indicating a search requirement;
The second acquisition unit is used for acquiring a core keyword set constructed by each candidate document in the corresponding search database, wherein the core keyword set is obtained by respectively extracting document keywords with importance meeting a set condition from each candidate document, combining document keywords with continuous content by counting coexistence situations and word distances of every two document keywords in a single candidate document, and constructing the core keyword set based on each processed document keyword, and the importance is determined according to word frequency of the corresponding document keyword in the extracted candidate document and occurrence situations in each candidate document;
the screening unit is used for determining each candidate keyword which accords with a preset screening condition in each core keyword according to the similarity condition between the whole text characteristics extracted by the group of basic keywords and the word characteristics extracted by the corresponding core keywords;
and the determining unit is used for determining the at least one basic keyword and the candidate keywords as search keywords.
Optionally, after the determining the at least one basic keyword and the candidate keywords as the search keywords, a search unit in the apparatus is configured to:
in the search database, according to preset M search modes, combining the search keywords, and searching and determining each initial document respectively associated with at least one search keyword;
Obtaining a constructed document structure diagram, wherein the document structure diagram comprises nodes which are respectively constructed corresponding to various document data blocks in each candidate document and connecting edges used for representing a content link relation and a content attribution relation;
and clustering the initial documents with the link relation according to the document structure diagram to obtain target document sets of various associated search keywords, and respectively generating search results based on the target document sets.
Optionally, when the searching unit searches and determines each initial document associated with at least one search keyword according to a plurality of preset searching modes and by combining the search keywords in the searching database, the searching unit is configured to perform any one of the following operations:
in the search database, according to preset M search modes, combining the search keywords, and after M search processes are executed in series, obtaining initial documents which are determined by the last search process and are respectively associated with at least one search keyword;
In the search database, M search processes are executed in parallel according to M preset search modes and in combination with the search keywords respectively, at least one matching document of the associated content matching degree determined in each search process is obtained, and each initial document is screened out from the matching documents based on the at least one content matching degree associated with each matching document.
Optionally, the M search modes are obtained by the search unit in at least one of the following modes:
acquiring each self-defined retrieval logic, and determining each retrieval mode according to each retrieval logic;
and acquiring each search module defined by an external system, and acquiring corresponding search modes by loading each search module.
Optionally, in any one of the search processes except the last search process, the search unit performs the following operations:
determining each candidate document of the retrieval basis and a retrieval mode of the basis;
determining each matching document and the corresponding content matching degree in each candidate document according to each search keyword by adopting the search mode;
And respectively calculating the accumulated value of the content matching degree up to the current retrieval process aiming at each candidate document according to the basis, screening out the specified number of target documents with the highest accumulated value as each candidate document according to the next retrieval process, wherein the specified number is determined according to the execution sequence of the retrieval process.
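The serial retrieval described above can be sketched as follows. The two scoring functions, the document identifiers, and the per-stage keep counts are hypothetical; the application only states that content matching degrees are accumulated across search processes and that the number of documents carried forward depends on the execution order.

```python
def hit_count(text, keywords):
    """Hypothetical search mode 1: total number of keyword occurrences."""
    return float(sum(text.count(kw) for kw in keywords))


def coverage(text, keywords):
    """Hypothetical search mode 2: fraction of keywords present in the text."""
    return sum(kw in text for kw in keywords) / len(keywords)


def serial_retrieval(candidates, keywords, modes, keep_counts):
    """candidates: {doc_id: text}. keep_counts[i] is the number of
    documents kept after search process i (assumed to shrink per stage)."""
    cumulative = {doc_id: 0.0 for doc_id in candidates}
    current = list(candidates)
    for mode, keep in zip(modes, keep_counts):
        for doc_id in current:
            # Accumulate the content matching degree up to this process.
            cumulative[doc_id] += mode(candidates[doc_id], keywords)
        # Documents with the highest accumulated value feed the next process.
        current = sorted(current, key=lambda d: cumulative[d], reverse=True)[:keep]
    return current


docs = {
    "d1": "redis cluster failover guide",
    "d2": "redis basics",
    "d3": "cooking pasta",
}
initial_docs = serial_retrieval(docs, ["redis", "failover"], [hit_count, coverage], [2, 1])
```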
Optionally, when each initial document is screened out from each matching document based on at least one content matching degree associated with each matching document, the search unit is configured to:
Determining at least one content matching degree associated with one matching document, determining target retrieval modes corresponding to the at least one content matching degree, and determining a matching degree fusion value corresponding to the one matching document according to mode weights respectively preset for the M retrieval modes and combining the content matching degrees under at least one target retrieval mode;
and determining each matching document with the corresponding matching degree fusion value reaching the preset condition as each initial document.
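For the parallel case, the matching degree fusion value can be sketched as a weighted sum. The mode names, weights, and the threshold used as the preset condition are assumptions for illustration; the application does not fix a particular fusion formula.

```python
def fuse_matches(results_by_mode, mode_weights, min_fused=0.5):
    """results_by_mode: {mode_name: {doc_id: content matching degree}}.
    A document contributes only for the modes (target search modes) that matched it."""
    fused = {}
    for mode, matches in results_by_mode.items():
        for doc_id, degree in matches.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + mode_weights[mode] * degree
    # Preset condition (assumption): fusion value at or above a threshold.
    kept = [doc_id for doc_id, value in fused.items() if value >= min_fused]
    return sorted(kept, key=lambda d: fused[d], reverse=True)


parallel_results = {
    "keyword_mode": {"d1": 0.9, "d2": 0.4},   # hypothetical mode names
    "semantic_mode": {"d1": 0.6, "d3": 0.8},
}
weights = {"keyword_mode": 0.5, "semantic_mode": 0.5}
initial_docs_parallel = fuse_matches(parallel_results, weights)
```

Here only d1 is matched by both modes, so it is the only document whose fused value (0.75) clears the assumed threshold.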
Optionally, when the initial documents with the link relationship are clustered according to the document structure diagram, the search unit is configured to include:
for each initial document, the following is performed:
determining at least one hit keyword contained in one initial document among the search keywords;
For each hit keyword, determining a target node to which the hit keyword belongs in the document structure diagram, and clustering the one initial document and other initial documents when determining that the target node and a child node belonging to the target node have connecting edges linked to the other initial documents.
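The clustering over the document structure diagram can be sketched with a small union-find. The node layout below (a dict with `doc`, `text`, `children`, and `links` fields) is a hypothetical encoding of the document data blocks, content attribution edges, and content link edges; the application does not prescribe a data structure.

```python
def cluster_initial_documents(initial_docs, hits, nodes):
    """initial_docs: list of doc ids; hits: {doc_id: [hit keywords]};
    nodes: {node_id: {"doc", "text", "children", "links"}} (assumed layout)."""
    parent = {doc_id: doc_id for doc_id in initial_docs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for doc_id in initial_docs:
        for keyword in hits.get(doc_id, []):
            for node in nodes.values():
                # Target node: a data block of this document containing the hit keyword.
                if node["doc"] != doc_id or keyword not in node["text"]:
                    continue
                scope = [node] + [nodes[child] for child in node["children"]]
                for block in scope:
                    for linked_doc in block["links"]:
                        if linked_doc in parent and linked_doc != doc_id:
                            parent[find(doc_id)] = find(linked_doc)

    clusters = {}
    for doc_id in initial_docs:
        clusters.setdefault(find(doc_id), []).append(doc_id)
    return sorted(sorted(group) for group in clusters.values())


structure = {
    "n1": {"doc": "d1", "text": "redis failover overview", "children": ["n2"], "links": []},
    "n2": {"doc": "d1", "text": "see sentinel setup", "children": [], "links": ["d2"]},
    "n3": {"doc": "d2", "text": "sentinel setup steps", "children": [], "links": []},
    "n4": {"doc": "d3", "text": "redis pricing", "children": [], "links": []},
}
target_sets = cluster_initial_documents(
    ["d1", "d2", "d3"],
    {"d1": ["failover"], "d2": ["sentinel"], "d3": ["redis"]},
    structure,
)
```

In this example d1 and d2 end up in one target document set because a child block of the node hit by "failover" links to d2, while d3 stays on its own.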
Optionally, before the acquiring the set of basic keywords extracted based on the search text, the first acquiring unit is configured to:
Responding to a search request triggered by a target object aiming at a search text, analyzing the search text to obtain each basic keyword and semantic roles corresponding to each basic keyword, wherein the semantic roles are used for indicating predicates and various modifier words associated with the predicates;
according to the semantic roles of the basic keywords, basic keywords of the corresponding predicates are determined, and aiming at the basic keywords of each corresponding predicate, the following operation is carried out, namely, a corresponding basic keyword phrase is constructed based on the basic keyword corresponding to one predicate and the basic keyword used for modifying the one predicate.
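Grouping basic keywords by predicate can be sketched as below, assuming an upstream semantic role labeler has already tagged each basic keyword. The tuple layout `(text, role, head_index)` and the role labels are hypothetical; no specific parser is named in the application.

```python
def build_keyword_groups(tagged_keywords):
    """tagged_keywords: list of (text, role, head_index) tuples, where role is
    "predicate" or "modifier" and head_index points at the modified predicate."""
    groups = {}
    for index, (text, role, _head) in enumerate(tagged_keywords):
        if role == "predicate":
            groups[index] = [text]
    for text, role, head in tagged_keywords:
        if role == "modifier" and head in groups:
            groups[head].append(text)
    # One group of basic keywords per predicate, i.e. per search requirement.
    return list(groups.values())


tagged = [
    ("restart", "predicate", None),
    ("redis", "modifier", 0),
    ("check", "predicate", None),
    ("logs", "modifier", 2),
]
keyword_groups = build_keyword_groups(tagged)
```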
Optionally, before the obtaining a set of basic keywords extracted based on the search text, the second obtaining unit constructs a core keyword set in the following manner:
extracting document keywords with importance meeting set conditions from the candidate documents respectively;
dividing every two coexisting document keywords in a single candidate document into a document keyword group;
counting two corresponding document keywords, determining an average word distance when the two document keywords appear in each candidate document, and merging the two document keywords into one document keyword when the two document keywords are determined to be continuous based on the average word distance;
And constructing a core keyword set based on the processed keywords of each document.
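The merging of continuous document keywords can be sketched as follows. Documents are modelled as token lists, only the first occurrence of each keyword in a document is considered, and two keywords are treated as continuous when their average word distance is positive and at most `max_gap`; all of these are simplifying assumptions, as is every function name.

```python
from itertools import combinations


def average_word_distance(docs, first, second):
    """Mean of (position of `second` minus position of `first`) over the
    documents in which both keywords coexist; None if they never coexist."""
    gaps = [doc.index(second) - doc.index(first)
            for doc in docs if first in doc and second in doc]
    return sum(gaps) / len(gaps) if gaps else None


def merge_continuous_keywords(docs, keywords, max_gap=1.0):
    merged, used = [], set()
    for a, b in combinations(keywords, 2):
        # Try both orders, since the second word may precede the first.
        for first, second in ((a, b), (b, a)):
            if first in used or second in used:
                continue
            gap = average_word_distance(docs, first, second)
            if gap is not None and 0 < gap <= max_gap:
                merged.append(first + " " + second)
                used.update((first, second))
    # Keywords that were not merged are kept unchanged.
    return merged + [kw for kw in keywords if kw not in used]


token_docs = [
    ["deploy", "vector", "database", "today"],
    ["the", "vector", "database", "index"],
]
processed_keywords = merge_continuous_keywords(token_docs, ["vector", "database", "index"])
```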
Optionally, when the core keyword set is constructed based on the processed keywords of each document, the second obtaining unit is configured to:
for each processed document keyword, performing the operations of counting the total number of other keywords coexisting with one document keyword in a single candidate document among the processed document keywords;
Taking the processed document keywords as each graph node, and after establishing directed connection edges between graph nodes with differences in total numbers of other related keywords, obtaining a constructed keyword graph;
Combining each three graph nodes meeting a combination condition into graph nodes corresponding to corresponding combined keywords in the keyword graph, wherein the combination condition is that the preset three-node structure is met, and the combined keywords obtained by the corresponding three document keywords exist in each candidate document;
And constructing a core keyword set based on keywords corresponding to each graph node in the keyword graph.
Optionally, the three-node structure includes any one of the following:
two graph nodes point to one and the same graph node;
the two graph nodes point to one and the same graph node, and a connecting edge exists between the two graph nodes.
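The keyword graph and the three-node merge can be sketched as below. The edge direction (from the keyword with fewer coexisting keywords to the one with more), the restriction of edges to keywords that coexist in some document, and the verbatim-phrase check are assumptions; the text above only requires that the totals differ and that the combined keyword exists in the candidate documents. Because the sketch does not inspect whether the two source nodes are connected, it accepts both three-node variants.

```python
from itertools import combinations, permutations


def cooccurrence_totals(docs, keywords):
    """For each keyword, count the distinct other keywords sharing a document with it."""
    totals = {}
    for kw in keywords:
        partners = set()
        for doc in docs:
            if kw in doc:
                partners.update(k for k in keywords if k != kw and k in doc)
        totals[kw] = len(partners)
    return totals


def build_edges(docs, keywords):
    """Directed edges between coexisting keywords whose totals differ
    (assumed direction: smaller total -> larger total)."""
    totals = cooccurrence_totals(docs, keywords)
    edges = set()
    for a, b in combinations(keywords, 2):
        coexist = any(a in doc and b in doc for doc in docs)
        if coexist and totals[a] != totals[b]:
            edges.add((a, b) if totals[a] < totals[b] else (b, a))
    return edges


def merge_three_node_structures(docs, keywords):
    edges = build_edges(docs, keywords)
    texts = [" " + " ".join(doc) + " " for doc in docs]
    merged, used = [], set()
    for target in keywords:
        sources = sorted(src for src, dst in edges if dst == target)
        for a, b in combinations(sources, 2):  # two nodes pointing at `target`
            if used & {a, b, target}:
                continue
            # Merge only if some ordering of the three words occurs in a document.
            for order in permutations((a, b, target)):
                if any(" " + " ".join(order) + " " in text for text in texts):
                    merged.append(" ".join(order))
                    used.update(order)
                    break
    return merged + [kw for kw in keywords if kw not in used]


graph_docs = [["train", "large", "language", "model"], ["model", "index"]]
core_keywords = merge_three_node_structures(
    graph_docs, ["large", "language", "model", "index"])
```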
Optionally, the importance of a document keyword is determined by the second obtaining unit in the following manner:
calculating initial word frequency of one document keyword in a corresponding candidate document and inverse document frequency of the one document keyword in each candidate document;
And obtaining a target word frequency after the initial word frequency is subjected to value limiting, and taking the product result of the target word frequency and the inverse document frequency as the value of the importance of the document keyword.
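The importance calculation can be sketched as a TF-IDF score with a limited term frequency. "Value limiting" is interpreted here as an upper cap on the term frequency, and the IDF variant `log(N / df) + 1` is one common choice rather than the one claimed; both are assumptions, as is the cap of 0.5 used in the example.

```python
import math
from collections import Counter


def keyword_importance(docs, tf_cap=0.2):
    """docs: list of token lists. Returns one {term: importance} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of candidate documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    all_scores = []
    for doc in docs:
        scores = {}
        for term, count in Counter(doc).items():
            initial_tf = count / len(doc)
            target_tf = min(initial_tf, tf_cap)             # value limiting (assumed cap)
            idf = math.log(n_docs / doc_freq[term]) + 1.0   # assumed IDF variant
            scores[term] = target_tf * idf
        all_scores.append(scores)
    return all_scores


importance = keyword_importance([["a", "a", "a", "b"], ["b", "c"]], tf_cap=0.5)
```

Capping the term frequency keeps a word that is repeated many times in one document from dominating the score, so the rarity signal carried by the IDF has more influence on which document keywords meet the set condition.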
In a third aspect, an electronic device is presented comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above method when executing the computer program.
In a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, implements the above method.
In a fifth aspect, a computer program product is proposed, comprising a computer program which, when executed by a processor, implements the above method.
The application has the following beneficial effects:
When extracting search keywords for a search requirement determined from a search text, a group of basic keywords extracted from the search text and indicating the search requirement is first acquired, and then a core keyword set constructed from the candidate documents in the corresponding search database is acquired. Each core keyword in the core keyword set is obtained by extracting, from each candidate document, document keywords whose importance meets a set condition and merging those extracted document keywords whose content is continuous. Merging document keywords with continuous content allows uncommon, rarely used words to be reassembled, overcomes the adverse effects of inaccurate word segmentation, and improves the accuracy and reference value of each determined core keyword.
Furthermore, candidate keywords are screened out from the core keywords by calculating the feature similarity between the overall text feature corresponding to the group of basic keywords and the word feature of each core keyword. The correlation between a core keyword and the search requirement can thus be evaluated as a whole, which helps expand the keywords with candidate keywords that closely match the search requirement and improves the extraction accuracy of the search keywords. At least one basic keyword in the group and each screened candidate keyword are then used as the search keywords. The search keywords determined for one search requirement therefore include both at least one basic keyword extracted directly from the search text and the candidate keywords screened out from the core keyword set.
Drawings
FIG. 1 is a schematic diagram of a possible application scenario in an embodiment of the present application;
FIG. 2 is a schematic diagram of a retrieval keyword extraction process in an embodiment of the present application;
FIG. 3A is a schematic diagram of a process for constructing a core keyword set according to an embodiment of the present application;
FIG. 3B is a schematic diagram of a process for determining document keywords for each candidate document according to an embodiment of the application;
FIG. 3C is a schematic diagram illustrating a process for determining a keyword group of a document according to an embodiment of the present application;
FIG. 3D is a schematic diagram of a process for determining an average word distance according to an embodiment of the present application;
FIG. 3E is a diagram illustrating another process for determining an average word distance according to an embodiment of the present application;
FIG. 4A is a schematic diagram of another process for constructing a core keyword set according to an embodiment of the present application;
FIG. 4B is a schematic diagram of a preset three-node structure according to an embodiment of the present application;
FIG. 4C is a schematic diagram of a process for screening candidate keywords according to an embodiment of the application;
FIG. 5A is a schematic diagram of a retrieval process based on each retrieval keyword according to an embodiment of the present application;
FIG. 5B is a schematic diagram of a process for serially performing a search process according to an embodiment of the present application;
FIG. 5C is a schematic diagram of a parallel search process according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a system according to an embodiment of the present application;
FIG. 7A is a schematic diagram illustrating a process of searching by the auxiliary customer service system according to an embodiment of the present application;
FIG. 7B is a diagram illustrating a retrieval process according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a logic structure of an extracting device for a search keyword according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a hardware composition structure of an electronic device to which the embodiment of the present application is applied;
FIG. 10 is a schematic diagram of a hardware composition structure of another electronic device to which the embodiment of the present application is applied.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the technical solutions of the present application, but not all embodiments. All other embodiments, based on the embodiments described in the present document, which can be obtained by a person skilled in the art without any creative effort, are within the scope of protection of the technical solutions of the present application.
The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be capable of operation in sequences other than those illustrated or otherwise described.
In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program having a predetermined function and working together with other relevant parts to achieve a predetermined object, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.
Some terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.
Term Frequency-Inverse Document Frequency (TF-IDF) is a weighting method widely used in information retrieval and text mining. It evaluates the importance of a word or phrase in a document by the product of its term frequency in that document and its inverse document frequency. TF-IDF has two parts. Term Frequency (TF) reflects how often a word occurs in a given document and is usually the ratio of the number of occurrences of the word in the document to the total number of words in the document. Inverse Document Frequency (IDF) measures how common the word is across documents: the rarer the word, the higher its IDF.
Vertical category refers to a specific scenario or a specific field.
Search refers to the process of finding, in the search database, document content that matches the search keywords.
Search database refers to the database that is constructed before searching and contains all searchable content. In the embodiments of the present application, where text content is searched, various types of documents can be stored in the search database.
Search keyword refers to a word that is used in the search process to match the target content to be searched.
Search text refers to the text content that is edited by the object triggering the search request and expresses its search requirements; the search text contains at least one search requirement.
Core keyword set refers to a set of core keywords, where a core keyword is a keyword that is extracted from the candidate documents in the search database, whose importance meets the requirement, and that is obtained after continuous content has been merged.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, pre-trained model technologies, operation/interaction systems, mechatronics, and the like. A pre-trained model, also called a large model or foundation model, can be widely applied to downstream tasks in all major directions of artificial intelligence after fine-tuning. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The following briefly describes the design concept of the embodiment of the present application:
In existing knowledge base search schemes, when a text search is performed, search keywords are generally extracted directly from the search text, and the TF-IDF value of each search keyword is then determined in each candidate document of the search database, so that the target documents most relevant to the search keywords are found among the candidate documents. In other words, a search scheme based on the TF-IDF algorithm uses the TF-IDF value as an index of how important a word is in a document and matches it against the search keywords, so that the documents most relevant to the search keywords can be found.
However, search keywords are currently extracted using only a general-purpose word segmentation mode. Such segmentation works well on common words but performs poorly on rare, specialized terms in a specific field. As a result, search keywords cannot be accurately obtained from the search text, document content that meets the search requirement cannot be accurately found, and the search effect is greatly reduced.
In view of this, in the present application, when extracting search keywords for a search requirement determined from a search text, a group of basic keywords extracted from the search text and indicating the search requirement is first acquired, and then a core keyword set constructed from the candidate documents in the corresponding search database is acquired. Each core keyword in the core keyword set is obtained by extracting, from each candidate document, document keywords whose importance meets a set condition and merging those extracted document keywords whose content is continuous.
Furthermore, candidate keywords are screened out from the core keywords by calculating the feature similarity between the overall text feature corresponding to the group of basic keywords and the word feature of each core keyword. The correlation between a core keyword and the search requirement can thus be evaluated as a whole, which helps expand the keywords with candidate keywords that closely match the search requirement and improves the extraction accuracy of the search keywords. At least one basic keyword in the group and each screened candidate keyword are then used as the search keywords. The search keywords determined for one search requirement therefore include both at least one basic keyword extracted directly from the search text and the candidate keywords screened out from the core keyword set.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and not for limitation of the present application, and that the embodiments of the present application and the features of the embodiments may be combined with each other without conflict.
FIG. 1 is a schematic diagram of a possible application scenario in an embodiment of the present application. The application scenario includes a search device 110, a service device 120, and a client device 130.
In some possible embodiments of the present application, the search device 110 may respond to a search request sent by the service device 120 and obtain the search text carried in the search request, where the search request is triggered by a target object on the client device 130. After determining the search keywords from the search text in the manner provided by the present application, the search device 110 searches and screens the candidate documents included in the search database according to the search keywords, integrates the target documents found to obtain sets of search results, and sends the sets of search results to the service device 120, so that the service device 120 presents one or more sets of search results on the client device 130 according to actual presentation needs. One set of search results includes the target documents determined for at least one search keyword.
In other possible embodiments of the present application, the search device 110 may respond to a search request sent by the client device 130 and obtain the search text carried in the search request. After determining the search keywords from the search text in the manner provided by the present application, the search device 110 searches and screens the candidate documents included in the search database according to the search keywords, integrates the target documents found to obtain sets of search results, and presents one or more sets of search results on the client device 130 according to actual presentation needs. One set of search results includes the target documents determined for at least one search keyword.
The search request sent by the client device 130 may be initiated by any one of an applet application, a client application, and a web application on the client device 130, which is not particularly limited in the present application.
The search device 110 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN) services, and big data and artificial intelligence platforms.
The service device 120 may likewise be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN) services, and big data and artificial intelligence platforms.
Client devices 130 include, but are not limited to, cell phones, tablet computers, notebooks, electronic book readers, intelligent voice interaction devices, intelligent appliances, vehicle terminals, aircraft, and the like.
In the embodiment of the application, where the search device 110 interacts directly with the service device 120, the two can communicate through a wired network or a wireless network; likewise, where the search device 110 interacts directly with the client device 130, the two can communicate through a wired network or a wireless network. In the following description, the extraction process of the search keywords and the search process based on the search keywords are described only from the viewpoint of the search device 110.
The following is a schematic description in connection with possible extraction scenarios of search keywords:
Application scenario 1: retrieving technical documents in a vertical knowledge base.
After the vertical knowledge base is built based on the internal technical documents, the retrieval equipment can firstly extract the document keywords with importance meeting the set conditions from the technical documents included in the vertical knowledge base, then combine the document keywords with continuous content in all the document keywords, and further use the processed document keywords as core keywords to build a core keyword set.
The search device can respond to a search request triggered by an intranet device based on a search text and extract a corresponding basic keyword group for each search requirement in the search text. It then screens candidate keywords out of the core keywords by calculating the feature similarity between the overall text feature corresponding to the basic keyword group and the word feature of each core keyword, and determines the basic keywords included in the basic keyword group, together with the candidate keywords, as the search keywords.
And then, searching in each technical document by adopting a conventional searching mode according to each searching keyword, or searching according to the searching mode provided by the application, and finally determining a matched target document.
Application scenario 2: assisting an intelligent customer service application by retrieving technical documents in a vertical knowledge base.
After the vertical knowledge base is constructed based on the technical documents in the appointed field, the retrieval equipment can firstly extract the document keywords with importance meeting the set condition from the technical documents included in the vertical knowledge base, then combine the document keywords with continuous content in all the document keywords, and further respectively use the processed document keywords as core keywords to construct a core keyword set.
The search equipment can respond to a search request triggered by a server corresponding to the intelligent customer service application, extracts corresponding basic keyword groups aiming at each search requirement in a search text, and then screens out candidate keywords from the core keywords by calculating the feature similarity between the overall text features corresponding to the basic keyword groups and the word features of the core keywords, and further determines the basic keywords and the candidate keywords contained in the basic keyword groups as the search keywords, wherein the intelligent customer service application can be various intelligent robot applications and the like.
And then, searching in each technical document by adopting a conventional searching mode according to each searching keyword or searching according to the searching mode provided by the application to finally obtain a searching result, and further, providing the searching result to a server corresponding to the intelligent customer service application so as to enable the intelligent customer service application to present the searched content.
Application scenario 3: assisting an intelligent customer service application in retrieving historical data.
After history documents are constructed from the history question-and-answer records, the retrieval device can first extract, from each history document, the document keywords whose importance meets the set condition, then merge the document keywords whose content is continuous, and use the processed document keywords as core keywords to construct a core keyword set. One history document comprises the interactive texts generated when one business object performs one business interaction in the intelligent customer service application, where one business interaction corresponds to one business consultation that the business object conducts with the intelligent customer service application. For example, interactive texts whose interaction time interval is below a set value can be treated as the interactive texts of one business interaction, and the sender of an interactive text is either the business object or the intelligent customer service application.
Furthermore, the search equipment can respond to a search request triggered by the intelligent customer service application based on the search text, extract corresponding basic keyword groups aiming at each search requirement in the search text, and then screen out candidate keywords from the core keywords by calculating the feature similarity between the overall text features corresponding to the basic keyword groups and the word features of the core keywords, and further determine the basic keywords in the basic keyword groups and the candidate keywords as the search keywords.
And then searching in each history document by adopting a conventional searching mode according to each searching keyword or searching according to the searching mode provided by the application to finally determine a searching result, and further providing the searching result to a server corresponding to the intelligent customer service application so as to enable the intelligent customer service application to present the searched content.
In addition, it should be understood that, in the specific embodiment of the present application, the extraction of the search keywords and the implementation of the search process are involved, and when the embodiments described in the present application are applied to specific products or technologies, the collection, use and processing of the relevant data need to comply with relevant laws and regulations and standards of relevant countries and regions.
The following describes a retrieval keyword extraction process from the perspective of a retrieval device with reference to the accompanying drawings:
Referring to fig. 2, which is a schematic diagram of a process for extracting search keywords according to an embodiment of the present application, the process for extracting search keywords is described below with reference to fig. 2:
In the embodiment of the application, before extracting the search keywords, in order to realize full search based on the search text, the search equipment needs to analyze the obtained search text and determine the search requirements indicated in the search text, wherein the total number of the indicated search requirements can be one or more, and based on the search requirements, in order to describe the search requirements, a corresponding group of basic keywords are required to be extracted according to the search text.
In some embodiments for splitting search requirements from a search text, in the case that the search text includes predicates, the search device, in response to a search request triggered by a target object for the search text, parses the search text to obtain the basic keywords and the semantic role corresponding to each basic keyword, where the semantic roles indicate the predicates and the various modifiers associated with the predicates. The search device then determines, according to the semantic role of each basic keyword, the basic keywords that serve as predicates, and performs the following operation for each such predicate: constructing a corresponding basic keyword group from the predicate and the basic keywords that modify it.
When parsing the search text to obtain the basic keywords and the corresponding predicate determination results, the search device may first segment the search text into words, where the word segmentation methods that can be used include, but are not limited to, the junction segmentation mode and the word segmentation functions in the Natural Language Toolkit (NLTK). A semantic role labeling (SRL) model is then used to label the semantic roles played by the predicates and other words in the search text. The content identified by SRL includes predicates, arguments, and role labels: a predicate is the verb that triggers an event, an argument is a role in the event expressed by the predicate, and a role label is the name of the specific semantic role borne by an argument. Based on the content identified by SRL, the predicates and their modifiers can be determined among the basic keywords. In addition, the SRL model can score the importance of each basic keyword while labeling the semantic roles.
It should be appreciated that, based on the number of predicates parsed, the search text may be split into a corresponding number of search tasks to fulfil a corresponding number of search requirements.
In this way, considering that the search text edited by the target object when triggering a search operation may contain several search requirements, identifying the semantic roles in the search text allows at least one search event to be generated for at least one predicate, so that at least one group of basic keywords representing at least one search requirement is obtained by combination. This provides a basis for targeted processing of different search requirements, improves the search effect, and avoids omitting search requirements.
In other embodiments where the search requirement is split from the search text, the search text may be considered to have only one search requirement if no predicate is included in the search text.
For example, the search text is "mid-autumn festival", then only one search requirement for "mid-autumn festival" may be determined.
In still other embodiments that split the search requirement from the search text, where predicates are not included in the search text, but connectors are present, the two portions of text content that the connectors connect may be divided into different search requirements.
For example, the search text is "a model and B model", and then two search requirements, respectively, search for a model and search for B model, can be split.
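The connector-based split described above can be sketched as follows. This is only an illustrative sketch: the connector list `CONNECTORS` and the function name `split_requirements` are hypothetical, since the application does not enumerate the connectors or prescribe an implementation.

```python
import re

# Hypothetical connector list; the application does not enumerate connectors.
CONNECTORS = ("and", "or")


def split_requirements(search_text: str) -> list[str]:
    """Split predicate-free search text into one search requirement per connected part."""
    # Match a connector only when it stands as a separate word between spaces.
    pattern = r"\s+(?:" + "|".join(map(re.escape, CONNECTORS)) + r")\s+"
    parts = re.split(pattern, search_text.strip())
    return [part for part in parts if part]


print(split_requirements("A model and B model"))   # ['A model', 'B model']
print(split_requirements("mid-autumn festival"))   # ['mid-autumn festival']
```

A text without any connector comes back as a single requirement, which matches the "mid-autumn festival" case above.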
In the following description of the present application, only processing for a single search requirement in a search text is taken as an example, and a process of extracting a search keyword for a single search requirement is schematically described:
Step 201, a retrieval device obtains a set of basic keywords extracted based on a retrieval text, wherein the set of basic keywords comprises at least one basic keyword for indicating a retrieval requirement.
Specifically, in order to process for one search requirement, the search device acquires a set of basic keywords extracted based on the search text, wherein the acquired set of basic keywords is used for indicating one search requirement and comprises at least one basic keyword.
It should be noted that, according to the number of search requirements covered by the search text, the obtained group of basic keywords may include all keywords obtained based on the search text segmentation, or the obtained group of basic keywords may include only part of keywords obtained based on the search text segmentation.
Step 202, the search device acquires a core keyword set constructed for the candidate documents in a search database. The core keyword set is obtained by extracting, from each candidate document, the document keywords whose importance meets a set condition, merging the document keywords whose content is continuous based on statistics of the coexistence and word distance of every two document keywords in a single candidate document, and constructing the set from the processed document keywords. The importance is determined according to the word frequency of the document keyword in the candidate document from which it is extracted and the occurrence of the document keyword across the candidate documents.
In the embodiment of the application, when the search equipment constructs a core keyword set corresponding to each candidate document in the search database, the search equipment can firstly extract document keywords from each candidate document respectively, then combine a plurality of document keywords with continuous content in all extracted document keywords, and finally construct the core keyword set based on each processed document keyword, wherein the content type of each candidate document comprises any one or combination of technical files and historical interaction records according to actual search requirements.
In the embodiment of the present application, there are two possible ways to construct the core keyword set, which are described below:
Mode 1: after every two continuous document keywords are merged, the merged document keywords and the document keywords that cannot be merged with other document keywords are determined as the core keywords.
Referring to fig. 3A, which is a schematic diagram of a process of building a core keyword set in an embodiment of the present application, the following describes a process performed when building the core keyword set with reference to fig. 3A:
Step 301, the retrieval device extracts document keywords with importance meeting the set conditions from each candidate document.
When executing step 301, the search device uses all candidate documents included in the current search database as candidate documents according to which the core keyword set is extracted, and further, for each candidate document, performs word segmentation processing on one candidate document to obtain corresponding content keywords, performs importance calculation on the content keywords, and screens out content keywords with importance meeting the set condition as document keywords.
The setting conditions according to which the document keywords are obtained by screening can be specifically any one of N content keywords with highest importance, and content keywords with importance values reaching a set threshold, wherein N is a set positive integer. When the importance of the content keywords is calculated, the importance corresponding to the content keywords is determined according to the word frequency of the corresponding content keywords in the extracted candidate documents and the occurrence condition of the corresponding content keywords in each candidate document.
For example, referring to fig. 3B, which is a schematic diagram of a process of determining document keywords for each candidate document according to the embodiment of the present application, it can be known from the content illustrated in fig. 3B that, assuming that the setting condition is that 5 content keywords with the highest importance are screened out, then for each candidate document in the search database, the 5 content keywords with the highest importance can be determined as the screened out 5 document keywords.
In addition, it should be understood that after word segmentation is performed in the candidate documents, each content keyword is directly obtained, and further, by performing importance calculation and screening, part of the content keywords can be screened out and used as document keywords, so that in the process of describing the importance calculation, description can be performed from the angles of the content keywords or the document keywords.
Taking the description of the importance calculation process from the perspective of the document keywords as an example, when determining the importance of any one document keyword in one candidate document, the retrieval device can calculate the initial word frequency of the document keyword in the corresponding candidate document and the inverse document frequency of the document keyword in each candidate document, and then the initial word frequency is subjected to value limiting and shrinking to obtain the target word frequency, and the product result of the target word frequency and the inverse document frequency is used as the value of the importance of the document keyword.
For example, in calculating importance, the following formula is used for calculation:
Here, Key_score is the importance value calculated for a document keyword (assumed to be document keyword 1) in one candidate document (assumed to be candidate document X). The limited word frequency term represents the result obtained after limiting the value of the initial word frequency TF. IDF is the inverse document frequency of document keyword 1 across the candidate documents and represents how document keyword 1 occurs in the candidate documents. α is a preset coefficient whose value is set according to actual processing requirements, for example 4. TF and IDF are calculated in the same way as in the TF-IDF algorithm.
In this way, when keywords are extracted from vertical-domain documents, the importance calculation avoids assigning a low importance value to a specialized term merely because it occurs only a few times in the document, which improves the extraction of document keywords.
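The importance calculation can be illustrated with the sketch below. The exact value-limiting function is not recoverable from this text, so the sketch assumes a simple cap `min(count, alpha)` on the raw count; the IDF variant `log(N / df) + 1` is likewise one common choice rather than the one stated in the application. The names `key_scores` and `top_n` are hypothetical.

```python
import math
from collections import Counter


def key_scores(docs: list[list[str]], alpha: float = 4.0) -> list[dict[str, float]]:
    """Per-document importance: limited word frequency multiplied by IDF."""
    n_docs = len(docs)
    # Document frequency: number of candidate documents containing each word.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        doc_scores = {}
        for word, count in Counter(doc).items():
            limited_tf = min(count, alpha)            # ASSUMED limiting function
            idf = math.log(n_docs / df[word]) + 1.0   # ASSUMED IDF variant
            doc_scores[word] = limited_tf * idf
        scores.append(doc_scores)
    return scores


def top_n(doc_scores: dict[str, float], n: int = 5) -> list[str]:
    """Screen out the n content keywords with the highest importance."""
    ranked = sorted(doc_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


docs = [["seal"] * 6 + ["the", "esign"], ["the", "contract"]]
scores = key_scores(docs)
print(top_n(scores[0], 2))  # ['seal', 'esign']
```

With the cap in place, a word that is repeated many times contributes at most `alpha` to its own score, so rarer domain terms are not pushed out of the top-N selection.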
Step 302, the search device divides every two coexisting document keywords in a single candidate document into a document keyword group.
Specifically, after determining the document keywords respectively determined from the candidate documents, the search device divides each two statistically determined document keywords coexisting in a single candidate document into one document keyword group, in other words, each constructed document keyword group includes two document keywords coexisting in one candidate document.
Optionally, in the mode of determining the keyword groups of each document within the range of each candidate document, each two of the keyword groups of each document belonging to the same candidate document can be combined once to obtain each keyword group of each document belonging to the same candidate document, and then, the keyword groups of each document determined by corresponding to different candidate documents are subjected to de-duplication to finally obtain each processed keyword group of each document.
For example, referring to fig. 3C, which is a schematic diagram of a process of determining document keyword groups in the embodiment of the present application, assume that there are two candidate documents, candidate documents 1 and 2, and that 5 document keywords are extracted from each. By dividing every two of the 5 document keywords belonging to candidate document 1 into one document keyword group, the 10 document keyword groups illustrated in fig. 3C can be obtained for candidate document 1; similarly, 10 document keyword groups can be obtained for candidate document 2. The 20 document keyword groups obtained are then de-duplicated, so that 17 document keyword groups are finally obtained.
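The pairing and de-duplication in step 302 can be sketched as below. The keyword names `k1` to `k7` are placeholders, and the assumption that the two documents share three keywords is chosen only so that the counts reproduce the 20 to 17 figures of the example.

```python
from itertools import combinations


def keyword_groups(doc_keywords: dict[str, list[str]]) -> set[frozenset[str]]:
    """Pair every two document keywords of the same candidate document, then de-duplicate."""
    groups: set[frozenset[str]] = set()
    for keywords in doc_keywords.values():
        for first, second in combinations(sorted(set(keywords)), 2):
            # A frozenset ignores order, so the same pair found in
            # another candidate document is stored only once.
            groups.add(frozenset((first, second)))
    return groups


doc_keywords = {
    "candidate document 1": ["k1", "k2", "k3", "k4", "k5"],
    "candidate document 2": ["k3", "k4", "k5", "k6", "k7"],  # shares k3, k4, k5
}
print(len(keyword_groups(doc_keywords)))  # 17
```

Each document yields 10 pairs; the 3 pairs formed among the shared keywords appear in both documents, leaving 17 distinct groups.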
Step 303, the search device performs the following operations of counting two corresponding document keywords for each divided document keyword group, calculating an average word distance when the two corresponding document keywords appear in each candidate document, and merging the two document keywords into one document keyword when determining that the two document keywords are continuous based on the average word distance.
After dividing the document keyword groups, the search device judges, for each document keyword group, whether the two document keywords it covers are continuous in content, and merges two document keywords with continuous content into one document keyword. Two document keywords may be judged continuous in content when the average word distance determined for a given combination mode equals the word distance that the two document keywords would have if their content were continuous.
It should be noted that two document keywords can be combined in two possible ways, depending on which document keyword comes first. Therefore, in some possible implementations, the calculated average word distance specifically includes the average word distance under each combination mode; in other possible implementations, depending on actual processing requirements, the calculated average word distance may be the average of the word distances under the combination mode that yields the minimum word distance.
When calculating the average word distance between two document keywords in a combination mode, determining at least one coexisting document corresponding to the two document keywords in each candidate document, and then respectively calculating the word distance between the two document keywords in the at least one coexisting document to further calculate the average value of the at least one word distance to obtain the average word distance, wherein the word distance between the two document keywords specifically refers to the character distance between characters at the appointed position of the two document keywords, for example, the shortest character distance between the head characters of the two document keywords in the document is used as the word distance, or the shortest character distance between the tail characters of the two document keywords in the document is used as the word distance.
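The word distance and its average can be sketched as follows, taking the head-character variant and treating the word distance for one combination mode as the shortest distance at which the second keyword starts after the first. That reading, the function names, and the sample strings are assumptions for illustration.

```python
from typing import Optional


def find_all(text: str, word: str) -> list[int]:
    """Return the start index of every (possibly overlapping) occurrence of word."""
    positions, start = [], text.find(word)
    while start != -1:
        positions.append(start)
        start = text.find(word, start + 1)
    return positions


def word_distance(text: str, first: str, second: str) -> Optional[int]:
    """Shortest head-to-head character distance with `first` preceding `second`."""
    gaps = [s - f
            for f in find_all(text, first)
            for s in find_all(text, second)
            if s > f]
    return min(gaps) if gaps else None


def average_word_distance(docs: list[str], first: str, second: str) -> Optional[float]:
    """Average the word distance over the coexisting documents for one combination mode."""
    distances = [d for d in (word_distance(t, first, second) for t in docs)
                 if d is not None]
    return sum(distances) / len(distances) if distances else None


docs = ["ABCD----AB", "zzABCD"]
print(average_word_distance(docs, "AB", "CD"))  # 2.0
print(average_word_distance(docs, "CD", "AB"))  # 6.0
```

Documents in which the ordered pair never occurs are left out of the average, so only coexisting documents contribute.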
Based on this, in the process of merging document keywords, there are two possible processing manners:
In one feasible document keyword merging process, the retrieval device can perform the following operations for each document keyword group: for each of the two combination modes determined by the two document keywords, count the average word distance of the two document keywords in the candidate documents; when the two document keywords are determined to be continuous according to either average word distance, merge them into one document keyword according to the corresponding combination mode. When merging, the two document keywords can be spliced in the corresponding combination mode and the consecutive repeated characters in the splicing result removed to obtain the merged document keyword; the repeated characters may be a single character or multiple characters.
For example, referring to FIG. 3D, which is a schematic illustration of a process for determining an average word distance in an embodiment of the present application, assume that a document keyword group includes "batch" and "sign", and that candidate documents 1 and 2 are the two coexisting documents determined for the document keyword group. For the combination in which "batch" comes first and "sign" follows, the word distance determined in candidate document 1 is 2 characters and that in candidate document 2 is 2 characters; for the combination in which "sign" comes first and "batch" follows, the word distance determined in candidate document 1 is 26 characters and that in candidate document 2 is 20 characters. The average word distance is therefore 2 characters for the first combination and 23 characters for the second.
For another example, for the document keywords "seal authorization" and "authorizer", in the case where it is determined that the average word distance of the header character is 2 characters, it may be determined that the two document keywords may be combined, and the combined result is "seal authorizer".
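Splicing two continuous document keywords and removing the repeated characters can be sketched as an overlap merge. The function name `merge_keywords` and the placeholder strings are hypothetical; the translated example strings above do not overlap character by character in English, so abstract strings are used instead.

```python
def merge_keywords(first: str, second: str) -> str:
    """Splice first + second, keeping the run shared by first's tail and second's head once."""
    max_overlap = min(len(first), len(second))
    # Try the longest possible overlap first so that multi-character repeats are removed.
    for size in range(max_overlap, 0, -1):
        if first.endswith(second[:size]):
            return first + second[size:]
    return first + second  # no repeated characters at the junction


print(merge_keywords("ABCD", "CDE"))  # ABCDE
print(merge_keywords("AB", "CD"))     # ABCD
```

In the first call the head characters of the two keywords are 2 characters apart in the merged result, which corresponds to the kind of head-character word distance used above to judge continuity.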
In other feasible document keyword merging processes, the search device can perform the following operations for each document keyword group: determine the minimum word distance at which the two document keywords appear in each candidate document, together with the combination mode corresponding to that minimum word distance; obtain the average word distance from the minimum word distances; and, when the two document keywords are determined to be continuous according to the average word distance, merge them into one document keyword according to the corresponding combination mode. When merging, the two document keywords can be spliced in the corresponding combination mode and the consecutive repeated characters in the splicing result removed to obtain the merged document keyword; the repeated characters may be a single character or multiple characters.
For example, referring to FIG. 3E, which is a schematic diagram of another process for determining an average word distance in an embodiment of the present application, assume that a document keyword group includes "batch" and "sign", and that candidate documents 1 and 2 are the two coexisting documents determined for the document keyword group. In candidate document 1, the word distance is 2 characters for the combination in which "batch" comes first and "sign" follows, and 26 characters for the combination in which "sign" comes first and "batch" follows, so the minimum word distance determined in candidate document 1 is 2 characters. Similarly, the minimum word distance determined for candidate document 2 is 2 characters, and thus the average word distance can be determined to be 2 characters.
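The minimum-distance variant can be sketched with the per-document word distances given as input. The 2/26 and 2/20 values are the ones from the "batch"/"sign" example; the data layout, the function name, and the rule of taking the most frequent minimising combination are assumptions for illustration.

```python
from collections import Counter

Order = tuple[str, str]


def min_distance_average(per_doc: dict[str, dict[Order, int]]) -> tuple[float, Order]:
    """Average the per-document minimum word distance and report the usual combination mode."""
    minima: list[int] = []
    orders: Counter = Counter()
    for distances in per_doc.values():
        # Pick the combination mode with the smallest word distance in this document.
        order, distance = min(distances.items(), key=lambda kv: kv[1])
        minima.append(distance)
        orders[order] += 1
    return sum(minima) / len(minima), orders.most_common(1)[0][0]


per_doc = {
    "candidate document 1": {("batch", "sign"): 2, ("sign", "batch"): 26},
    "candidate document 2": {("batch", "sign"): 2, ("sign", "batch"): 20},
}
print(min_distance_average(per_doc))  # (2.0, ('batch', 'sign'))
```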
Step 304, the retrieval device constructs a core keyword set based on the processed keywords of each document.
In a feasible implementation manner, after the search device merges two document keywords in the document keywords corresponding to the continuous content, the document keywords obtained by merging and the document keywords which cannot be merged with other document keywords can be used as each core keyword to construct a corresponding core keyword set.
In other possible implementations, the search device may repeatedly execute steps 301 to 304 for the document keywords obtained by merging and the document keywords that cannot be merged with other document keywords until a preset number of repeated executions is satisfied, and construct the core keyword set based on the document keywords obtained by merging obtained after the last execution of the steps and the determined document keywords that cannot be merged with other document keywords.
In this way, by merging the document keywords with continuous content from the document keywords with importance meeting the set conditions, adverse effects caused by incorrect word segmentation can be made up, so that the determined core keywords can include words which are not frequently appeared, and effective core keywords can be sorted out for any type of candidate documents.
Mode 2: after every two continuous document keywords are merged, a keyword graph is constructed from the merged document keywords and the document keywords that cannot be merged with other document keywords, and the core keywords are derived according to the node structure in the keyword graph.
Referring to fig. 4A, which is a schematic diagram of another process for constructing a core keyword set according to an embodiment of the present application, the following describes the related construction process with reference to fig. 4A:
Step 401, the retrieval device extracts document keywords with importance meeting the set conditions from each candidate document.
The search device may perform the same processing as the step 301 when executing the step 401, and the present application will not be described here.
Step 402, the search device divides every two coexisting document keywords in a single candidate document into a document keyword group.
The search device may perform the same processing as step 302 when executing step 402, and the present application will not be described in detail herein.
Step 403, the search device performs the following operations of counting two corresponding document keywords for each divided document keyword group, calculating an average word distance when the two corresponding document keywords appear in each candidate document, and merging the two document keywords into one document keyword when determining that the two document keywords are continuous based on the average word distance.
The search device performs step 403, and may perform the processing in the same manner as in step 303, and the present application will not be described in detail herein.
It should be understood that the coexistence of two document keywords in a candidate document specifically means that the two document keywords appear in the same text, where, depending on the type of the input document, the text can be the document itself or a paragraph in the document; the shorter the text, the better the effect. When calculating the average word distance for one document keyword group, the ratio of the accumulated character distance of the document keyword group over the candidate documents to the total number of candidate documents containing both document keywords of the group can be used as the average word distance.
Furthermore, when merging document keywords by means of the average word distance, a prefix-suffix relationship may be formed between two document keywords in a pair in a determined document keyword group, and when the distance between the prefixes and the suffixes is equal to the average word distance, the two document keywords may be merged into one document keyword.
Step 404, the retrieval device performs, for each processed document keyword, the following operation: counting, among the processed document keywords, the total number of other keywords that coexist with the document keyword in a single candidate document.
Specifically, in performing step 404, the retrieval device may perform, for each processed document keyword, the operations of determining a candidate document containing the document keyword, and counting the total number of processed document keywords included in the candidate document, and determining the total number of other keywords coexisting with the document keyword in a single candidate document based on the determined total number of document keywords.
For example, taking a candidate document as an example, the document keywords extracted from the candidate document are determined, and after the keywords with continuous content are combined, the processed document keywords included in the candidate document can be determined. Further, for each processed document keyword, the inclusion in the candidate document can be determined, thereby counting the total number of corresponding other keywords.
Step 405, the retrieval device takes each processed document keyword as a graph node and establishes a directed connecting edge between graph nodes whose associated totals of other keywords differ, to obtain the constructed keyword graph.
In the embodiment of the application, the retrieval equipment constructs a keyword graph according to the coexistence relation among the processed document keywords.
Specifically, each processed document keyword is taken as a graph node, and a directed connecting edge is established between graph nodes whose associated totals of other keywords differ, finally yielding the constructed keyword graph. The directed connecting edge either points from the graph node with the smaller associated total to the graph node with the larger associated total, or from the graph node with the larger associated total to the graph node with the smaller one.
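A minimal sketch of the graph construction in step 405 follows. The name build_keyword_graph and the low_to_high flag are hypothetical, and the sketch assumes that edges are only drawn between keywords that coexist in at least one candidate document, consistent with building the graph from the coexistence relation.

```python
from itertools import combinations

def build_keyword_graph(doc_keywords, counts, low_to_high=True):
    """Return a set of directed edges (source, target) between coexisting keywords.

    doc_keywords: document id -> set of processed keywords in that document.
    counts: keyword -> total number of other keywords it coexists with.
    low_to_high: direction convention; either direction is allowed by the text.
    """
    edges = set()
    for keywords in doc_keywords.values():
        for a, b in combinations(sorted(keywords), 2):
            if counts[a] == counts[b]:
                continue  # no difference in totals, so no edge
            low, high = (a, b) if counts[a] < counts[b] else (b, a)
            edges.add((low, high) if low_to_high else (high, low))
    return edges
```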
Step 406, the search device merges every three graph nodes meeting the merging condition into the graph nodes corresponding to the corresponding combined keywords in the keyword graph, wherein the merging condition is that the preset three-node structure is met, and the combined keywords obtained by the corresponding three document keywords exist in each candidate document.
In executing step 406, the retrieving device merges every three graph nodes that satisfy the merging condition into the graph nodes corresponding to the respective combined keywords in the constructed keyword graph.
Specifically, the search device can search the keyword graph for three graph nodes that match the preset three-node structure, and then merge the document keywords corresponding to the three graph nodes according to a preset content merging mode to obtain a combined keyword. When the total number of occurrences of the combined keyword across the candidate documents is higher than a set threshold, the merging operation is determined to be valid, so the three graph nodes can be merged into one new graph node and the original three graph nodes deleted. The value of the set threshold is set according to actual processing requirements. When merging to obtain the combined keyword, the three document keywords can be spliced by suffix splicing; one possible implementation of suffix splicing is to concatenate the contents directly and then deduplicate the single or multiple characters repeated in the splicing result. The preset content merging mode may be to combine the three document keywords arbitrarily to obtain all possible combined keywords, or, where combination rules have been compiled in advance, to constrain the combination according to the determined rules.
It should be noted that the three-node structure is either of the following: two graph nodes point to the same graph node; or two graph nodes point to the same graph node and a connecting edge also exists between those two graph nodes.
For example, referring to fig. 4B, which is a schematic diagram of the preset three-node structure in the embodiment of the present application: as shown in fig. 4B, assume that graph node X and graph node Z both point to graph node Y, or that graph node X and graph node Z both point to graph node Y and a connecting edge exists between graph node X and graph node Z. If splicing is preset to follow the order X-Y-Z, the keywords of graph node X, graph node Y and graph node Z may be suffix-spliced in the order X-Y-Z to obtain a new combined keyword. When the combined keyword is determined to be valid, that is, when its total number of occurrences across the candidate documents reaches the set threshold, the three graph nodes isomorphic to the preset three-node structure are merged in the keyword graph to obtain a new graph node, and the combined keyword is used as the keyword corresponding to the new graph node.
Therefore, by means of the preset three-node structure, analysis and arrangement of the graph node structure can be carried out in the keyword graph, so that the complexity of the keyword graph is reduced, keywords with continuous content are combined, and the influence caused by bad word segmentation can be further reduced.
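The suffix splicing and validity check used in step 406 can be sketched as follows. The names splice and merge_if_valid are hypothetical. The sketch assumes that deduplication removes the longest overlap between the end of one keyword and the start of the next, and that a combined keyword is valid once its occurrence count reaches the threshold.

```python
def splice(left, right):
    """Concatenate two keywords, dropping the longest suffix/prefix overlap."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return left + right

def merge_if_valid(x, y, z, documents, threshold):
    """Splice keywords in X-Y-Z order; keep the result only if it is frequent enough.

    Assumption: validity means the combined keyword occurs at least
    `threshold` times in total across the candidate documents.
    """
    combined = splice(splice(x, y), z)
    occurrences = sum(document.count(combined) for document in documents)
    return combined if occurrences >= threshold else None
```

For instance, splice("keyword", "wordgraph") yields "keywordgraph" because the shared "word" is kept only once.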
Step 407, the search device constructs a core keyword set based on keywords corresponding to each graph node in the keyword graph.
After finishing the arrangement of the keyword graphs, the search equipment can respectively use the keywords corresponding to each graph node in the keyword graphs as core keywords to construct a corresponding core keyword set.
Further, after completing the construction of the core keyword set, in the execution process of step 202, the search device may acquire the core keyword set constructed corresponding to each candidate document in the search database.
In addition, it should be understood that in the embodiment of the present application, the search device may obtain the candidate documents uploaded by other devices in response to the candidate document upload instruction triggered by the other devices, update the content of the search database based on the newly obtained candidate documents, and re-use the processing manner of the first or second mode to construct the core keyword set according to the updated search database.
In this way, in the processing of steps 401-407, the document keywords with continuous content can be effectively determined according to the average word distance and the node structure relationship among the document keywords, so that the influence caused by poor word segmentation effect is compensated, and the guarantee is provided for the construction effect of the core keyword set.
Step 203, the retrieval device determines, among the core keywords, the candidate keywords that meet the preset screening condition according to the similarity between the overall text feature extracted for the group of basic keywords and the word feature extracted for each core keyword.
Specifically, when extracting the candidate keywords that meet the preset screening condition from the core keywords in the core keyword set, the search device can extract the corresponding overall text feature for the obtained group of basic keywords, and then calculate the similarity between the overall text feature and the word feature of each core keyword to obtain the similarity between each core keyword and the current search requirement. The H most similar core keywords are screened out as the candidate keywords that meet the preset screening condition, where the value of H is set according to actual processing requirements.
When extracting the overall text feature and the word features of the core keywords, various feasible text feature extraction modes can be adopted, and the application does not specifically limit this.
For example, a feasible word embedding (Embedding) model may be adopted: word features are extracted for each basic keyword in a group of basic keywords and fused to obtain the overall text feature, and word features are extracted for each core keyword; after similarity calculation using the Embedding model, the 3 core keywords with the highest similarity are output.
For example, referring to fig. 4C, which is a schematic diagram of a process of screening candidate keywords in an embodiment of the present application: as illustrated in fig. 4C, assuming that the current group of basic keywords includes Key1 to Key4, and that the candidate keywords Key5 and Key7 are screened out after similarity calculation against the core keyword set, the search keywords illustrated in fig. 4C are obtained.
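The screening in step 203 can be sketched as follows, working on precomputed vectors so that no particular embedding model is implied. The names cosine, mean_vector and top_h_candidates are hypothetical, and both the mean fusion and the cosine similarity are assumptions; the application leaves the feature extraction and similarity measure open.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Fuse word features into one overall text feature (assumed: element-wise mean)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def top_h_candidates(base_vectors, core_vectors, h):
    """Return the h core keywords most similar to the fused basic-keyword feature.

    base_vectors: list of word features, one per basic keyword.
    core_vectors: dict mapping each core keyword to its word feature.
    """
    overall = mean_vector(base_vectors)
    ranked = sorted(core_vectors,
                    key=lambda keyword: cosine(overall, core_vectors[keyword]),
                    reverse=True)
    return ranked[:h]
```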
Step 204, the search device determines at least one basic keyword and each candidate keyword as each search keyword.
Specifically, the search device may determine the at least one basic keyword that currently characterizes the search requirement, together with the newly screened candidate keywords, as the search keywords for use in the subsequent search process.
Further, referring to fig. 5A, which is a schematic diagram of a process of searching based on each search keyword in the embodiment of the present application, a search process executed according to each search keyword after each search keyword is obtained is described below with reference to fig. 5A:
Step 501, the search device searches the search database according to the preset M search modes in combination with the search keywords, and determines the initial documents each associated with at least one search keyword.
In the embodiment of the present application, when executing step 501, the search device needs to determine the M search modes on which the search is based.
The M search modes can be obtained by obtaining custom search logic and determining search modes according to that logic, and by obtaining search modules defined by an external system and loading them to obtain the corresponding search modes. The custom search logic can be defined by a development object or by a target object using the search function; by loading a search module of an external system, the search device can call the search mode provided by that system; and the value of M is determined by the search modes actually obtained.
In this way, the processing can be carried out according to a plurality of feasible search modes in the search stage, so that the search results of the plurality of search modes can be comprehensively processed in the search process, and the used search modes possibly comprise customized search logic, so that the personalized customization of the search modes is supported, and the improvement of the search effect is facilitated.
Further, after determining M search modes according to which the search is based, the search device determines each initial document associated with at least one search keyword among candidate documents in the search database in accordance with the M search modes in combination with each search keyword.
It should be noted that the application is not limited to the search logic corresponding to the M search modes, and various feasible text search logic may be selected for processing according to actual processing requirements, for example, the feasible search logic may include content matching based on each search keyword, vector similarity matching with each candidate text for each search keyword, and the like.
In the embodiment of the application, in the process of retrieving each initial document according to M retrieval modes, the following two execution modes can be adopted for retrieval, and the following two execution modes are respectively described:
Execution mode 1: perform serial search according to the M search modes.
Specifically, in the processing procedure of the first execution mode, the search device combines the search keywords according to the preset M search modes in the search database, and after performing M search processes in series, obtains each initial document which is determined by the last search process and is respectively associated with at least one search keyword.
It should be noted that, in the serial search process, the search device may determine the execution sequence of various search modes according to the actual processing requirement, based on which, for any one search process, the corresponding search mode may be determined according to the preset execution sequence, where in the serial process, one search process corresponds to one search mode.
Taking the processing in any search process other than the last one as an example, the search device performs the following operations: determine the candidate documents on which the search is based and the search mode used (assume the current search process uses search mode Z); using search mode Z, determine, according to the search keywords, the matching documents among those candidate documents and their corresponding content matching degrees; for each of those candidate documents, calculate the accumulated value of the content matching degree up to the current search process; and screen out a specified number of target documents with the highest accumulated values as the candidate documents on which the next search process is based. The specified number is determined according to the execution order of the search processes, and the specified numbers determined for different search processes may differ.
It should be noted that, for each candidate document according to one search process, each candidate document according to the first search process is all candidate documents in the search database, each candidate document according to the other search process is obtained after screening the previous search process, and in the case that there are M search modes, there are M corresponding search processes. In addition, in each retrieval process, according to each retrieval keyword, the content matching degree corresponding to each matching document and each matching document can be retrieved and determined in each candidate document according to the retrieval basis, wherein the content matching degree can be obtained in various feasible processing modes in the retrieval process in the field, the application is not particularly limited in this regard, and the content matching degree is used for describing the matching degree between the retrieved matching document and the retrieval requirement.
For example, referring to fig. 5B, which is a schematic diagram of a process of performing search processes in series in the embodiment of the present application: as illustrated in fig. 5B, assume that the value of M is 2, that is, there are two search modes, search mode 1 and search mode 2, with the number of matching documents retrieved by search process 1 set to 6 and the number retrieved by search process 2 set to 3. Then, when the two search processes are executed in series, search mode 1 is first used to retrieve, according to the search keywords, the 6 candidate documents with the highest content matching degree in the search database, assumed to be candidate documents 1-6. Next, search mode 2 is used to search within candidate documents 1-6 according to the search keywords, obtaining the content matching degrees that search mode 2 determines for the matching documents it retrieves. Based on the content matching degrees output by search mode 2 for some or all of candidate documents 1-6, the accumulated value of the content matching degree up to search process 2 is then calculated for each of candidate documents 1-6.
Continuing with fig. 5B and taking candidate document 1, which is retrieved by search mode 2, as an example: the content matching degrees of candidate document 1 under the two search modes can be weighted and superposed according to the weights configured for search mode 1 and search mode 2 to obtain the accumulated value of the content matching degree. That is, score1.1 is obtained by weighting score1, and score1.2 is obtained by weighting the content matching degree determined by search mode 2, where the weights are set according to actual processing requirements. In particular, assuming that candidate document 2 is not retrieved by search mode 2, score2.2 illustrated in fig. 5B has a value of 0.
In this way, in the serial processing process, the input of the search process executed later is the output of the search process executed earlier, so that the search range can be gradually narrowed down by executing the serial search process, and finally, the initial document meeting the requirement can be effectively searched according to the accumulated value of the content matching degree determined by each search process for the candidate document.
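The serial execution mode can be sketched as a small pipeline. The name serial_search and the stage tuple layout are hypothetical; each scoring function stands in for one search mode and is assumed to return 0 for a document it does not match.

```python
def serial_search(candidates, stages):
    """Run search stages in series, narrowing the candidate set after each stage.

    candidates: iterable of document ids (all documents for the first stage).
    stages: list of (score_fn, weight, keep) tuples, where score_fn(doc) returns
            the content matching degree under one search mode (0 if unmatched),
            weight is that mode's weight, and keep is the number of documents
            passed on to the next stage.
    Returns a dict of surviving document ids to accumulated weighted scores.
    """
    cumulative = {doc: 0.0 for doc in candidates}
    for score_fn, weight, keep in stages:
        for doc in cumulative:
            cumulative[doc] += weight * score_fn(doc)
        best = sorted(cumulative, key=cumulative.get, reverse=True)[:keep]
        cumulative = {doc: cumulative[doc] for doc in best}
    return cumulative
```

Because each stage only scores the survivors of the previous one, the output of an earlier search process is the input of the next, as in fig. 5B.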
Execution mode 2: perform parallel search according to the M search modes.
In the processing of the second execution mode, the search equipment respectively combines the search keywords according to the preset M search modes in a search database, parallelly executes M search processes to obtain at least one matching document of the association content matching degree determined in each search process, and screens out each initial document from each matching document based on the at least one content matching degree respectively associated with each matching document.
Specifically, when screening the initial documents from the determined matching documents based on the at least one content matching degree associated with each matching document, the search device may perform the following operations for each matching document: determine the at least one content matching degree associated with the matching document; determine the target search mode corresponding to each of those content matching degrees; and determine the matching degree fusion value of the matching document according to the mode weights preset for the M search modes, in combination with the content matching degree under each target search mode. The matching documents whose matching degree fusion values meet the preset condition are then determined as the initial documents.
In the parallel searching process, the searching device can adopt M searching modes to execute M searching processes in parallel in each candidate document included in the searching database, and then, Q matching documents with highest matching degree fusion values are screened out by sorting the matching degree fusion values of the matching documents in the M searching processes to serve as each initial document meeting preset conditions, wherein the value of Q is set according to actual processing requirements, and the specific limitation is not adopted.
In some possible implementations, taking the calculation of the matching degree fusion value for one matching document as an example, the fusion value can be obtained by weighted superposition of the content matching degrees determined for the matching document by the M search modes, according to the mode weights set for the M search modes. When the search result of a search mode does not include the matching document, the content matching degree of the matching document under that search mode is taken as 0.
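The weighted superposition described above can be sketched as follows. The name fuse_weighted and the input layout are hypothetical. A document missing from one mode's results simply contributes nothing for that mode, which is equivalent to treating its content matching degree there as 0.

```python
def fuse_weighted(results, weights, q):
    """Fuse per-mode content matching degrees and return the top-q documents.

    results: search mode -> {document id: content matching degree}.
    weights: search mode -> mode weight.
    q: number of initial documents to keep.
    """
    fused = {}
    for mode, scores in results.items():
        for doc, score in scores.items():
            fused[doc] = fused.get(doc, 0.0) + weights[mode] * score
    return sorted(fused, key=fused.get, reverse=True)[:q]
```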
In another possible implementation, a Reciprocal Rank Fusion (RRF) hybrid search scoring algorithm may be adopted to calculate the matching degree fusion value of a matching document from its ranks in the individual search processes, where the ranks are determined in descending order of content matching degree. When the search result of a search mode does not include the matching document, the rank of the matching document under that search mode may be set by default to a value far greater than the maximum rank in the matching results.
For example, the matching degree fusion value of a matching document can be calculated using the following formula:
$$\mathrm{value} = \sum_{i=1}^{M} \frac{\mathrm{weight}_i}{k + \mathrm{rank}_i}$$
where value is the calculated matching degree fusion value, weight_i is the mode weight corresponding to the i-th search mode, rank_i is the rank of the matching document under the i-th search mode, and k is a constant that can be set to 60 by default.
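The rank-based fusion can be sketched as follows, assuming the usual Reciprocal Rank Fusion form in which each search mode contributes weight / (k + rank) to a document's fusion value. The name rrf_fuse and the missing_rank default are hypothetical; missing_rank plays the role of the large default rank assigned to documents that a search mode did not return.

```python
def rrf_fuse(rankings, weights, k=60, missing_rank=10_000):
    """Order documents by weighted reciprocal rank fusion.

    rankings: search mode -> list of document ids in descending matching degree.
    weights: search mode -> mode weight.
    k: smoothing constant (60 by default).
    missing_rank: assumed default rank for a document absent from a mode's results.
    """
    documents = {doc for ranked in rankings.values() for doc in ranked}
    fused = {}
    for doc in documents:
        total = 0.0
        for mode, ranked in rankings.items():
            rank = ranked.index(doc) + 1 if doc in ranked else missing_rank
            total += weights[mode] / (k + rank)
        fused[doc] = total
    # ties are broken by document id so the order is deterministic
    return sorted(fused, key=lambda doc: (-fused[doc], doc))
```

Because only ranks enter the calculation, search modes whose content matching degrees are on different scales can be fused without normalizing the scores first.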
For example, referring to fig. 5C, which is a schematic diagram of a process of executing search processes in parallel in the embodiment of the present application: as illustrated in fig. 5C, assume that the value of M is 2, that is, there are two search modes, search mode 1 and search mode 2, that the number of matching documents retrieved by each of search processes 1 and 2 is set to 6, and that the finally determined search result includes 6 initial documents. When the two search processes are executed in parallel, the 6 matching documents determined by search mode 1 are candidate documents 1-6, ranked in descending order of content matching degree, and the 6 matching documents determined by search mode 2 are, in descending order of content matching degree, candidate documents 1, 3, 5, 6, 7 and 8. A matching degree fusion value is then determined for each obtained matching document, and ranking in descending order of matching degree fusion value gives candidate documents 1, 3, 5, 6, 2, 7, 8 and 4, so the 6 initial documents finally determined are candidate documents 1, 3, 5, 6, 2 and 7. In fig. 5C, the subscript of score denotes the rank by content matching degree, and the subscript of value denotes the rank by matching degree fusion value.
In the parallel processing process, the search process is executed in parallel, comprehensive screening is carried out according to the search results of each search process, and the search results of each search process are integrated into a final search result, so that the initial document meeting the requirements can be effectively searched on the basis of the matching conditions under the comprehensive various search processes.
Step 502, the retrieval device obtains a constructed document structure diagram, wherein the document structure diagram comprises nodes which are respectively constructed corresponding to various document data blocks in each candidate document and connecting edges which are used for representing content link relations and content attribution relations.
In the embodiment of the application, in order to analyze each candidate document included in the search database from the perspective of the document structure, the search device may construct a corresponding document structure diagram for each candidate document included in the search database after constructing a core keyword set for the search database. Based on this, in the processing procedure of step 502, the retrieving device may acquire a pre-constructed document structure diagram for processing, where the document structure diagram includes nodes respectively constructed corresponding to various document data blocks in each candidate document, and connection edges for characterizing a content link relationship and a content attribution relationship, where each node has multiple types and is used for corresponding to data blocks in different structures in the document, and each node is associated with corresponding attribute content.
The following describes a procedure for constructing a document structure diagram:
When constructing a document structure diagram, firstly, document data of each candidate document is required to be segmented and stored as nodes in a diagram database, corresponding core keywords are matched for the node content of each node according to a constructed core keyword set, and then, connection relations among the nodes of the diagram database are generated according to the containing or linking relations among the candidate documents. Based on the method, after the corresponding document structure diagram is constructed based on the connecting edges between the nodes, the link relation between the candidate documents and the attribution condition of the core keywords in the candidate documents can be represented in the document structure diagram.
For example, referring to Table 1, a schematic representation of the document structure determined for segmenting candidate documents in an embodiment of the application is shown:
TABLE 1
As can be seen from Table 1, "father" indicates the upper-level content to which a data block belongs, and "link" indicates the link condition; the link relation with other documents can be determined from the content indicated by "link", and the covered core keywords are indicated by "doc_corpus".
It should be noted that, the construction of the document structure diagram and the update of the core keyword set may be synchronous, and when the core keyword set is updated, the corresponding document structure diagram needs to be updated.
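The construction of the document structure diagram can be sketched as follows. This is an illustration under stated assumptions: the field names father, link and doc_corpus follow the description of Table 1, while DocNode, build_structure_graph, the block tuple layout and the substring-based keyword matching are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DocNode:
    node_id: str
    content: str
    father: Optional[str] = None                          # node this block belongs to
    link: List[str] = field(default_factory=list)         # documents this block links to
    doc_corpus: List[str] = field(default_factory=list)   # core keywords it covers

def build_structure_graph(blocks, core_keywords):
    """Build nodes and typed edges from segmented document blocks.

    blocks: iterable of (node_id, content, father_id_or_None, linked_doc_ids).
    core_keywords: the constructed core keyword set, as a list.
    Returns (nodes, edges), where edges are (source, target, relation) tuples.
    """
    nodes, edges = {}, []
    for node_id, content, father, links in blocks:
        # assumed matching rule: a keyword is covered if it occurs in the content
        matched = [keyword for keyword in core_keywords if keyword in content]
        nodes[node_id] = DocNode(node_id, content, father, list(links), matched)
        if father is not None:
            edges.append((node_id, father, "father"))
        edges.extend((node_id, target, "link") for target in links)
    return nodes, edges
```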
Step 503, the searching device clusters the initial documents with the link relation according to the document structure diagram to obtain target document sets of various associated search keywords, and respectively generates search results based on the target document sets.
When performing step 503, the search device may perform the following operations for each initial document: determine, among the search keywords, the at least one hit keyword contained in the initial document; then, for each hit keyword, determine the target node to which the hit keyword belongs in the document structure diagram, and, when it is determined that the target node or a child node belonging to the target node has a connecting edge linking to another initial document, cluster the initial document with that other initial document.
Specifically, the search apparatus may determine, for each search keyword, at least one hit document including one search keyword among the initial documents, and thus may determine, for each initial document, a corresponding at least one hit keyword among the search keywords. Furthermore, in order to ensure that the initial documents of the clusters have extremely strong relevance, the initial documents can be clustered according to the determined link relation by searching whether links to other initial documents exist in the document data block with the largest relevance of the hit keywords from the hit keywords in the initial documents.
When determining the document data block most relevant to a hit keyword, the target node to which the hit keyword belongs and the child nodes of that target node can be determined in the document structure diagram. Taking the processing of one initial document as an example, after the target node and child nodes corresponding to the initial document are determined, the document structure diagram is searched for connecting edges from the determined target node and child nodes to other initial documents; when such other initial documents exist, they are clustered with the currently processed initial document. In addition, during clustering, a corresponding clustering result can be obtained for each link relation.
For example, if it is determined that the initial document 1 has a link relationship with both the initial document 2 and the initial document 3, the initial document 1 and the initial document 2 may be clustered to obtain a clustering result, and the initial document 1 and the initial document 3 may be clustered to obtain a clustering result.
In this way, the link relation among the documents is determined according to the document structure diagram, so that the condition that one initial document needs to be linked with other initial documents to carry out auxiliary explanation of the content can be identified, and the usability of the search result can be improved by clustering the initial documents with close association.
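The link-based clustering in step 503 can be sketched as follows. The name cluster_by_links and the four lookup dictionaries are hypothetical stand-ins for queries against the document structure diagram. Each link relation yields its own two-document clustering result, as in the example with initial documents 1, 2 and 3.

```python
def cluster_by_links(initial_docs, node_doc, node_keywords, children, links, search_keywords):
    """Pair each initial document with the other initial documents it links to.

    node_doc: node id -> id of the document that owns the node.
    node_keywords: node id -> set of keywords covered by the node.
    children: node id -> list of child node ids.
    links: node id -> list of document ids the node links to.
    Returns a list of (document, linked document) clustering results.
    """
    initial = set(initial_docs)
    wanted = set(search_keywords)
    clusters = []
    for doc in initial_docs:
        seen = set()
        for node, owner in node_doc.items():
            # target nodes: nodes of this document that contain a hit keyword
            if owner != doc or not (node_keywords.get(node, set()) & wanted):
                continue
            for candidate in [node, *children.get(node, [])]:
                for target in links.get(candidate, []):
                    if target in initial and target != doc and target not in seen:
                        seen.add(target)
                        clusters.append((doc, target))
    return clusters
```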
In addition, the applicant carries out a series of tests to determine that the retrieval mode provided by the application can generate a better vertical document retrieval effect under the condition of little or no manual intervention.
Referring to table 2, a schematic table for comparing the processing effects in the embodiment of the application is shown:
TABLE 2
About 1000 keyword combinations and about 3000 search questions were used in the actual measurement. The accuracy is calculated as ACC = (number of questions for which a relevant answer is successfully obtained) / (total number of questions).
Therefore, the technical scheme provided by the application can obtain very good retrieval effect under various retrieval scenes.
In summary, in the embodiment of the present application, when extraction of a search keyword and specific search processing are implemented, related search keyword extraction and search processes may be performed by constructing a search system including different functional modules.
Referring to fig. 6, which is a schematic diagram of system functions in an embodiment of the present application, the following describes functional modules involved in systematically performing a search process with reference to fig. 6:
Keyword analysis module: finds important words, phrases and terms in the candidate documents of the search database by means of word frequency statistics and word coexistence structure analysis, and constructs the core keyword set.
Database importing/converting module: analyzes the document structure, splits it into the nodes required by the document graph database (also called the document structure diagram), and configures indexes and relations.
Input analysis module: decomposes an input natural language query sentence (also called the search text) into several subtasks and extracts keywords from each subtask to obtain groups of basic keywords representing different search requirements.
Coarse ranking module: fuses the results of the search algorithm modules in a pipeline (Pipeline) manner, determines the nodes corresponding to the initial documents in a parallel or serial manner, and then uses the graph node search module, which implements document clustering according to the structure and connection relations of the nodes in the graph, to obtain the document search results that meet the conditions.
Fine ranking module: scores and ranks the candidate document search results by fusing the results of the scoring algorithm modules in a pipeline (Pipeline) manner. The scoring algorithm determines a score for each document search result obtained by clustering initial documents according to the importance of the search keywords that the result covers. Because the search keywords include basic keywords determined directly from the search text and core keywords determined from the core keyword set, the importance of a search keyword can be evaluated by an SRL model, or can be calculated for each core keyword when the core keyword set is constructed. The score can be calculated in any feasible manner in the field, for example by directly summing the importance values of the covered search keywords.
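The simplest scoring option mentioned for the fine ranking stage, summing the importance values of the covered search keywords, can be sketched as follows. The names score_result and rank_results and the input layout are hypothetical, and the importance values are assumed to have been computed elsewhere.

```python
def score_result(result_docs, doc_keywords, importance):
    """Score one clustered search result by the keywords its documents cover.

    result_docs: document ids in the clustered result.
    doc_keywords: document id -> set of search keywords the document contains.
    importance: search keyword -> importance value.
    """
    covered = set()
    for doc in result_docs:
        covered |= doc_keywords.get(doc, set())
    # each covered keyword is counted once, however many documents contain it
    return sum(importance.get(keyword, 0) for keyword in covered)

def rank_results(results, doc_keywords, importance):
    """Order clustered results from highest to lowest score."""
    return sorted(results,
                  key=lambda docs: score_result(docs, doc_keywords, importance),
                  reverse=True)
```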
In particular, according to the content illustrated in FIG. 6, the content index refers to the index of the relationship between the title and the content, the Link relationship refers to the document Link relationship, and the indirect Link (INDIRECTLINK) relationship refers to the indirect Link relationship, e.g., in the case where there is a Link X in one paragraph under the title, the relationship between the other paragraphs under the title and the Link X may be referred to as INDIRECTLINK relationship.
It should be noted that the document structure diagram and the core keyword set must be constructed over all documents in the search data set. Therefore, optionally, when the online system obtains imported document data, the document structure diagram and the core keyword set may be updated immediately, or they may be updated offline during idle service time, for example in the early morning.
The following describes, with reference to the accompanying drawings, a retrieval process related to a vertical document by taking the retrieval as an example:
Referring to fig. 7A, which is a schematic diagram of a process of searching by an auxiliary customer service system according to an embodiment of the present application, a search interaction process performed in a feasible processing scenario is described below with reference to fig. 7A:
Step 701, the service device sends the imported document to the retrieval device.
Step 702, the retrieval device constructs a retrieval index.
The constructed content comprises a document structure diagram and a core keyword set.
Step 703, the retrieval device sends, to the service device, indication information that construction of the retrieval index is complete.
Step 704, the external device registers the search algorithm module with the retrieval device.
The search function can be implemented by means of the search algorithm module.
Step 705, the external device configures the weights of the search algorithm module on the retrieval device.
Step 706, the retrieval device completes loading of the search algorithm module.
Step 707, the retrieval device sends indication information confirming completion of the loading to the external device.
Step 708, the client device inputs the search text to the service device.
Step 709, the service device sends the search text to the retrieval device.
Step 710, the retrieval device feeds back the basic keywords extracted from the search text to the service device.
Step 711, the service device, based on the obtained basic keywords, provides initial feedback of content related to the basic keywords to the client device.
Step 712, the service device triggers, on the retrieval device, content retrieval based on the set of basic keywords.
Step 713, the retrieval device completes extraction of the search keywords and obtains the clustering results of the initial documents retrieved according to the search keywords.
Step 714, the retrieval device feeds the clustering results of the initial documents back to the service device.
Step 715, the service device processes each clustering result.
Step 716, the service device presents the reply content based on the search text to the client device.
Further, referring to fig. 7B, which is a schematic diagram of a search process in an embodiment of the present application, the following describes the related search process with reference to fig. 7B:
In a specific retrieval process, the retrieval device extracts the search keywords corresponding to one search requirement from the search text, denoted Key1-3. The retrieval device then performs coarse-ranking retrieval according to each search keyword, in combination with a graph database, to obtain the initial documents, denoted Root1-4, where the graph database comprises the document structure diagram determined from the candidate documents in the search database.
Then, according to the search keywords each initial document contains, the initial documents corresponding to each search keyword, namely the hit documents of that search keyword, are determined; for example, the hit documents of Key1 are Root1, Root2 and Root4. Further, from the perspective of the initial documents, the search keywords covered by each initial document are grouped to obtain the hit search keywords corresponding to that initial document; for example, the hit search keywords corresponding to Root1 are Key1 and Key2.
Further, the link relationships between the initial documents are queried in combination with the document structure indicated in the graph database. Assuming that the target node to which Key1 belongs in the document structure diagram, or a child node belonging to that target node, has connecting edges linked to Root2 and Root3, the initial documents Root1 and Root2, and the initial documents Root1 and Root3, can each be clustered, yielding the clustering result illustrated in FIG. 7B.
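The two views described above (hit documents per search keyword, and hit search keywords per initial document) are inverses of each other. A minimal sketch of that inversion follows; the function name `invert_hits` is hypothetical, and only the hits of Key1 and the hit keywords of Root1 come from the example in the text, while the remaining entries are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def invert_hits(hits_by_keyword: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Turn keyword -> hit documents into document -> hit keywords."""
    hits_by_doc = defaultdict(set)
    for keyword, docs in hits_by_keyword.items():
        for doc in docs:
            hits_by_doc[doc].add(keyword)
    return {doc: sorted(keywords) for doc, keywords in hits_by_doc.items()}


# Key1 -> Root1, Root2, Root4 follows the text; Key2 and Key3 hits are assumed.
hits_by_keyword = {
    "Key1": ["Root1", "Root2", "Root4"],
    "Key2": ["Root1", "Root3"],
    "Key3": ["Root3"],
}
hits_by_doc = invert_hits(hits_by_keyword)
```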
In summary, the application provides a method for constructing an automated vertical-domain knowledge base retrieval system. The method automatically extracts core keywords from the textual and structural characteristics of the documents, so that the terms and keywords in vertical-domain documents can be extracted accurately through statistics and structure matching. It also provides a search framework in which multiple search algorithms can be flexibly combined, so that accurate knowledge base retrieval is achieved. Moreover, the technical solution provided by the application can be applied to a wide range of vertical-domain knowledge base search scenarios, including but not limited to technical document search and intelligent customer service.
Based on the same inventive concept, referring to fig. 8, which is a schematic diagram of a logic structure of an extraction device of a search keyword in an embodiment of the present application, the extraction device 800 of the search keyword includes a first obtaining unit 801, a second obtaining unit 802, a screening unit 803, and a determining unit 804, where,
A first obtaining unit 801, configured to obtain a set of basic keywords extracted based on a search text, where the set of basic keywords includes at least one basic keyword for indicating a search requirement;
A second obtaining unit 802, configured to obtain a core keyword set constructed for each candidate document in the corresponding search database, where the core keyword set is obtained by extracting, from each candidate document, a document keyword whose importance meets a set condition, combining document keywords with continuous content by counting coexistence situations and word distances of every two document keywords in a single candidate document, and constructing the core keyword set based on each processed document keyword, where the importance is determined according to word frequencies of the corresponding document keyword in the extracted candidate document and occurrence situations in each candidate document;
A screening unit 803, configured to determine, among the core keywords, the candidate keywords that meet a preset screening condition, according to the similarity between the overall text feature extracted for the set of basic keywords and the word feature extracted for each core keyword;
a determining unit 804, configured to determine at least one basic keyword and each candidate keyword as each search keyword.
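As a hedged sketch of how the screening unit 803 and the determining unit 804 might cooperate, the example below compares one overall feature vector with per-keyword feature vectors. Cosine similarity, the threshold of 0.8, the toy two-dimensional vectors, and the names `cosine`, `select_candidates` and `build_search_keywords` are all assumptions for illustration; the text only requires some feature similarity and some preset screening condition.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_candidates(
    overall_feature: Sequence[float],
    core_features: Dict[str, Sequence[float]],
    threshold: float = 0.8,
) -> List[str]:
    """Keep core keywords whose feature is similar enough to the overall feature."""
    return [
        keyword
        for keyword, feature in core_features.items()
        if cosine(overall_feature, feature) >= threshold
    ]


def build_search_keywords(
    basic: List[str],
    overall_feature: Sequence[float],
    core_features: Dict[str, Sequence[float]],
    threshold: float = 0.8,
) -> List[str]:
    """Basic keywords first, then screened candidates, without duplicates."""
    keywords = list(basic)
    for candidate in select_candidates(overall_feature, core_features, threshold):
        if candidate not in keywords:
            keywords.append(candidate)
    return keywords


# Toy feature vectors; a real system would obtain them from an embedding model.
core_features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
expanded = build_search_keywords(["x"], [1.0, 0.0], core_features)
```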
Optionally, after determining at least one basic keyword and each candidate keyword as each search keyword, the search unit 805 in the apparatus is configured to:
in a search database, according to preset M search modes, combining each search keyword, and searching and determining each initial document respectively associated with at least one search keyword;
Obtaining a constructed document structure diagram, wherein the document structure diagram comprises nodes which are respectively constructed corresponding to various document data blocks in each candidate document and connecting edges used for representing a content link relation and a content attribution relation;
Clustering the initial documents with the link relation according to the document structure diagram to obtain target document sets of various associated search keywords, and respectively generating search results based on the target document sets.
Optionally, in the search database, according to a preset plurality of search modes, when each initial document associated with at least one search keyword is searched and determined by combining each search keyword, the search unit 805 is configured to perform any one of the following operations:
In a search database, according to preset M search modes, combining each search keyword, and after performing M search processes in series, obtaining each initial document which is determined by the last search process and is respectively associated with at least one search keyword;
In the search database, executing M search processes in parallel, each according to one of the M preset search modes and in combination with the search keywords; obtaining the matching documents determined in each search process, each matching document being associated with at least one content matching degree; and screening out the initial documents from the matching documents based on the at least one content matching degree associated with each matching document.
Optionally, the M search modes are acquired by the search unit 805 in at least one of the following modes:
acquiring each self-defined retrieval logic, and determining each retrieval mode according to each retrieval logic;
And acquiring each search module defined by an external system, and acquiring corresponding search modes by loading each search module.
Optionally, in any one retrieval process except the last retrieval process, the retrieval unit 805 performs the following operations:
determining the candidate documents on which the retrieval process is based and the retrieval mode on which it is based;
determining, by adopting the retrieval mode and according to the search keywords, the matching documents among the candidate documents and the corresponding content matching degrees;
for each of the candidate documents on which the retrieval process is based, calculating the accumulated value of its content matching degrees up to the current retrieval process, and screening out a specified number of target documents with the highest accumulated values, where the specified number is determined according to the position of the retrieval process in the execution order.
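The serial variant can be sketched as a loop in which each retrieval process adds its matching degree to a running total and passes only the best documents to the next process. Everything below is illustrative: `serial_retrieve`, the two toy retrieval modes, and the per-stage keep counts are assumptions, and for simplicity the sketch applies the same rule to every stage, including the last.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A retrieval mode maps (document text, search keywords) to a matching degree.
Mode = Callable[[str, Sequence[str]], float]


def keyword_coverage(text: str, keywords: Sequence[str]) -> float:
    """Toy mode: number of distinct search keywords present in the text."""
    return float(sum(1 for keyword in keywords if keyword in text))


def keyword_frequency(text: str, keywords: Sequence[str]) -> float:
    """Toy mode: total number of keyword occurrences in the text."""
    return float(sum(text.count(keyword) for keyword in keywords))


def serial_retrieve(
    docs: Dict[str, str], keywords: Sequence[str], stages: List[Tuple[Mode, int]]
) -> List[str]:
    """Run the stages in order, accumulating matching degrees per document."""
    candidates = list(docs)
    cumulative = {doc_id: 0.0 for doc_id in candidates}
    for mode, keep in stages:
        for doc_id in candidates:
            cumulative[doc_id] += mode(docs[doc_id], keywords)
        # Keep the `keep` documents with the highest accumulated value.
        candidates = sorted(candidates, key=lambda d: (-cumulative[d], d))[:keep]
    return candidates


docs = {"d1": "alpha beta", "d2": "alpha alpha alpha alpha", "d3": "gamma"}
stages = [(keyword_coverage, 2), (keyword_frequency, 1)]
survivors = serial_retrieve(docs, ["alpha", "beta"], stages)
```

In this toy run d1 leads after the first stage (2.0 against 1.0) but d2 overtakes it on the accumulated value after the second stage (5.0 against 4.0), which is the effect of ranking by the running total rather than by the latest stage alone.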
Optionally, when each initial document is selected from the matching documents based on at least one content matching degree associated with each matching document, the retrieving unit 805 is configured to:
determining at least one content matching degree associated with one matching document, determining target retrieval modes corresponding to the at least one content matching degree, and determining a matching degree fusion value corresponding to the matching document according to mode weights respectively preset for M retrieval modes and combining the content matching degrees in the at least one target retrieval mode;
and determining each matching document with the corresponding matching degree fusion value reaching the preset condition as each initial document.
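For the parallel variant, the matching degree fusion value can be read as a weighted sum over the retrieval modes that actually matched a document. The sketch below assumes that reading; the mode names `mode_a` and `mode_b`, the weights, the threshold used as the "preset condition", and the function names are hypothetical.

```python
from typing import Dict, List


def fuse_matching_degrees(
    matches_by_mode: Dict[str, Dict[str, float]], mode_weights: Dict[str, float]
) -> Dict[str, float]:
    """Weighted sum of each document's matching degrees over the modes that hit it."""
    fused: Dict[str, float] = {}
    for mode, matches in matches_by_mode.items():
        for doc_id, degree in matches.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + mode_weights[mode] * degree
    return fused


def select_initial_documents(fused: Dict[str, float], threshold: float) -> List[str]:
    """Documents whose fusion value reaches the (assumed) threshold condition."""
    return sorted(doc_id for doc_id, value in fused.items() if value >= threshold)


# Hypothetical per-mode results; d2 and d3 are each matched by only one mode.
matches_by_mode = {
    "mode_a": {"d1": 1.0, "d2": 0.5},
    "mode_b": {"d1": 0.5, "d3": 1.0},
}
mode_weights = {"mode_a": 0.5, "mode_b": 0.5}
fused = fuse_matching_degrees(matches_by_mode, mode_weights)
initial_documents = select_initial_documents(fused, threshold=0.5)
```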
Optionally, when clustering the initial documents with the link relationship according to the document structure diagram, the retrieving unit 805 is configured to:
for each initial document, the following is performed:
Determining at least one hit keyword contained in one initial document among the search keywords;
For each hit keyword, determining a target node to which the hit keyword belongs in a document structure diagram, and clustering one initial document with other initial documents when determining that the target node and child nodes belonging to the target node have connecting edges linked to other initial documents.
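The clustering rule above can be sketched over a very small graph representation. The dictionaries `node_of_hit`, `children` and `links`, the node identifiers such as "Root1/sec1", and the function names are assumptions chosen for the example; the application stores this information in a graph database whose schema is not reproduced here.

```python
from typing import Dict, List, Set, Tuple


def subtree(node: str, children: Dict[str, List[str]]) -> Set[str]:
    """The target node together with all nodes that belong to it."""
    seen: Set[str] = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(children.get(current, []))
    return seen


def cluster_pairs(
    hits_by_doc: Dict[str, List[str]],
    node_of_hit: Dict[Tuple[str, str], str],
    children: Dict[str, List[str]],
    links: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Pairs (document, linked document) that the rule would cluster together."""
    initial = set(hits_by_doc)
    pairs: Set[Tuple[str, str]] = set()
    for doc, keywords in hits_by_doc.items():
        for keyword in keywords:
            target = node_of_hit.get((doc, keyword))
            if target is None:
                continue
            for node in subtree(target, children):
                for linked in links.get(node, []):
                    # Only links that reach another initial document count.
                    if linked in initial and linked != doc:
                        pairs.add((doc, linked))
    return sorted(pairs)


hits_by_doc = {"Root1": ["Key1", "Key2"], "Root2": ["Key1"], "Root3": ["Key2"], "Root4": ["Key1"]}
node_of_hit = {("Root1", "Key1"): "Root1/sec1"}
children = {"Root1/sec1": ["Root1/sec1/p1"]}
links = {"Root1/sec1": ["Root2"], "Root1/sec1/p1": ["Root3", "Root9"]}
clusters = cluster_pairs(hits_by_doc, node_of_hit, children, links)
```

The link to "Root9" is ignored because Root9 is not among the initial documents.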
Optionally, before acquiring a set of basic keywords extracted based on the search text, the first acquiring unit 801 is configured to:
responding to a search request triggered by a target object aiming at a search text, analyzing the search text to obtain each basic keyword and semantic roles corresponding to each basic keyword, wherein the semantic roles are used for indicating predicates and various modifiers associated with the predicates;
determining, according to the semantic roles of the basic keywords, the basic keywords that correspond to predicates, and performing the following operation for each basic keyword that corresponds to a predicate: constructing a corresponding basic keyword phrase based on the basic keyword corresponding to the predicate and the basic keywords used for modifying that predicate.
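A minimal sketch of the phrase construction follows, assuming the semantic role labelling has already produced, for every basic keyword, its role and the predicate it modifies. The tuple layout, the role labels ("PRED", "ARG1", "ARGM-MNR") and the function name `build_phrases` are illustrative assumptions and do not describe the output format of any particular SRL model.

```python
from typing import Dict, List, Optional, Tuple

# (keyword text, semantic role, predicate it modifies or None for a predicate)
ParsedKeyword = Tuple[str, str, Optional[str]]


def build_phrases(parsed: List[ParsedKeyword]) -> Dict[str, List[str]]:
    """Group each predicate with the basic keywords that modify it."""
    phrases: Dict[str, List[str]] = {}
    for text, role, _head in parsed:
        if role == "PRED":
            phrases.setdefault(text, [text])
    for text, role, head in parsed:
        if role != "PRED" and head in phrases:
            phrases[head].append(text)
    return phrases


# Hypothetical analysis of a search text containing two requirements.
parsed = [
    ("reset", "PRED", None),
    ("password", "ARG1", "reset"),
    ("quickly", "ARGM-MNR", "reset"),
    ("open", "PRED", None),
    ("account", "ARG1", "open"),
]
phrases = build_phrases(parsed)
```

Each resulting phrase then corresponds to one set of basic keywords indicating one search requirement.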
Optionally, before acquiring a set of basic keywords extracted based on the search text, the second acquisition unit 802 constructs a core keyword set in the following manner:
Extracting document keywords with importance meeting a set condition from each candidate document respectively;
dividing every two coexisting document keywords in a single candidate document into a document keyword group;
For each document keyword group, calculating the average word distance between its two document keywords over the candidate documents in which both appear, and merging the two document keywords into one document keyword when the two document keywords are determined to be contiguous based on the average word distance;
And constructing a core keyword set based on each processed document keyword.
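The merging step might be sketched as below. The definition of word distance (the smallest forward gap, in tokens, from the first keyword to the second), the contiguity condition (average distance of at most 1), and the names `pair_distance` and `merge_contiguous` are assumptions; the text does not fix these details.

```python
from itertools import permutations
from typing import List, Optional


def pair_distance(tokens: List[str], first: str, second: str) -> Optional[int]:
    """Smallest forward gap from an occurrence of `first` to a later `second`."""
    first_positions = [i for i, token in enumerate(tokens) if token == first]
    second_positions = [i for i, token in enumerate(tokens) if token == second]
    gaps = [j - i for i in first_positions for j in second_positions if j > i]
    return min(gaps) if gaps else None


def merge_contiguous(
    docs: List[List[str]], keywords: List[str], max_average: float = 1.0
) -> List[str]:
    """Merged phrases for ordered keyword pairs that are contiguous on average."""
    merged = []
    for first, second in permutations(keywords, 2):
        distances = [pair_distance(tokens, first, second) for tokens in docs]
        gaps = [gap for gap in distances if gap is not None]
        if gaps and sum(gaps) / len(gaps) <= max_average:
            merged.append(f"{first} {second}")
    return merged


# Toy tokenised candidate documents.
docs = [
    ["graph", "database", "stores", "nodes"],
    ["a", "graph", "database", "index"],
]
merged = merge_contiguous(docs, ["graph", "database", "nodes"])
```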
Optionally, when constructing the core keyword set based on the processed document keywords, the second obtaining unit 802 is configured to:
for each processed document keyword, counting the total number of other processed document keywords that coexist with that document keyword in a single candidate document;
Taking the processed document keywords as graph nodes, and establishing directed connecting edges between graph nodes whose total numbers of associated other keywords differ, to obtain a constructed keyword graph;
In the keyword graph, merging every three graph nodes that meet a merging condition into one graph node corresponding to the merged keyword, where the merging condition is that the three graph nodes satisfy a preset three-node structure and the merged keyword obtained from the corresponding three document keywords exists in the candidate documents;
and constructing a core keyword set based on keywords corresponding to each graph node in the keyword graph.
Optionally, the three-node structure includes any one of the following:
two graph nodes point to one and the same graph node;
the two graph nodes point to one and the same graph node, and a connecting edge exists between the two graph nodes.
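Detecting the two three-node structures in a directed keyword graph can be sketched as follows. The edge list is taken as given input, since the rule for orienting edges is not restated here; the function name `find_three_node_structures` and the toy edges are assumptions. The further condition that the merged keyword must actually occur in the candidate documents is not checked in this sketch.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (source node, destination node)


def find_three_node_structures(edges: List[Edge]) -> List[Tuple[str, str, str, bool]]:
    """Return (a, b, target, linked) where a and b both point to target.

    `linked` is True when a connecting edge also exists between a and b,
    which corresponds to the second structure listed above.
    """
    predecessors: Dict[str, Set[str]] = {}
    for source, destination in edges:
        predecessors.setdefault(destination, set()).add(source)
    edge_set = set(edges)
    found = []
    for target in sorted(predecessors):
        for a, b in combinations(sorted(predecessors[target]), 2):
            linked = (a, b) in edge_set or (b, a) in edge_set
            found.append((a, b, target, linked))
    return found


# Toy keyword graph.
edges = [("x", "z"), ("y", "z"), ("x", "y"), ("p", "q")]
structures = find_three_node_structures(edges)
```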
Optionally, the importance of a document keyword is determined by the second obtaining unit 802 in the following manner:
Calculating the initial word frequency of the document keyword in the corresponding candidate document, and the inverse document frequency of the document keyword over the candidate documents;
Limiting the value of the initial word frequency to obtain a target word frequency, and taking the product of the target word frequency and the inverse document frequency as the importance value of the document keyword.
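A minimal sketch of this importance calculation follows. The relative word frequency, the upper cap `tf_cap` used as the value limit, the logarithmic form of the inverse document frequency, and the name `keyword_importance` are assumptions for illustration; the text only states that the word frequency is value-limited and multiplied by the inverse document frequency.

```python
import math
from typing import List


def keyword_importance(
    keyword: str, doc_tokens: List[str], corpus: List[List[str]], tf_cap: float = 0.5
) -> float:
    """Capped term frequency multiplied by inverse document frequency."""
    if not doc_tokens:
        return 0.0
    initial_tf = doc_tokens.count(keyword) / len(doc_tokens)
    target_tf = min(initial_tf, tf_cap)  # value limiting (assumed to be an upper cap)
    doc_freq = sum(1 for tokens in corpus if keyword in tokens)
    if doc_freq == 0:
        return 0.0
    idf = math.log(len(corpus) / doc_freq)  # assumed IDF form
    return target_tf * idf


# Toy tokenised candidate documents.
corpus = [["a", "a", "a", "b"], ["b", "c"], ["c", "d"], ["d", "d"]]
importance_a = keyword_importance("a", corpus[0], corpus)
importance_b = keyword_importance("b", corpus[0], corpus)
```

Capping the word frequency keeps a keyword that is repeated many times in a single document from dominating keywords that are rarer but equally characteristic.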
For convenience of description, the above parts are described as being functionally divided into modules (or units) respectively. Of course, the functions of each module (or unit) may be implemented in the same piece or pieces of software or hardware when implementing the present application.
Having described the method and apparatus for extracting a search keyword according to an exemplary embodiment of the present application, next, an electronic device according to another exemplary embodiment of the present application is described.
Those skilled in the art will appreciate that the various aspects of the application may be implemented as a system, method, or program product. Accordingly, aspects of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein collectively as a "circuit," "module," or "system."
An embodiment of the application also provides an electronic device based on the same inventive concept as the method embodiments. Referring to FIG. 9, which is a schematic diagram of the hardware composition of an electronic device to which an embodiment of the present application is applied, in one embodiment the electronic device may be the search device 110 shown in FIG. 1. In this embodiment, the electronic device may be configured as shown in FIG. 9, including a memory 901, a communication module 903, and one or more processors 902.
A memory 901 for storing a computer program executed by the processor 902. The memory 901 may mainly include a storage program area for storing an operating system, programs required for running an instant communication function, and the like, and a storage data area for storing various instant communication information, an operation instruction set, and the like.
The memory 901 may be a volatile memory, such as a random-access memory (RAM); a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or any other medium that can be used to carry or store a desired computer program in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 901 may also be a combination of the above memories.
The processor 902 may include one or more central processing units (CPUs), digital processing units, or the like. The processor 902 is configured to implement the extraction method of the search keyword when calling the computer program stored in the memory 901.
The communication module 903 is used to communicate with the client device and the server.
The specific connection medium between the memory 901, the communication module 903, and the processor 902 is not limited in the embodiment of the present application. In FIG. 9, the memory 901 and the processor 902 are connected by a bus 904, which is depicted as a thick line; the connections between the other components are merely illustrative and not limiting. The bus 904 may be divided into an address bus, a data bus, a control bus, and the like. For ease of description, only one thick line is depicted in FIG. 9, but this does not mean that there is only one bus or only one type of bus.
The memory 901 stores a computer storage medium in which computer executable instructions for implementing the retrieval keyword extraction method according to the embodiment of the present application are stored. The processor 902 is configured to execute the above-described extraction method of the search keyword, as shown in fig. 2.
In another embodiment, the electronic device may be another electronic device, and referring to fig. 10, a schematic diagram of a hardware composition of another electronic device to which the embodiment of the present application is applied, where the electronic device may specifically be the client device 130 shown in fig. 1. In this embodiment, the electronic device may be configured as shown in FIG. 10, including a communication component 1010, a memory 1020, a display unit 1030, a camera 1040, a sensor 1050, an audio circuit 1060, a Bluetooth module 1070, a processor 1080, and the like.
The communication component 1010 is used for communicating with a server. In some embodiments, it may include a wireless fidelity (WiFi) module. WiFi is a short-range wireless transmission technology, and the electronic device may help the user send and receive information through the WiFi module.
Memory 1020 may be used to store software programs and data. Processor 1080 performs various functions and data processing of client device 130 by executing software programs or data stored in memory 1020. The memory 1020 in the present application may store an operating system and various application programs, and may also store a computer program for executing the extraction method of the search keyword according to the embodiment of the present application.
The display unit 1030 may be used to display information entered by a user or provided to a user, as well as a graphical user interface (GUI) of the various menus of the client device 130. In particular, the display unit 1030 may include a display screen 1032 disposed on the front of the client device 130. The display unit 1030 may be used to display a page or the like that triggers a search operation in the embodiment of the present application.
The display unit 1030 may also be used to receive input numeric or character information, generate signal inputs related to user settings and function control of the client device 130, and in particular, the display unit 1030 may include a touch screen 1031 disposed on a front side of the client device 130, and may collect touch operations thereon or thereabout by a user.
The touch screen 1031 may cover the display screen 1032, or the touch screen 1031 may be integrated with the display screen 1032 to implement the input and output functions of the client device 130; after integration, the combined component may be referred to simply as a touch display screen. The display unit 1030 may display an application program and the corresponding operation steps in the present application.
The camera 1040 may be used to capture still images, and the user may comment on the image captured by the camera 1040 through the application. An object generates an optical image through the lens, and the optical image is projected onto the photosensitive element. The photosensitive element may be a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then passed to the processor 1080 for conversion into a digital image signal.
The client device may also include at least one sensor 1050, such as an acceleration sensor 1051, a distance sensor 1052, a fingerprint sensor 1053, and a temperature sensor 1054. The client device may also be configured with other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, light sensors, motion sensors, and the like.
The audio circuit 1060, the speaker 1061, and the microphone 1062 may provide an audio interface between a user and the client device 130. The audio circuit 1060 may convert received audio data into an electrical signal and transmit it to the speaker 1061, which converts it into a sound signal for output. Conversely, the microphone 1062 converts collected sound signals into electrical signals, which the audio circuit 1060 receives and converts into audio data; the audio data is then output to the communication component 1010 for transmission to, for example, another client device 130, or output to the memory 1020 for further processing.
The bluetooth module 1070 is used for exchanging information with other bluetooth devices having a bluetooth module through a bluetooth protocol.
Processor 1080 is a control center of the client device and connects the various parts of the overall terminal using various interfaces and lines, performs various functions of the client device and processes data by running or executing software programs stored in memory 1020 and invoking data stored in memory 1020. In some embodiments, processor 1080 may include at least one processing unit and processor 1080 may further integrate an application processor and a baseband processor. Processor 1080 of the present application may run an operating system, an application program, a user interface display, a touch response, and a method for extracting a search keyword according to an embodiment of the present application. In addition, a processor 1080 is coupled to the display unit 1030.
In some possible embodiments, aspects of the method for extracting a search keyword provided by the present application may also be implemented in the form of a program product, which includes a computer program for causing an electronic device to perform the steps in the method for extracting a search keyword according to the various exemplary embodiments of the present application described above when the program product is run on the electronic device, for example, the electronic device may perform the steps as shown in fig. 2.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of a readable storage medium include an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of embodiments of the present application may take the form of a portable compact disc read only memory (CD-ROM) and comprise a computer program and may be run on an electronic device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a command execution system, apparatus, or device.
The readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave in which a readable computer program is embodied. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
A computer program embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer programs for performing the operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer program may execute entirely on the user's electronic device, partly on the user's electronic device, as a stand-alone software package, partly on the user's electronic device and partly on a remote electronic device, or entirely on the remote electronic device or a server. Where a remote electronic device is involved, the remote electronic device may be connected to the user's electronic device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external electronic device (for example, through the Internet using an Internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the elements described above may be embodied in one element in accordance with embodiments of the present application. Conversely, the features and functions of one unit described above may be further divided into a plurality of units to be embodied.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having a computer-usable computer program embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program commands may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the commands executed by the processor of the computer or other programmable data processing apparatus produce means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (18)

1. A method for extracting search keywords, characterized by comprising the following steps:
acquiring a group of basic keywords extracted based on a search text, wherein the group of basic keywords comprises at least one basic keyword for indicating a search requirement;
Extracting document keywords with importance meeting a set condition from each candidate document, merging document keywords with continuous content by counting coexistence situations and word distances of every two document keywords in a single candidate document, and constructing the core keyword set based on each processed document keyword, wherein the importance is determined according to word frequency of the corresponding document keyword in the extracted candidate document and occurrence situations in each candidate document;
according to the similarity between the overall text features extracted by the group of basic keywords and the word features extracted by the core keywords, determining candidate keywords meeting preset screening conditions in the core keywords;
And determining the at least one basic keyword and the candidate keywords as search keywords.
2. The method of claim 1, wherein after determining the at least one basic keyword and the candidate keywords as the search keywords, the method further comprises:
searching the search database, according to M preset search modes and using the search keywords, to determine initial documents each associated with at least one search keyword;
obtaining a constructed document structure diagram, wherein the document structure diagram comprises nodes constructed for the various document data blocks in each candidate document, and connecting edges representing content link relations and content attribution relations;
and clustering the initial documents that have a link relation, according to the document structure diagram, to obtain target document sets each associated with search keywords, and generating a search result based on each target document set.
3. The method of claim 2, wherein searching the search database, according to the M preset search modes and using the search keywords, to determine the initial documents each associated with at least one search keyword comprises any one of the following operations:
executing M search processes in series in the search database, according to the M preset search modes and using the search keywords, and obtaining the initial documents, each associated with at least one search keyword, that are determined by the last search process;
executing M search processes in parallel in the search database, each according to one of the M preset search modes and using the search keywords, obtaining the matching documents determined in each search process together with their associated content matching degrees, and screening the initial documents out of the matching documents based on the at least one content matching degree associated with each matching document.
4. The method of claim 3, wherein the M search modes are obtained in at least one of the following ways:
acquiring custom search logic, and determining the corresponding search modes according to that search logic;
acquiring search modules defined by an external system, and obtaining the corresponding search modes by loading the search modules.
5. The method of claim 3, wherein in any search process other than the last search process, the following operations are performed:
determining the candidate documents on which the search process is based and the search mode it uses;
using the search mode, determining, among the candidate documents, the matching documents and their content matching degrees according to the search keywords;
and calculating, for each of those candidate documents, the accumulated value of its content matching degrees up to the current search process, and selecting a specified number of target documents with the highest accumulated values as the candidate documents for the next search process, wherein the specified number is determined according to the position of the search process in the execution order.
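The serial search of claim 5 can be illustrated, purely as a sketch, by a cascade in which each stage adds its content matching degree to a running total and keeps only the top-scoring documents for the next stage. The two scoring functions `keyword_hits` and `keyword_density`, the `stages` structure, and the per-stage keep counts are hypothetical names and values not taken from the application; the sketch also applies the same top-k selection to the final stage, which is an assumption.

```python
def keyword_hits(text, keywords):
    """Hypothetical search mode 1: number of keywords contained in the text."""
    return float(sum(1 for kw in keywords if kw in text))


def keyword_density(text, keywords):
    """Hypothetical search mode 2: share of tokens that are keywords."""
    tokens = text.split()
    return sum(1 for t in tokens if t in keywords) / len(tokens) if tokens else 0.0


def cascade(docs, keywords, stages):
    """Run the stages in series.

    docs:   mapping of document id to text
    stages: list of (search_mode, keep) pairs; `keep` is the number of
            documents passed on, chosen per position in the execution order
    """
    totals = {doc_id: 0.0 for doc_id in docs}
    candidates = list(docs)
    for search_mode, keep in stages:
        for doc_id in candidates:
            # Accumulate the content matching degree up to the current stage.
            totals[doc_id] += search_mode(docs[doc_id], keywords)
        candidates = sorted(candidates, key=lambda d: totals[d], reverse=True)[:keep]
    return candidates


docs = {
    "d1": "reset password via email link",
    "d2": "password policy",
    "d3": "export invoice",
}
initial_docs = cascade(
    docs, ["reset", "password"], [(keyword_hits, 2), (keyword_density, 1)]
)
```

Shrinking `keep` from stage to stage lets cheap modes prune the candidate pool before more expensive modes run, which is one plausible reason for tying the number to the execution order.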
6. The method of claim 3, wherein screening the initial documents out of the matching documents based on the at least one content matching degree associated with each matching document comprises:
for each matching document, determining the at least one content matching degree associated with the matching document and the target search mode corresponding to each such content matching degree, and determining a matching degree fusion value for the matching document according to the mode weights preset for the M search modes, combined with the content matching degrees under the at least one target search mode;
and determining the matching documents whose matching degree fusion values meet a preset condition as the initial documents.
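As a non-limiting sketch of the fusion in claim 6, the matching degree fusion value can be computed as a weighted sum of the content matching degrees a document received under the search modes that matched it. The weighted-sum form, the mode names `"keyword"` and `"semantic"`, the weights, and the threshold used as the preset condition are assumptions for illustration only.

```python
def fuse(scores_by_mode, mode_weights, threshold):
    """Fuse per-mode content matching degrees into one value per document.

    scores_by_mode: mapping of mode name to {document id: matching degree}
    mode_weights:   mapping of mode name to its preset weight
    threshold:      assumed form of the preset condition (fused value >= threshold)
    """
    fused = {}
    for mode, scores in scores_by_mode.items():
        for doc_id, degree in scores.items():
            # Only modes that actually matched the document contribute.
            fused[doc_id] = fused.get(doc_id, 0.0) + mode_weights[mode] * degree
    return {doc_id: value for doc_id, value in fused.items() if value >= threshold}


parallel_results = {
    "keyword": {"d1": 0.8, "d2": 0.4},
    "semantic": {"d1": 0.6, "d3": 0.9},
}
initial_docs = fuse(parallel_results, {"keyword": 0.5, "semantic": 0.5}, threshold=0.4)
```

Here "d2" is matched by only one mode with a low degree and is dropped, while "d3" survives on the strength of a single high-scoring mode.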
7. The method of claim 2, wherein clustering the initial documents that have a link relation according to the document structure diagram comprises:
for each initial document, performing the following operations:
determining, among the search keywords, at least one hit keyword contained in the initial document;
for each hit keyword, determining the target node to which the hit keyword belongs in the document structure diagram, and clustering the initial document with other initial documents when it is determined that the target node and a child node belonging to the target node have connecting edges linked to the other initial documents.
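The clustering of claim 7 can be sketched with a union-find structure over the initial documents. The data layout below (`hits`, `node_of`, `children`, `links`) is hypothetical, and the sketch reads the condition as "the target node or any of its child nodes links to another initial document", which is one possible interpretation rather than the definitive one.

```python
def cluster_initial_docs(initial_docs, hits, node_of, children, links):
    """Group initial documents connected through link edges.

    hits:     document id -> hit keywords found in that document
    node_of:  (document id, keyword) -> target node id in the structure diagram
    children: node id -> child node ids (content attribution edges)
    links:    node id -> ids of documents the node links to (content link edges)
    """
    parent = {doc: doc for doc in initial_docs}

    def find(doc):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]
            doc = parent[doc]
        return doc

    for doc in initial_docs:
        for keyword in hits.get(doc, ()):
            target = node_of[(doc, keyword)]
            for node in [target, *children.get(target, ())]:
                for other in links.get(node, ()):
                    if other in parent:  # only cluster with other initial documents
                        parent[find(other)] = find(doc)

    groups = {}
    for doc in initial_docs:
        groups.setdefault(find(doc), set()).add(doc)
    return sorted(sorted(group) for group in groups.values())


clusters = cluster_initial_docs(
    initial_docs=["A", "B", "C"],
    hits={"A": ["cache"]},
    node_of={("A", "cache"): "A/sec1"},
    children={"A/sec1": ["A/sec1/p1"]},
    links={"A/sec1/p1": {"B"}},
)
```

Document "A" hits the keyword in a section whose child paragraph links to "B", so the two are clustered, while "C" stays in its own target document set.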
8. The method of any one of claims 1-7, wherein before acquiring the group of basic keywords extracted based on the search text, the method further comprises:
in response to a search request triggered by a target object for the search text, parsing the search text to obtain the basic keywords and the semantic role of each basic keyword, wherein the semantic roles indicate predicates and the various modifiers associated with the predicates;
determining, according to the semantic roles of the basic keywords, the basic keywords that correspond to predicates, and for each basic keyword corresponding to a predicate, constructing a corresponding basic keyword group from the basic keyword corresponding to that predicate and the basic keywords that modify that predicate.
9. The method of any one of claims 1-7, wherein before acquiring the group of basic keywords extracted based on the search text, the core keyword set is constructed as follows:
extracting, from each candidate document, document keywords whose importance meets a set condition;
dividing every two document keywords that coexist in a single candidate document into a document keyword group;
for the two document keywords of each document keyword group, determining the average word distance between them across the candidate documents in which they appear, and merging the two document keywords into one document keyword when they are determined, based on the average word distance, to be continuous;
and constructing the core keyword set based on the processed document keywords.
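The merging step of claim 9 can be illustrated with the following sketch. It assumes that documents are already tokenized, that "word distance" is the signed offset from an occurrence of the first keyword to the nearest occurrence of the second, and that two keywords count as continuous when the average offset is within a `tolerance` of 1; these definitions and the space used to join merged keywords are assumptions, not taken from the application.

```python
from itertools import permutations


def mean_gap(first, second, docs):
    """Average signed offset from `first` to the nearest `second`."""
    gaps = []
    for tokens in docs:
        pos_first = [i for i, t in enumerate(tokens) if t == first]
        pos_second = [i for i, t in enumerate(tokens) if t == second]
        if not pos_first or not pos_second:
            continue  # the pair does not coexist in this document
        for i in pos_first:
            gaps.append(min((j - i for j in pos_second), key=abs))
    return sum(gaps) / len(gaps) if gaps else None


def merge_continuous(keywords, docs, tolerance=0.2):
    """Merge ordered keyword pairs whose average gap is close to 1."""
    merged, used = [], set()
    for first, second in permutations(keywords, 2):
        if first in used or second in used:
            continue
        gap = mean_gap(first, second, docs)
        if gap is not None and abs(gap - 1) <= tolerance:
            merged.append(f"{first} {second}")
            used.update((first, second))
    return merged + [k for k in keywords if k not in used]


docs = [
    ["vector", "database", "stores", "index"],
    ["fast", "vector", "database", "index"],
]
processed = merge_continuous(["vector", "database", "index"], docs)
```

Because "database" immediately follows "vector" in both documents, the pair is merged into a single document keyword, while "index" is left unchanged.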
10. The method of claim 9, wherein constructing the core keyword set based on the processed document keywords comprises:
for each processed document keyword, counting the total number of other processed document keywords that coexist with it in a single candidate document;
taking the processed document keywords as graph nodes, and establishing directed connecting edges between graph nodes whose associated totals of other keywords differ, to obtain a constructed keyword graph;
merging, in the keyword graph, every three graph nodes that meet a merging condition into one graph node corresponding to the merged keyword, wherein the merging condition is that the three graph nodes conform to a preset three-node structure and the merged keyword formed from the corresponding three document keywords exists in the candidate documents;
and constructing the core keyword set based on the keywords corresponding to the graph nodes in the keyword graph.
11. The method of claim 10, wherein the three-node structure comprises any one of:
two graph nodes point to one and the same graph node;
the two graph nodes point to one and the same graph node, and a connecting edge exists between the two graph nodes.
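The keyword graph of claim 10 and the two three-node structures of claim 11 can be sketched as below. The claims do not state the direction of the edges or which node pairs are connected, so this sketch assumes an edge is drawn only between keywords that coexist in some document, pointing from the keyword with fewer coexisting keywords to the one with more. It only detects the structures; checking that the merged keyword actually occurs in the candidate documents is left out.

```python
from itertools import combinations


def cooccurrence(keywords, docs):
    """Map each keyword to the set of other keywords it coexists with."""
    neighbours = {k: set() for k in keywords}
    for doc in docs:
        present = [k for k in keywords if k in doc]
        for a, b in combinations(present, 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


def build_edges(neighbours):
    """Directed edges from lower to higher coexistence totals (assumed direction)."""
    edges = set()
    for a, linked in neighbours.items():
        for b in linked:
            if len(linked) < len(neighbours[b]):
                edges.add((a, b))
    return edges


def converging_triples(edges):
    """Find (u, v, w, linked): u and v both point to w; `linked` is True
    when an edge also exists between u and v (second structure of claim 11)."""
    predecessors = {}
    for source, target in edges:
        predecessors.setdefault(target, set()).add(source)
    triples = []
    for w, sources in predecessors.items():
        for u, v in combinations(sorted(sources), 2):
            linked = (u, v) in edges or (v, u) in edges
            triples.append((u, v, w, linked))
    return sorted(triples)


docs = [
    {"graph", "neural", "network"},
    {"neural", "network", "layer"},
    {"network", "loss"},
]
keywords = ["graph", "neural", "network", "layer", "loss"]
triples = converging_triples(build_edges(cooccurrence(keywords, docs)))
```

In this example "graph" and "neural" both point to "network" and are themselves connected, matching the second structure, while "graph" and "layer" point to "neural" without being connected, matching the first.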
12. The method of any one of claims 1-7, wherein the importance of a document keyword is determined by:
calculating the initial word frequency of the document keyword in its candidate document and the inverse document frequency of the document keyword across the candidate documents;
and obtaining a target word frequency by limiting the value of the initial word frequency, and taking the product of the target word frequency and the inverse document frequency as the importance of the document keyword.
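A minimal sketch of the importance calculation in claim 12 follows. The claim only says the word frequency is value-limited and multiplied by the inverse document frequency; the cap `tf_cap` and the unsmoothed form log(N / df) used here are assumptions chosen for illustration.

```python
import math


def importance(term, doc_tokens, corpus, tf_cap=3):
    """Capped term frequency multiplied by inverse document frequency.

    doc_tokens: tokens of the candidate document the keyword is extracted from
    corpus:     list of token lists, one per candidate document
    tf_cap:     assumed upper limit applied to the initial word frequency
    """
    initial_tf = doc_tokens.count(term)
    target_tf = min(initial_tf, tf_cap)  # value limiting
    doc_freq = sum(1 for tokens in corpus if term in tokens)
    idf = math.log(len(corpus) / doc_freq) if doc_freq else 0.0
    return target_tf * idf


corpus = [
    ["cache"] * 5 + ["miss"],
    ["miss", "rate"],
    ["miss", "disk"],
]
cache_score = importance("cache", corpus[0], corpus)
miss_score = importance("miss", corpus[0], corpus)
```

Capping the frequency keeps a keyword that is repeated many times in one long document from dominating: "cache" occurs five times but is scored as if it occurred three times, and "miss", which appears in every document, scores zero.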
13. An apparatus for extracting search keywords, comprising:
a first acquisition unit, configured to acquire a group of basic keywords extracted based on a search text, wherein the group of basic keywords comprises at least one basic keyword for indicating a search requirement;
a second acquisition unit, configured to acquire a core keyword set constructed for the candidate documents in a search database, wherein the core keyword set is obtained by: extracting, from each candidate document, document keywords whose importance meets a set condition; merging document keywords whose content is continuous, by counting the coexistence and the word distance of every two document keywords within a single candidate document; and constructing the core keyword set based on the processed document keywords, wherein the importance is determined according to the word frequency of the document keyword in the candidate document from which it is extracted and its occurrence across the candidate documents;
a screening unit, configured to determine, among the core keywords, candidate keywords that meet a preset screening condition, according to the similarity between the overall text feature extracted from the group of basic keywords and the word feature extracted from each core keyword;
and a determining unit, configured to determine the at least one basic keyword and the candidate keywords as the search keywords.
14. The apparatus of claim 13, wherein after the at least one basic keyword and the candidate keywords are determined as the search keywords, a search unit in the apparatus is configured to:
search the search database, according to M preset search modes and using the search keywords, to determine initial documents each associated with at least one search keyword;
obtain a constructed document structure diagram, wherein the document structure diagram comprises nodes constructed for the various document data blocks in each candidate document, and connecting edges representing content link relations and content attribution relations;
and cluster the initial documents that have a link relation, according to the document structure diagram, to obtain target document sets each associated with search keywords, and generate a search result based on each target document set.
15. The apparatus of claim 14, wherein when clustering the initial documents that have a link relation according to the document structure diagram, the search unit is configured to:
for each initial document, perform the following operations:
determine, among the search keywords, at least one hit keyword contained in the initial document;
for each hit keyword, determine the target node to which the hit keyword belongs in the document structure diagram, and cluster the initial document with other initial documents when it is determined that the target node and a child node belonging to the target node have connecting edges linked to the other initial documents.
16. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method of any one of claims 1-12.
17. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method according to any of the claims 1-12.
18. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-12.
CN202410702983.7A 2024-05-31 2024-05-31 A method, apparatus, electronic device, and storage medium for extracting search keywords. Pending CN121071128A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410702983.7A CN121071128A (en) 2024-05-31 2024-05-31 A method, apparatus, electronic device, and storage medium for extracting search keywords.

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410702983.7A CN121071128A (en) 2024-05-31 2024-05-31 A method, apparatus, electronic device, and storage medium for extracting search keywords.

Publications (1)

Publication Number Publication Date
CN121071128A (en) 2025-12-05

Family

ID=97840782

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410702983.7A Pending CN121071128A (en) 2024-05-31 2024-05-31 A method, apparatus, electronic device, and storage medium for extracting search keywords.

Country Status (1)

Country Link
CN (1) CN121071128A (en)

Similar Documents

Publication Publication Date Title
JP6894534B2 (en) Information processing method and terminal, computer storage medium
CN110909182B (en) Multimedia resource searching method, device, computer equipment and storage medium
US9239875B2 (en) Method for disambiguated features in unstructured text
CN110597963B (en) Expression question-answering library construction method, expression search device and storage medium
TW202020691A (en) Feature word determination method and device and server
WO2013170587A1 (en) Multimedia question and answer system and method
CN111190997A (en) Question-answering system implementation method using neural network and machine learning sequencing algorithm
CN115114395B (en) Content retrieval and model training method and device, electronic equipment and storage medium
CN109325201A (en) Method, device, device and storage medium for generating entity relationship data
CN113806588B (en) Method and device for searching videos
CN116414961A (en) Question answering method and system based on knowledge graph in military field
CN113704623A (en) Data recommendation method, device, equipment and storage medium
CN118861193B (en) Search term analysis model data processing method, device and computer equipment
CN110209781B (en) Text processing method and device and related equipment
WO2025092584A1 (en) Method and apparatus for generating interaction component of client ui, terminal, and medium
CN109271624A (en) Target word determination method, apparatus and storage medium
CN115293127A (en) Contract document information comparison method, device and system
WO2015084757A1 (en) Systems and methods for processing data stored in a database
CN117573145A (en) Automatic game release method, system, equipment and medium
US11314793B2 (en) Query processing
CN113505889B (en) Processing method and device of mapping knowledge base, computer equipment and storage medium
CN115270777A (en) A method, device and system for extracting contract document information
CN120196735A (en) A method, device, equipment and storage medium for determining question and answer generated by retrieval enhancement
KR20230053361A (en) Method, apparatus and computer-readable recording medium for generating product images displayed in an internet shopping mall based on an input image
CN115438221A (en) Recommendation method, device and electronic equipment based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication