CN103559313A - Searching method and device - Google Patents
- Publication number
- CN103559313A (application CN201310586096.XA)
- Authority
- CN
- China
- Prior art keywords
- search
- dictionary
- search word
- client
- default
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a search method and device. The search method includes: obtaining a default thesaurus; counting how many times each search term is sent by users through a client, and adding the search terms whose count exceeds a predetermined value to the default thesaurus to obtain a current thesaurus; and receiving a search term sent by a user through the client, searching for that term in the current thesaurus to obtain a search result, and returning the search result to the client for display to the user. By counting the search terms that users send and adding those whose count exceeds the predetermined value to the default thesaurus, the method and device make it easier for popular terms to hit relevant material, which can improve the search hit rate.
Description
Technical field
The invention relates to computer technology, and in particular to a search method and device.
Background
The emergence of search engines brought together information from many websites and gave users a means of navigating it. Search engines fall into two kinds, vertical search engines and general search engines:
A general search engine, like the portal sites that first appeared on the Internet, integrates and navigates a large amount of information and answers queries very quickly, gathering the information from all websites on one platform for users. As a result, the value of information was widely recognized by businesses for the first time, and general search quickly became the most valuable field on the Internet;
A vertical search engine is a specialized search engine for a particular industry. It is a subdivision and extension of the search engine: it integrates a certain type of specialized information in the web page library, extracts the required data field by field in a targeted way, processes it, and returns it to the user in some form.
Vertical search is a new search engine service model proposed in response to the large information volume, imprecise queries, and insufficient depth of general search engines. It provides valuable information and related services for a specific field, a specific group of people, or a specific need. It is characterized by being "specialized, precise, and deep" and has an industry flavor; compared with the disordered mass of information in general search engines, vertical search engines are more focused, specific, and in-depth.
The hit rate of existing vertical search depends heavily on the thesaurus, and only an accurate thesaurus yields a good search experience. A thesaurus that is reasonably complete and quick to update is therefore needed.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a search method and device that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a search method is provided, including:
obtaining a default thesaurus;
counting how many times each search term is sent by users through a client, and adding the search terms whose count exceeds a predetermined value to the default thesaurus to obtain a current thesaurus;
receiving a search term sent by a user through the client, searching for the search term in the current thesaurus to obtain a search result, and returning the search result to the client for display to the user.
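The three steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation disclosed in the patent: the function names `build_current_thesaurus` and `search`, the in-memory sets and lists, and the rule that a term must be in the thesaurus before documents are matched are all hypothetical simplifications.

```python
from collections import Counter


def build_current_thesaurus(default_thesaurus, search_log, threshold):
    """Return the default thesaurus plus every term searched more than `threshold` times."""
    counts = Counter(search_log)
    hot_terms = {term for term, n in counts.items() if n > threshold}
    return set(default_thesaurus) | hot_terms


def search(term, thesaurus, documents):
    """Return the documents containing `term`, but only if the term is in the thesaurus.

    Requiring thesaurus membership is an assumption made for this sketch.
    """
    if term not in thesaurus:
        return []
    return [doc for doc in documents if term in doc]


# Hypothetical sample data.
default_thesaurus = {"sword", "card"}
search_log = ["dragon saber"] * 6 + ["rare typo"] * 2
documents = ["how to obtain the dragon saber", "card trading guide"]

current = build_current_thesaurus(default_thesaurus, search_log, threshold=5)
print(search("dragon saber", current, documents))
```

With a threshold of 5, the term searched six times joins the current thesaurus and the term searched twice does not.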
According to another aspect of the present invention, a search device is provided, including:
an obtaining module, adapted to obtain a default thesaurus;
an adding module, adapted to count how many times each search term is sent by users through a client, and to add the search terms whose count exceeds a predetermined value to the default thesaurus to obtain a current thesaurus;
a search module, adapted to receive a search term sent by a user through the client, search for the search term in the current thesaurus to obtain a search result, and return the search result to the client for display to the user.
By counting the search terms that users send and adding those whose count exceeds the predetermined value to the default thesaurus, the above search method and device make it easier for popular terms to hit relevant material, which can improve the search hit rate.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the contents of the description, and so that the above and other objects, features, and advantages of the invention are easier to understand, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered as limiting the invention. Throughout the drawings, the same reference numerals designate the same parts. In the drawings:
Fig. 1a shows a flowchart of a search method according to an embodiment of the present invention;
Fig. 1b shows a flowchart of a search method according to another embodiment of the present invention;
Fig. 2 shows a flowchart of a search method according to another embodiment of the present invention;
Fig. 3 shows a schematic structural diagram of a search device according to an embodiment of the present invention;
Fig. 4 shows a schematic structural diagram of a search device according to another embodiment of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed fully to those skilled in the art.
Fig. 1a shows a flowchart of a search method according to an embodiment of the present invention. As shown in Fig. 1a, the search method includes:
Step S101: obtaining a default thesaurus;
The default thesaurus is obtained by parsing, extracting, and filtering web pages crawled from the Internet, and then performing word segmentation on the processed page content;
The default thesaurus includes default thesauruses of different categories; for example, the game thesaurus includes a martial arts game thesaurus, a business simulation game thesaurus, and so on;
Step S103: counting how many times each search term is sent by users through the client, and adding the search terms whose count exceeds a predetermined value to the default thesaurus to obtain the current thesaurus;
Step S103 includes: counting how many times each search term is sent by users through the client, determining the category of each search term, and adding the search terms in that category whose count exceeds the predetermined value to the default thesaurus of the corresponding category, to obtain the current thesaurus of that category.
Because the search terms are saved in a log, an hourly script can write the search terms saved in the log into a vocabulary: if a term is not yet in the vocabulary, it is added; if the term is already in the vocabulary, its count is increased by one.
The implementation code for obtaining the search terms from the log is as follows:
The obtained search terms are written into the vocabulary: if a term is not yet in the vocabulary, it is added; if the term is already in the vocabulary, its count is increased by one. The specific implementation code is as follows:
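The original code listings do not survive in this text. The following is only a minimal sketch of the two operations just described, reading search terms from a log and updating the vocabulary counts. The log line format (`timestamp<TAB>term`), the function names `extract_terms` and `update_vocabulary`, and the choice of 1 as the initial count of a newly added term are assumptions made for illustration.

```python
def extract_terms(log_lines):
    """Yield the search term from each log line.

    Assumed (hypothetical) line format: 'timestamp<TAB>term'.
    Lines that do not match this format are skipped.
    """
    for line in log_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2 and parts[1].strip():
            yield parts[1].strip()


def update_vocabulary(vocabulary, terms):
    """Add unseen terms to the vocabulary and increment the count of known ones."""
    for term in terms:
        if term not in vocabulary:
            vocabulary[term] = 1      # new term: assumed to start at a count of 1
        else:
            vocabulary[term] += 1     # known term: count plus one
    return vocabulary


# Hypothetical log content for one hourly run.
log_lines = [
    "2013-11-20 10:00\t倚天剑\n",
    "2013-11-20 10:01\t陷害卡\n",
    "2013-11-20 10:02\t倚天剑\n",
    "malformed line\n",
]
vocabulary = update_vocabulary({}, extract_terms(log_lines))
print(vocabulary)
```

In a deployment such a script would be run on a schedule (hourly in the description above) and would persist the vocabulary between runs; both are left out of the sketch.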
The vocabulary includes keywords, the count of each keyword, and thesaurus separator lines; the keywords may be Chinese words.
The format of the vocabulary is shown in Table 1:
Table 1 Format of the vocabulary
The category of each search term can be determined from the relevance between the term and the thesauruses of the different categories: when the relevance between the current search term and the thesaurus of one or more categories exceeds a predetermined value, the current search term is assigned the same category as those thesauruses. The category can also be determined from the current search term by an empirical algorithm. The categories include the categories of various applications, for example the game category.
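The text does not say how relevance is computed. As a purely illustrative sketch, the code below assumes a character-set Jaccard overlap between the search term and the best-matching thesaurus entry, and assigns every category whose relevance exceeds a threshold. The measure, the names `relevance` and `categories_for`, the threshold 0.15, and the sample entries are all hypothetical.

```python
def relevance(term, thesaurus):
    """Best character-set Jaccard overlap between `term` and any thesaurus entry."""
    term_chars = set(term)
    best = 0.0
    for entry in thesaurus:
        entry_chars = set(entry)
        union = term_chars | entry_chars
        if union:
            best = max(best, len(term_chars & entry_chars) / len(union))
    return best


def categories_for(term, thesauri, min_relevance):
    """Return every category whose thesaurus relevance to `term` exceeds `min_relevance`."""
    return [
        name
        for name, words in thesauri.items()
        if relevance(term, words) > min_relevance
    ]


# Hypothetical per-category thesauruses.
thesauri = {
    "martial_arts": {"倚天屠龙记", "屠龙刀"},
    "business_sim": {"大富翁", "机会卡"},
}
print(categories_for("倚天剑", thesauri, 0.15))
print(categories_for("陷害卡", thesauri, 0.15))
```

A term may match several categories or none; a production system would more likely use a proper text-similarity or classification model than this character overlap.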
The implementation code for adding a vocabulary containing the search terms of a given category whose count exceeds the predetermined value to the default thesaurus of the corresponding category is as follows:
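The original listing is not available in this text. A minimal sketch of the merge, assuming the vocabulary is a term-to-count mapping and that each term's categories have already been determined, might look as follows; the function name `add_hot_terms` and all sample data are hypothetical.

```python
def add_hot_terms(default_thesauri, vocabulary, term_categories, threshold):
    """Build the per-category current thesauruses.

    Each current thesaurus holds the category's default entries plus every
    term of that category whose count is greater than `threshold`.
    The default thesauruses themselves are left unmodified.
    """
    current = {name: set(words) for name, words in default_thesauri.items()}
    for term, count in vocabulary.items():
        if count > threshold:
            for category in term_categories.get(term, []):
                current.setdefault(category, set()).add(term)
    return current


# Hypothetical inputs.
default_thesauri = {"martial_arts": {"屠龙刀"}, "business_sim": {"大富翁"}}
vocabulary = {"倚天剑": 10, "陷害卡": 6, "low count term": 3}
term_categories = {
    "倚天剑": ["martial_arts"],
    "陷害卡": ["business_sim"],
    "low count term": ["martial_arts"],
}
current_thesauri = add_hot_terms(default_thesauri, vocabulary, term_categories, 5)
print(current_thesauri)
```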
Step S105: receiving a search term sent by a user through the client, searching for the search term in the current thesaurus to obtain a search result, and returning the search result to the client for display to the user.
In addition, after the current thesaurus is obtained, the method may further include step S104: updating the index of the current thesaurus, as shown in Fig. 1b.
If the index of the current thesaurus was updated before step S105, the search term entered by the user can be searched for in the current thesaurus with the updated index to obtain the search result.
As the above description shows, the search involved in the embodiments of the present invention is a vertical search, that is, a search within a particular field such as games. Because the hit rate of vertical search depends heavily on the thesaurus, a thesaurus that is reasonably complete and quick to update is especially important. The embodiments of the present invention allow the thesaurus to be updated conveniently and quickly, which gives a better search experience.
The above search method is especially suitable for fields where timeliness matters, such as games. By counting the search terms that users send and adding those whose count exceeds the predetermined value to the default thesaurus, the method makes it easier for popular terms to hit relevant material, which can improve the search hit rate.
Fig. 2 shows a flowchart of a search method according to another embodiment of the present invention. As shown in Fig. 2, the method includes:
Step S201: obtaining the vocabulary that needs to be added to the default thesaurus;
Before this step, the number of times each search term is sent by users through the client is counted. A specific implementation may be: an hourly script writes the search terms saved in the log into the vocabulary; if a term is not yet in the vocabulary, it is added, and if the term is already in the vocabulary, its count is increased by one. The search terms whose count exceeds the predetermined value are then kept in the vocabulary, and the search terms whose count is below the predetermined value are deleted from it.
Assume that the current vocabulary is as shown in Table 2:
Table 2 Current vocabulary
The predetermined value can be set as needed. For example, it can be set to 5, in which case the vocabulary that can be added to the corresponding default thesaurus is as shown in Table 3. It can also be set to another value, such as 8, in which case the vocabulary that can be added to the corresponding default thesaurus is as shown in Table 4.
Table 3 A vocabulary to be added to the default thesaurus
Table 4 Another vocabulary to be added to the default thesaurus
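The effect of the two thresholds can be illustrated with a short sketch. Tables 2 to 4 are not reproduced in this text, so the vocabulary below is hypothetical except for the two entries and counts that appear in the converted examples later in the description (倚天剑 with 10 and 陷害卡 with 6). The function name `prune_vocabulary` is likewise an assumption, and a count exactly equal to the threshold is treated as not exceeding it.

```python
def prune_vocabulary(vocabulary, threshold):
    """Keep only the terms whose count is strictly greater than `threshold`."""
    return {term: count for term, count in vocabulary.items() if count > threshold}


# Hypothetical current vocabulary; only the first two entries come from the text.
current_vocabulary = {"倚天剑": 10, "陷害卡": 6, "placeholder term": 2}

with_threshold_5 = prune_vocabulary(current_vocabulary, 5)
with_threshold_8 = prune_vocabulary(current_vocabulary, 8)
print(with_threshold_5)
print(with_threshold_8)
```

A threshold of 5 keeps both named terms, while a threshold of 8 keeps only 倚天剑, which matches the two converted examples given in the description.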
Step S202: processing the format of the vocabulary to generate a vocabulary text that meets a predetermined format;
The predetermined format can be set flexibly as needed, for example to the mmseg format or another format. mmseg is a common dictionary-based word segmentation system for Chinese. It relies mainly on forward maximum matching, supplemented by several disambiguation rules; because it is simple to implement and runs fast, its results are relatively good and it is widely used. The system usually comprises a dictionary, two matching algorithms, and four ambiguity resolution rules.
For example, Table 3 can be converted into the following format:
倚天剑[tab]10
x:1
陷害卡[tab]6
x:1
(other lines)
Table 4 can be converted into the following format:
倚天剑[tab]10
x:1
(other lines)
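The conversion shown above, one `term<TAB>count` line followed by an `x:1` line per entry, can be sketched as follows. The function name `to_mmseg_text` is hypothetical, and the sketch assumes the vocabulary is an ordered term-to-count mapping.

```python
def to_mmseg_text(vocabulary):
    """Render a term-to-count mapping as mmseg-style word list text.

    Each entry becomes two lines: 'term<TAB>count' and 'x:1'.
    """
    return "".join(f"{term}\t{count}\nx:1\n" for term, count in vocabulary.items())


table_3 = {"倚天剑": 10, "陷害卡": 6}
table_4 = {"倚天剑": 10}
print(to_mmseg_text(table_3), end="")
```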
Step S203: appending the generated vocabulary text that meets the predetermined format to the end of the original vocabulary text unigram.txt of the corresponding category, saving the result as a new vocabulary text unigram_new.txt, copying it to the directory where mmseg is located, and generating a new thesaurus;
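The append part of step S203 could be sketched as below. This is an illustration only: the function name `build_new_wordlist` is hypothetical, the files are created in a temporary directory rather than in the mmseg directory, and the sample content of unigram.txt is invented. Compiling the resulting text into a thesaurus is left to the mmseg tool.

```python
import tempfile
from pathlib import Path


def build_new_wordlist(original_path, additions, output_path):
    """Write the original word list followed by `additions` to `output_path`."""
    base = Path(original_path).read_text(encoding="utf-8")
    if base and not base.endswith("\n"):
        base += "\n"  # make sure the appended entries start on a new line
    output = Path(output_path)
    output.write_text(base + additions, encoding="utf-8")
    return output


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "unigram.txt"
    # Hypothetical existing content of the category's word list.
    original.write_text("屠龙刀\t1\nx:1\n", encoding="utf-8")
    new_path = build_new_wordlist(
        original, "倚天剑\t10\nx:1\n", Path(tmp) / "unigram_new.txt"
    )
    new_text = new_path.read_text(encoding="utf-8")

print(new_text, end="")
```

Writing to a separate unigram_new.txt, as the step describes, leaves the original word list untouched until the new thesaurus has been generated.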
For example, the new thesaurus unigram_new.txt.uni can be generated as follows:
/usr/local/mmseg3/bin/mmseg -u /usr/local/mmseg3/etc/unigram_new.txt
In this embodiment, the predetermined value is assumed to be 5, so the Yitian Sword (倚天剑) can be added to the thesaurus of martial arts games such as Yitian Tulongji (倚天屠龙记), and the Framing Card (陷害卡) can be added to the thesaurus of business simulation games such as Monopoly (大富翁). In other words, different types of games have their own default thesauruses, and a high-frequency search term entered by users can only be added to the thesaurus of the corresponding type of game. With these steps, the thesaurus for a newly released game also becomes more complete.
Step S204: replacing the default thesaurus with the new thesaurus;
For example, the replacement can be carried out as follows:
mv /usr/local/mmseg3/etc/unigram_new.txt.uni /usr/local/mmseg3/etc/uni.lib
Through steps S201 to S204, the terms that users are more interested in, that is, the terms whose search count exceeds the predetermined value, are added to the corresponding default thesaurus;
Step S205: periodically updating the index of the current thesaurus and restarting the search component searchd;
A specific implementation is as follows:
/usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/c.conf -all-pidfile -rotate
Stop searchd:
ps auxww | grep searchd
kill -9 23230
Start searchd:
/usr/local/coreseek/bin/searchd -c /usr/local/coreseek/etc/c.conf -console-pidfile
searchd is the component that actually handles searches. At run time it behaves like a service: it communicates with the various application programming interfaces (APIs) called by client applications and is responsible for accepting queries, processing them, and returning result sets.
Unlike the indexer, searchd is not meant to be invoked from the command line or from ordinary scripts. Instead, it is either started as a daemon by init.d (on Unix/Linux-like systems) or used as a service (on Windows-like systems). Not all command-line options are therefore always available; this depends on the build-time options.
Step S206: returning search results according to the received keyword.
Once the index of the thesaurus has been updated and the search component restarted, searches can be performed.
As the above description shows, the search involved in the embodiments of the present invention is a vertical search, that is, a search within a particular field such as games. Because the hit rate of vertical search depends heavily on the thesaurus, a thesaurus that is reasonably complete and quick to update is especially important. The embodiments of the present invention allow the thesaurus to be updated conveniently and quickly, which gives a better search experience.
By counting the search terms that users send and adding those whose count exceeds the predetermined value to the default thesaurus, the above search method makes it easier for popular terms to hit relevant material, which can improve the search hit rate.
Fig. 3 shows a schematic structural diagram of a search device according to an embodiment of the present invention. As shown in Fig. 3, the search device includes an obtaining module 31, an adding module 32, and a search module 33, wherein:
The obtaining module 31 is adapted to obtain a default thesaurus. The adding module 32 is adapted to count how many times each search term is sent by users through the client, and to add the search terms whose count exceeds a predetermined value to the default thesaurus to obtain a current thesaurus. The search module 33 is adapted to receive a search term sent by a user through the client, search for the search term in the current thesaurus to obtain a search result, and return the search result to the client for display to the user.
The default thesaurus is obtained by parsing, extracting, and filtering web pages crawled from the Internet, and then performing word segmentation on the processed page content. The obtaining module is specifically adapted to obtain default thesauruses of different categories; for example, the game thesaurus includes a martial arts game thesaurus, a business simulation game thesaurus, and so on.
Because the search terms are saved in a log, an hourly script can write the search terms saved in the log into a vocabulary: if a term is not yet in the vocabulary, it is added; if the term is already in the vocabulary, its count is increased by one. This completes the count for each search term. At the same time, the category of each search term is determined, and the search terms whose count exceeds the predetermined value are added to the default thesaurus of the corresponding category to generate the current thesaurus of that category. The vocabulary may include keywords, the count of each keyword, and thesaurus separator lines; for the format of the vocabulary, see Table 1. The keywords may be Chinese words, English words, or other words.
Specifically, the category of each search term can be determined from the relevance between the term and the thesauruses of the different categories: when the relevance between the search term and the thesaurus of one or more categories exceeds a predetermined value, the search term is assigned the same category as those thesauruses. The category can also be determined from the current search term by an empirical algorithm. The categories include the categories of various applications, for example the game category.
In addition, while generating the current thesaurus of a category, the format of the vocabulary needs to be processed so that the vocabulary becomes a vocabulary text meeting a predetermined format; the predetermined format may be the mmseg format or another format. Specifically, the generated vocabulary text that meets the predetermined format is appended to the end of the original vocabulary text unigram.txt of the corresponding category, saved as a new vocabulary text unigram_new.txt, and copied to the directory where mmseg is located, so that a new thesaurus is generated.
Further, the search device may also include an updating module 34, as shown in Fig. 4. The updating module is adapted to update the index of the current thesaurus after the adding module 32 obtains it, so that the search module 33, after receiving a search term sent by a user through the client, searches for the term in the current thesaurus with the updated index, obtains a search result, and returns the search result to the client for display to the user.
The above search device is especially suitable for fields where timeliness matters, such as games.
As the above description shows, the search involved in the embodiments of the present invention is a vertical search, that is, a search within a particular field such as games. Because the hit rate of vertical search depends heavily on the thesaurus, a thesaurus that is reasonably complete and quick to update is especially important. The embodiments of the present invention allow the thesaurus to be updated conveniently and quickly, which gives a better search experience.
By counting the search terms that users send and adding those whose count exceeds the predetermined value to the default thesaurus, the above search device makes it easier for popular terms to hit relevant material, which can improve the search hit rate.
在此提供的算法和显示不与任何特定计算机、虚拟系统或者其它设备固有相关。各种通用系统也可以与基于在此的示教一起使用。根据上面的描述,构造这类系统所要求的结构是显而易见的。此外,本发明也不针对任何特定编程语言。应当明白,可以利用各种编程语言实现在此描述的本发明的内容,并且上面对特定语言所做的描述是为了披露本发明的最佳实施方式。The algorithms and displays presented herein are not inherently related to any particular computer, virtual system, or other device. Various generic systems can also be used with the teachings based on this. The structure required to construct such a system is apparent from the above description. Furthermore, the present invention is not specific to any particular programming language. It should be understood that various programming languages can be used to implement the content of the present invention described herein, and the above description of specific languages is for disclosing the best mode of the present invention.
在此处所提供的说明书中,说明了大量具体细节。然而,能够理解,本发明的实施例可以在没有这些具体细节的情况下实践。在一些实例中,并未详细示出公知的方法、结构和技术,以便不模糊对本说明书的理解。In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure the understanding of this description.
类似地,应当理解,为了精简本公开并帮助理解各个发明方面中的一个或多个,在上面对本发明的示例性实施例的描述中,本发明的各个特征有时被一起分组到单个实施例、图、或者对其的描述中。然而,并不应将该公开的方法解释成反映如下意图:即所要求保护的本发明要求比在每个权利要求中所明确记载的特征更多的特征。更确切地说,如下面的权利要求书所反映的那样,发明方面在于少于前面公开的单个实施例的所有特征。因此,遵循具体实施方式的权利要求书由此明确地并入该具体实施方式,其中每个权利要求本身都作为本发明的单独实施例。Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, in order to streamline this disclosure and to facilitate an understanding of one or more of the various inventive aspects, various features of the invention are sometimes grouped together in a single embodiment, figure, or its description. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of this invention.
本领域那些技术人员可以理解,可以对实施例中的设备中的模块进行自适应性地改变并且把它们设置在与该实施例不同的一个或多个设备中。可以把实施例中的模块或单元或组件组合成一个模块或单元或组件,以及此外可以把它们分成多个子模块或子单元或子组件。除了这样的特征和/或过程或者单元中的至少一些是相互排斥之外,可以采用任何组合对本说明书(包括伴随的权利要求、摘要和附图)中公开的所有特征以及如此公开的任何方法或者设备的所有过程或单元进行组合。除非另外明确陈述,本说明书(包括伴随的权利要求、摘要和附图)中公开的每个特征可以由提供相同、等同或相似目的的替代特征来代替。Those skilled in the art can understand that the modules in the device in the embodiment can be adaptively changed and arranged in one or more devices different from the embodiment. Modules or units or components in the embodiments may be combined into one module or unit or component, and furthermore may be divided into a plurality of sub-modules or sub-units or sub-assemblies. All features disclosed in this specification (including accompanying claims, abstract and drawings), as well as any method or method so disclosed, may be used in any combination, except that at least some of such features and/or processes or units are mutually exclusive. All processes or units of equipment are combined. Each feature disclosed in this specification (including accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
此外,本领域的技术人员能够理解,尽管在此所述的一些实施例包括其它实施例中所包括的某些特征而不是其它特征,但是不同实施例的特征的组合意味着处于本发明的范围之内并且形成不同的实施例。例如,在下面的权利要求书中,所要求保护的实施例的任意之一都可以以任意的组合方式来使用。Furthermore, those skilled in the art will understand that although some embodiments described herein include some features included in other embodiments but not others, combinations of features from different embodiments are meant to be within the scope of the invention. and form different embodiments. For example, in the following claims, any one of the claimed embodiments may be used in any combination.
本发明的各个部件实施例可以以硬件实现,或者以在一个或者多个处理器上运行的软件模块实现,或者以它们的组合实现。本领域的技术人员应当理解,可以在实践中使用微处理器或者数字信号处理器(DSP)来实现根据本发明实施例的搜索装置中的一些或者全部部件的一些或者全部功能。本发明还可以实现为用于执行这里所描述的方法的一部分或者全部的设备或者装置程序(例如,计算机程序和计算机程序产品)。这样的实现本发明的程序可以存储在计算机可读介质上,或者可以具有一个或者多个信号的形式。这样的信号可以从因特网网站上下载得到,或者在载体信号上提供,或者以任何其他形式提供。The various component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all functions of some or all components in the search device according to the embodiments of the present invention. The present invention can also be implemented as an apparatus or an apparatus program (for example, a computer program and a computer program product) for performing a part or all of the methods described herein. Such a program for realizing the present invention may be stored on a computer-readable medium, or may be in the form of one or more signals. Such a signal may be downloaded from an Internet site, or provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201310586096.XA CN103559313B (en) | 2013-11-20 | 2013-11-20 | Searching method and device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN103559313A (en) | 2014-02-05 |
| CN103559313B (en) | 2018-02-23 |
Family
ID=50013559
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201310586096.XA Active CN103559313B (en) | 2013-11-20 | 2013-11-20 | Searching method and device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN103559313B (en) |
Patent Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1763739A (en) * | 2004-10-21 | 2006-04-26 | 北京大学 | Semantic-Based Retrieval Method in Search Engine |
| CN1936893A (en) * | 2006-06-02 | 2007-03-28 | 北京搜狗科技发展有限公司 | Method and system for generating input-method word frequency base based on internet information |
| CN101079056A (en) * | 2007-02-06 | 2007-11-28 | 腾讯科技(深圳)有限公司 | Retrieving method and system |
| CN101038596A (en) * | 2007-04-29 | 2007-09-19 | 北京搜狗科技发展有限公司 | Method and system for classifying website |
| US20100114878A1 (en) * | 2008-10-22 | 2010-05-06 | Yumao Lu | Selective term weighting for web search based on automatic semantic parsing |
| CN102289436A (en) * | 2010-06-18 | 2011-12-21 | 阿里巴巴集团控股有限公司 | Method and device for determining weighted value of search term and method and device for generating search results |
| CN103064838A (en) * | 2011-10-19 | 2013-04-24 | 阿里巴巴集团控股有限公司 | Data searching method and device |
| CN103106227A (en) * | 2012-08-03 | 2013-05-15 | 人民搜索网络股份公司 | System and method of looking up new word based on webpage text |
Non-Patent Citations (2)
| Title |
|---|
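| 王一丁Z: "为coreseek添加mmseg分词" [Adding mmseg word segmentation to coreseek], 《HTTP://MY.OSCHINA.NET/U/660307/BLOG/158440》, 1 September 2013 (2013-09-01) * |
| 陈红涛 等 [Chen Hongtao et al.]: "基于大规模中文搜索引擎的搜索日志挖掘" [Search log mining based on a large-scale Chinese search engine], 《计算机应用研究》 [Application Research of Computers], vol. 25, no. 6, 11 August 2008 (2008-08-11), pages 1663 - 1665 * |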
Cited By (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105404661A (en) * | 2015-11-05 | 2016-03-16 | 浪潮(北京)电子信息产业有限公司 | Index file updating method and system |
| CN105893626A (en) * | 2016-05-10 | 2016-08-24 | 中广核工程有限公司 | Index library creation method used for nuclear power engineering and index system adopting index library creation method |
| CN106502980A (en) * | 2016-10-09 | 2017-03-15 | 武汉斗鱼网络科技有限公司 | A kind of search method and system based on text morpheme cutting |
| CN106502980B (en) * | 2016-10-09 | 2019-05-17 | 武汉斗鱼网络科技有限公司 | A retrieval method and system based on text morpheme segmentation |
| CN106971000B (en) * | 2017-04-12 | 2020-04-28 | 北京焦点新干线信息技术有限公司 | A search method and device |
| CN107247798A (en) * | 2017-06-27 | 2017-10-13 | 北京京东尚科信息技术有限公司 | The method and apparatus for building search dictionary |
| WO2019056958A1 (en) * | 2017-09-22 | 2019-03-28 | 阿里巴巴集团控股有限公司 | Trending keyword acquisition method, device and server |
| CN109542612A (en) * | 2017-09-22 | 2019-03-29 | 阿里巴巴集团控股有限公司 | A kind of hot spot keyword acquisition methods, device and server |
| CN112507181A (en) * | 2019-09-16 | 2021-03-16 | 百度在线网络技术(北京)有限公司 | Search request classification method and device, electronic equipment and storage medium |
| CN112507181B (en) * | 2019-09-16 | 2023-09-29 | 百度在线网络技术(北京)有限公司 | Search request classification method, device, electronic equipment and storage medium |
| CN115587243A (en) * | 2022-09-28 | 2023-01-10 | 云南腾云信息产业有限公司 | Multi-mode rich search word bank optimized search word segmentation method, device, equipment and storage medium |
| CN116955284A (en) * | 2023-07-26 | 2023-10-27 | 中国银行股份有限公司 | Log retrieval word stock updating method, device, equipment and medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN103559313B (en) | 2018-02-23 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |
| | TR01 | Transfer of patent right | Effective date of registration: 20220727. Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015. Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd. Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park). Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.; Qizhi software (Beijing) Co.,Ltd. |