TWI486796B - Text filtering method and text filtering system - Google Patents
- Publication number: TWI486796B (application TW099113502A)
- Authority: TW (Taiwan)
- Prior art keywords: character, matching, node, keyword, current
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Description
This application relates to the field of Internet application technology, and in particular to a text filtering method and a text filtering system.
As the Internet continues to develop, the volume of information online grows rapidly, and the openness of the Internet also means that a great deal of undesirable information circulates on it. Monitoring and filtering information on the Internet has therefore become a common requirement.
Content filtering technology can filter out undesirable information online and thereby help keep the network environment safe. Information on the network takes many forms, of which text is the most common. Text filtering is the process of finding specific text within a large body of text. At present, common text filtering methods are based on basic keyword matching: the system searches the input text for a number of preset keywords associated with undesirable information, and if it finds content that matches a keyword, it filters or replaces that content or the entire input text.
Such methods can only pick out text that exactly matches a keyword; they cannot judge the stance or attitude of the text as a whole. For example, suppose an e-commerce website defines “竊聽器” (eavesdropping device) as a filter keyword. Existing methods would then also treat a legitimate text such as “禁止銷售竊聽器” (the sale of eavesdropping devices is prohibited) as undesirable information and filter it. Existing text filtering methods based on basic keyword matching therefore have low recognition accuracy and cannot meet the practical needs of information filtering.
To solve the above technical problem, embodiments of this application provide a text filtering method and a text filtering system that improve the accuracy of text filtering. The technical solution is as follows. This application provides a text filtering method, comprising: defining semantic keywords in the text filtering system in advance, each semantic keyword consisting of at least basic keywords and logical operators; after the text filtering system obtains an input text, searching the input text, according to the predefined semantic keyword, for the basic keywords that make up the semantic keyword; if text content matching at least one of the basic keywords is found in the input text, further performing semantic matching on the found text content, the semantic matching comprising matching the found text content against the semantic keyword according to the logical operators that make up the semantic keyword; and if the semantic matching succeeds, filtering the successfully matched text content.
This application also provides a text filtering system, comprising: a keyword storage unit for storing predefined semantic keywords, each semantic keyword consisting of at least basic keywords and logical operators; a basic search unit for searching the input text, after the text filtering system obtains it and according to the predefined semantic keyword, for the basic keywords that make up the semantic keyword; a semantic matching unit for further performing semantic matching on the found text content when the basic search unit finds, in the input text, text content matching at least one of the basic keywords, the semantic matching unit including a logic matching sub-unit for matching the found text content against the semantic keyword according to the logical operators that make up the semantic keyword; and a filter processing unit for filtering the successfully matched text content when the semantic matching unit matches successfully.
The text filtering method and system provided by this application filter text content using a combination of basic keywords and logical operators. Compared with the prior art, they can take into account the meaning of the basic keywords within the text as a whole, which improves filtering accuracy.
Existing text filtering methods filter on simple keywords alone and have no capacity for logical analysis, so they produce many false positives. Take the text “禁止銷售竊聽器” (the sale of eavesdropping devices is prohibited) mentioned above: it contains the keyword “竊聽器” (eavesdropping device), but combined with the negating word “禁止” (prohibited), the text is in fact legitimate information and should not be filtered. To address this problem, embodiments of this application provide the following text filtering method: semantic keywords are defined in the text filtering system in advance, each consisting of at least basic keywords and logical operators; after the text filtering system obtains an input text, it searches the input text, according to the predefined semantic keyword, for the basic keywords that make up the semantic keyword; if text content matching at least one of the basic keywords is found in the input text, semantic matching is further performed on the found text content, the semantic matching comprising matching the found text content against the semantic keyword according to the logical operators that make up the semantic keyword; and if the semantic matching succeeds, the successfully matched text content is filtered.
The above method filters text content using a combination of basic keywords and logical operators. Compared with the prior art, it can take into account the meaning of the basic keywords within the text as a whole, which reduces false positives and improves filtering accuracy.
To help those skilled in the art better understand the technical solutions of this application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of this application.
Embodiment 1:
In the embodiments of this application, text content is filtered on the basis of semantic keywords. A semantic keyword has two basic components: basic keywords and logical operators. A basic keyword is a standalone word or phrase, equivalent to the simple keyword used in the prior art. A logical operator expresses a logical relation; the basic logical relations include AND, OR, and NOT, which in a semantic keyword can be written with the symbols “&”, “|”, and “~” respectively. The following are a few simple examples of semantic keywords for text filtering on an e-commerce website:
a) 手機竊聽~反
This semantic keyword means: if the product information contains “手機竊聽” (mobile phone eavesdropping) and does not contain “反” (anti-), the product information needs to be filtered.
b) 監視攝像頭|無線監控攝像頭
This semantic keyword means: if the product information contains “監視攝像頭” (surveillance camera) or contains “無線監控攝像頭” (wireless surveillance camera), the product information needs to be filtered.
c) 軍用&紮帶
This semantic keyword means: if the product information contains “軍用” (military) and also contains “紮帶” (cable tie), the product information needs to be filtered.
The simplest form of a semantic keyword is two basic keywords plus one logical operator, as in the three examples above. A semantic keyword with only one basic keyword is in effect the same as the prior art and is not discussed further here. A semantic keyword may of course contain more basic keywords and logical operators to express more complex semantics, for example:
d) 手機竊聽~(反|防)
This semantic keyword means: if the product information contains “手機竊聽” (mobile phone eavesdropping) and contains neither “反” (anti-) nor “防” (prevention), the product information needs to be filtered.
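The semantics of examples a) to d) can be sketched as a small expression evaluator. This is a minimal illustration, not the patent's implementation: the function names `tokenize` and `matches` are hypothetical, substring containment stands in for the basic keyword search, and, since the text does not define operator precedence, the sketch assumes operators are applied left to right with parentheses for grouping. Following the examples, “~” is treated as a binary "and not".

```python
import re

# One token is either an operator/parenthesis or a run of keyword characters.
TOKEN = re.compile(r"\s*([&|~()]|[^&|~()\s]+)")
OPERATORS = {"&", "|", "~"}


def tokenize(rule):
    """Split a semantic keyword such as '手機竊聽~(反|防)' into tokens."""
    return TOKEN.findall(rule)


def matches(rule, text):
    """Return True if `text` satisfies the semantic keyword `rule`.

    Assumption: operators are evaluated left to right with equal precedence.
    """
    tokens = tokenize(rule)
    pos = 0

    def operand():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = expression()
            pos += 1  # skip the closing ")"
            return value
        return token in text  # a basic keyword: true if it occurs in the text

    def expression():
        nonlocal pos
        value = operand()
        while pos < len(tokens) and tokens[pos] in OPERATORS:
            op = tokens[pos]
            pos += 1
            right = operand()
            if op == "&":
                value = value and right
            elif op == "|":
                value = value or right
            else:  # "~": left side present and right side absent
                value = value and not right
        return value

    return expression()
```

With this sketch, `matches("手機竊聽~反", "出售手機竊聽軟體")` is true, while the same rule applied to `"反手機竊聽設備"` is false because the text contains “反”.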
In a preferred solution of this application, the content of a semantic keyword can be extended further, for example:
A filter condition can be added to a semantic keyword. Unlike the basic keywords and logical operators described above, a filter condition is independent of the specific content of the text; it further restricts filtering by other attributes of the text, such as its source or category, which makes filtering more precise.
A filter action can also be added to a semantic keyword to state how content that matches the text part of the semantic keyword is to be handled, for example by masking or replacing the content.
The following three examples add a filter condition and a filter action to a), b), and c) above to illustrate the extended form of a semantic keyword. The part before the semicolon holds the basic keywords and logical operators, the part after it holds the extensions, and the extensions are separated by commas. This embodiment does not, of course, restrict the specific format of a semantic keyword.
a1) 手機竊聽~反;商品類別:1002,過濾行為:下架,
This semantic keyword means: if the product information contains “手機竊聽” (mobile phone eavesdropping), does not contain “反” (anti-), and the product category (商品類別) is 1002, the product is to be delisted (the filter action, 過濾行為, is 下架).
b1) 監視攝像頭|無線監控攝像頭;商品類別:101,過濾行為:下架,
This semantic keyword means: if the product information contains “監視攝像頭” (surveillance camera) or “無線監控攝像頭” (wireless surveillance camera), and the product category is 101, the product is to be delisted.
c1) 軍用&紮帶;商品類別:50001,過濾行為:下架,
This semantic keyword means: if the product information contains “軍用” (military) and “紮帶” (cable tie), and the product category is 50001, the product is to be delisted.
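As a rough illustration of the extended form, the sketch below splits a rule such as a1) into its expression, filter condition, and filter action. The function name `parse_rule` and the returned field names are hypothetical, and the sketch assumes only the two extension keys used in the examples (商品類別 for the product category, 過濾行為 for the filter action); the text itself leaves the concrete format open.

```python
def parse_rule(rule):
    """Split an extended semantic keyword into expression, category, and action.

    Assumptions: the part before the first semicolon is the expression, the
    extensions are 'key:value' pairs separated by commas, and both full-width
    and ASCII punctuation may appear.
    """
    # Normalize full-width punctuation so one code path handles both styles.
    normalized = rule.replace(";", ";").replace(":", ":").replace(",", ",")
    expression, _, extras = normalized.partition(";")
    fields = {}
    for item in extras.split(","):
        if item:  # skip the empty piece left by a trailing comma
            key, _, value = item.partition(":")
            fields[key.strip()] = value.strip()
    return {
        "expression": expression.strip(),
        "category": fields.get("商品類別"),  # filter condition, None if absent
        "action": fields.get("過濾行為"),    # filter action, None if absent
    }
```

A rule without extensions, such as example c), yields `None` for both the category and the action, which a caller could treat as "no condition" and "use the default action".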
This embodiment is further described below with reference to a specific flow. FIG. 1 is a flowchart of the text filtering method of an embodiment of this application, which includes the following steps:
S101: after obtaining an input text, the text filtering system searches it, according to the predefined semantic keyword, for the basic keywords that make up the semantic keyword. In this step, having obtained a piece of input text, the system first searches it for the basic keywords and records the results. For b) or b1) above, for example, the system first searches the input text for “監視攝像頭” (surveillance camera) and “無線監控攝像頭” (wireless surveillance camera). This step can be implemented much like the simple keyword matching of the prior art and is not described in detail here.
S102: if text content matching at least one basic keyword is found in the input text, semantic matching is further performed on the found text content. S101 searches only on the content of the basic keywords. If nothing matching any basic keyword is found, the input text needs no filtering. If text content matching at least one basic keyword is found, the found content must then be compared with the complete semantic keyword; this step is called semantic matching.
If the semantic keyword contains only basic keywords and logical operators, semantic matching consists of matching the found text content against the semantic keyword according to the logical operators in the predefined semantic keyword. For example, for a) above, suppose the system finds the basic keyword “手機竊聽” (mobile phone eavesdropping) in the input text and does not find the basic keyword “反” (anti-). The actual search results for the two basic keywords satisfy the NOT relation defined between them in semantic keyword a), so the found content matches semantic keyword a). For c) above, suppose the system finds the basic keyword “紮帶” (cable tie) in the input text but does not find the basic keyword “軍用” (military). The actual search results do not satisfy the AND relation defined between the two basic keywords in semantic keyword c), so the found content fails to match semantic keyword c). If the semantic keyword also contains the extension "filter condition", semantic matching must additionally check whether the attributes of the input text match the filter condition.
S103: if the semantic matching succeeds, the successfully matched text content is filtered.
The system filters any text that matched a semantic keyword in S102. If the semantic keyword contains a filter action, the system filters the text according to that filter action. If it does not, the system filters the text content in a preset default manner.
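Steps S101 to S103 can be sketched as one function. This is a simplified illustration under stated assumptions: the class `SemanticKeyword`, the function `filter_text`, and the default action `"mask"` are all hypothetical names, the logical relation is supplied as a Python callable over the set of found keywords rather than parsed from a rule string, and substring search stands in for the basic keyword search.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class SemanticKeyword:
    base_keywords: List[str]            # basic keywords that make up the rule
    logic: Callable[[Set[str]], bool]   # logical relation over found keywords
    category: Optional[str] = None      # optional filter condition
    action: str = "mask"                # assumed default filter action


def filter_text(text, category, rule):
    """Return the filter action to apply, or None if the text passes."""
    # S101: look for every basic keyword and record which ones were found.
    found = {kw for kw in rule.base_keywords if kw in text}
    if not found:
        return None  # nothing matched, so no filtering is needed

    # S102: semantic matching on the logical relation, then the filter condition.
    if not rule.logic(found):
        return None
    if rule.category is not None and rule.category != category:
        return None

    # S103: matching succeeded; report the action to take.
    return rule.action


# Rule a1) from the text: contains 手機竊聽, does not contain 反, category 1002.
rule_a1 = SemanticKeyword(
    base_keywords=["手機竊聽", "反"],
    logic=lambda found: "手機竊聽" in found and "反" not in found,
    category="1002",
    action="下架",
)
```

Here `filter_text("出售手機竊聽軟體", "1002", rule_a1)` returns the delisting action, while a text containing “反”, or an item in a different category, returns `None`.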
Embodiment 2:
In the prior art, each word has to be searched for in the input text one by one. This embodiment proposes an improved basic keyword search method for step S101 of Embodiment 1 to make keyword search more efficient.
In practical text filtering, many of the words to be filtered share a common part, for example “竊聽器” (eavesdropping device), “竊聽設備” (eavesdropping equipment), and “竊聽軟體” (eavesdropping software). For such words, a tree search can be used to improve search efficiency.
First, each basic keyword is stored in the system character by character in a tree structure. The first character of a basic keyword is the root node and its last character is a leaf node; basic keywords with the same first character share one root node. For example, the three basic keywords “ab”, “abc”, and “ade” can be stored in the structure shown in FIG. 2.
In FIG. 2, a circle denotes a root node or an ordinary node, and a diamond denotes a leaf node. Because the three words “ab”, “abc”, and “ade” have the same first character “a”, they share the same root node 1. The last characters of the three words are “b”, “c”, and “e”, so these three characters are leaf nodes 2, 3, and 5 respectively. Note that although the character “b” is not the last character of the second word, it is the last character of the first word and is therefore still a leaf node. In other words, a leaf node is not necessarily a terminal node of the tree, but every terminal node of the tree is a leaf node.
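The storage scheme described for FIG. 2 can be sketched as follows. The names `Node` and `build_forest` are hypothetical, and children are kept in a Python dict keyed by character, which is an implementation choice of this sketch rather than something the text specifies. The `is_leaf` flag follows the text's usage: it marks a node that ends some keyword, even if the node has children.

```python
class Node:
    """One character of one or more basic keywords."""

    def __init__(self, char):
        self.char = char
        self.children = {}    # next character -> Node
        self.is_leaf = False  # True if some keyword ends at this node


def build_forest(keywords):
    """Store keywords character by character; same first character, same root."""
    roots = {}
    for word in keywords:
        level = roots
        node = None
        for ch in word:
            node = level.setdefault(ch, Node(ch))
            level = node.children
        if node is not None:
            node.is_leaf = True  # the last character of the keyword
    return roots


roots = build_forest(["ab", "abc", "ade"])
a = roots["a"]
# "b" ends the keyword "ab" and also leads on to "c", so it is a leaf with a child.
print(a.children["b"].is_leaf, sorted(a.children["b"].children))
```

For the three keywords in the text, the forest has a single root “a” with children “b” and “d”; “b”, “c”, and “e” are flagged as leaves and “d” is not.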
FIG. 3 is a flowchart of the basic keyword search method based on the tree structure, which includes the following steps:
S301: obtain one character of the input text; set it as the current character, and set a root node of the tree structure as the current node. Depending on the needs of the filtering application, the character obtained may be the first character of the input text or a character taken from any position in it.
S302: match the current character against the current node; if the match succeeds, go to S303, otherwise go to S304.
S303: determine whether the current node has a child node. If not, end the search. If so, move to the character following the current character and to a child node of the current node, then go to S302.
S304: determine whether the current node has a sibling node. If not, end the search. If so, keep the current character unchanged, move to the sibling node of the current node, then go to S302.
After the search ends, the system joins the current node to the root node to obtain the matching path, and determines the basic keywords found from the successfully matched leaf nodes on that path.
The tree-based basic keyword search method is illustrated below with two concrete examples:
1) Suppose the input text is adf. After obtaining the character “a”, the system traverses the root nodes in the keyword store and finds that it matches node 1. Node 1 has child nodes, so the system goes on to match the character “d” against node 1's child nodes 2 and 4.
The character “d” matches node 4, and node 4 has a child node, so the system goes on to match the character “f” against node 4's child node 5. The character “f” fails to match node 5, and node 5 has no sibling node, so the search ends. The current matching path is 1-4-5, and it contains no successfully matched leaf node (node 5 is a leaf node but did not match), so the system determines that no basic keyword was found in the input text.
2) Suppose the input text is abc. After obtaining the character “a”, the system traverses the root nodes in the keyword store and finds that it matches node 1. Node 1 has child nodes, so the system goes on to match the character “b” against node 1's child nodes 2 and 4.
The character “b” matches node 2, and node 2 has a child node, so the system goes on to match the character “c” against node 2's child node 3. The character “c” matches node 3, and node 3 has no child node, so the search ends. The current matching path is 1-2-3, in which nodes 2 and 3 are both successfully matched leaf nodes. From the contents of nodes 2 and 3, the system determines that the basic keywords “ab” and “abc” were found in the input text.
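The two worked examples can be reproduced with a compact sketch of the search. It is an approximation rather than a literal transcription of S301 to S304: the names `build` and `search_from` and the marker key `END` are hypothetical, and a dict lookup among a node's children replaces the explicit walk over sibling nodes in S304. The result is the same set of keywords found along the matching path.

```python
END = "$end"  # hypothetical marker: a keyword ends at this node (a "leaf")


def build(keywords):
    """Build a nested-dict tree; each key is one character."""
    root = {}
    for word in keywords:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def search_from(tree, text, start=0):
    """Return every stored keyword that begins at text[start]."""
    found = []
    node = tree
    for i in range(start, len(text)):
        node = node.get(text[i])  # stands in for trying each sibling in turn
        if node is None:          # no child or sibling matches: end the search
            break
        if END in node:           # a successfully matched leaf on the path
            found.append(text[start:i + 1])
    return found


tree = build(["ab", "abc", "ade"])
print(search_from(tree, "adf"))  # []
print(search_from(tree, "abc"))  # ['ab', 'abc']
```

The `start` parameter corresponds to S301's remark that the first character examined may come from any position in the input text; scanning a whole text would call `search_from` once per starting position.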
With the tree-based basic keyword search method above, the matching at each level is performed only against the node that matched at the previous level. There is no need to compare every character of the input text with every character of every keyword one by one, which makes keyword search considerably more efficient.
The examples above use the first character as the root node, which suits the case where several basic keywords share a prefix. Where several basic keywords share a suffix, for example “電話竊聽” (telephone eavesdropping), “手機竊聽” (mobile phone eavesdropping), and “手機監聽” (mobile phone monitoring), the keywords can instead be stored in a tree whose root node is the last character of a basic keyword and whose leaf node is its first character. Correspondingly, during matching the characters of the input text are matched from back to front. The implementation is similar to the one described above and is not repeated here.
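The suffix-rooted variant can be sketched by storing each keyword reversed and scanning the input from back to front. As before, this is only an illustration: the names `build_reversed` and `search_backward` and the `"$end"` marker are hypothetical, and a dict lookup stands in for the child and sibling traversal.

```python
def build_reversed(keywords):
    """Build a nested-dict tree rooted at each keyword's last character."""
    root = {}
    for word in keywords:
        node = root
        for ch in reversed(word):
            node = node.setdefault(ch, {})
        node["$end"] = True  # the keyword's first character is the leaf
    return root


def search_backward(tree, text, end=None):
    """Return every stored keyword that ends just before position `end`."""
    if end is None:
        end = len(text)
    found = []
    node = tree
    for i in range(end - 1, -1, -1):  # match characters from back to front
        node = node.get(text[i])
        if node is None:
            break
        if "$end" in node:
            found.append(text[i:end])
    return found


suffix_tree = build_reversed(["電話竊聽", "手機竊聽", "手機監聽"])
print(len(suffix_tree))  # 1: all three keywords share the root "聽"
print(search_backward(suffix_tree, "出售手機竊聽"))  # ['手機竊聽']
```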
In addition, to evade text filtering, many people now insert special characters into the text they post, for example “竊-聽-器”, “竊聽器”, and so on. For such cases, a dictionary function can additionally be used when searching for keywords.
A dictionary defines a set of characters and the prototype of each character. The prototype may be the character itself (the prototype of the character ‘a’ is ‘a’ itself) or another character (the prototype of a traditional Chinese character is the corresponding simplified character). Commonly used dictionaries include a simplified Chinese dictionary, a traditional Chinese dictionary, an English dictionary, a digit dictionary, and so on. Operations staff can also define their own dictionaries according to actual needs, for example defining the prototype of the character “-” as the empty character.
With reference to step S302 above, before matching the current character against the current node, the system can look up in the dictionary whether the current character has a prototype character. If it does, the system converts it to that prototype character and matches the prototype character, as the current character, against the current node.
Taking example 2) of this embodiment, suppose the input text is aBc. Before matching the character “B” against node 2, the system traverses all dictionaries and finds that the character “B” has the prototype “b”. It converts the “B” in the original input text to the prototype “b” and then matches “b”, as the current character, against node 2.
For text such as “竊-聽-器”, the system consults the dictionary and converts the character “-” to the empty character. During matching, after the system has matched “竊”, it skips the empty character and matches “聽” directly.
By consulting the dictionary and converting characters, the system can recognize more undesirable information and thus filter text more effectively.
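The dictionary lookup can be sketched as a character-to-prototype mapping. This is a simplified illustration: the names `build_dictionary` and `normalize` are hypothetical, the mapping covers only ASCII uppercase letters and the hyphen, and the whole text is converted up front, whereas the text describes converting each character just before it is matched in S302. The matching outcome is the same for these examples.

```python
import string


def build_dictionary():
    """Map each character to its prototype; unmapped characters are their own."""
    table = dict(zip(string.ascii_uppercase, string.ascii_lowercase))
    table["-"] = ""  # custom entry: the hyphen's prototype is the empty character
    return table


def normalize(text, table):
    """Replace every character by its prototype, dropping empty prototypes."""
    return "".join(table.get(ch, ch) for ch in text)


table = build_dictionary()
print(normalize("aBc", table))      # abc
print(normalize("竊-聽-器", table))  # 竊聽器
```

A traditional-to-simplified dictionary would add entries of the same shape, provided the stored keywords use the same prototype form.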
相應於上面的方法實施例,本申請還提供一種文本過濾系統,參見圖4所示,包括:關鍵字儲存單元410,用於儲存預先定義的語義關鍵字,該語義關鍵字,至少由基本關鍵字和邏輯關係符構成;基本查找單元420,用於在該文本過濾系統獲得輸入文本後,根據預先定義的語義關鍵字,在該輸入文本中查找構成該語義關鍵字的基本關鍵字;語義匹配單元430,用於在該基本查找單元420在該輸入文本中查找到與至少一個該基本關鍵字相匹配的文本內容時,進一步對查找到的文本內容進行語義匹配;該語義匹配單元430包括:用於根據構成該語義關鍵字的邏輯關係符,將所查找到的文本內容與該語義關鍵字進行匹配的邏輯匹配子單元431;過濾處理單元440,用於在該語義匹配單元430匹配成功時,對匹配成功的文本內容進行過濾處理。Corresponding to the above method embodiment, the present application further provides a text filtering system, as shown in FIG. 4, including: a keyword storage unit 410, configured to store a predefined semantic keyword, the semantic keyword, at least by a basic key a basic search unit 420, configured to: after the text filtering system obtains the input text, search for the basic keyword constituting the semantic keyword in the input text according to the predefined semantic keyword; semantic matching The unit 430 is configured to further perform semantic matching on the found text content when the basic search unit 420 finds the text content matching the at least one basic keyword in the input text; the semantic matching unit 430 includes: a logic matching sub-unit 431 for matching the found text content with the semantic keyword according to a logical relationship constituting the semantic keyword; and a filtering processing unit 440, configured to be successful when the semantic matching unit 430 matches , filter the text content that matches the success.
The keyword storage unit stores the basic keywords character by character in a tree structure, in which the first character of a basic keyword is a root node, the last character is a leaf node, and basic keywords with the same first character share the same root node. As shown in FIG. 5, the basic search unit 420 may include: a text acquisition sub-unit 421, configured to acquire one character c1 of the input text; a character matching sub-unit 422, configured to take c1 as the current character and a root node of the tree structure as the current node, and to match the current character against the current node, wherein if the current character matches the current node and the current node has a child node, the character following the current character is matched against the child node of the current node, and if the current character fails to match the current node and the current node has a sibling node, the current character is matched against the sibling node of the current node, this step being repeated; and a determining sub-unit 423, configured to connect the current node to the root node to obtain a matching path, and to determine the found basic keyword according to the successfully matched leaf node on the matching path. As shown in FIG. 6, the basic search unit 420 may further include a character conversion sub-unit 424, configured to look up in a dictionary, before the character matching sub-unit 422 performs matching, whether the current character has a prototype character and, if so, to convert it into the corresponding prototype character; the character matching sub-unit 422 then takes the prototype character as the current character and matches it against the current node.
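The tree-based lookup and the prototype-character conversion can be illustrated with a small sketch. It makes several assumptions that are not in the text: children are held in a dictionary instead of explicit child and sibling links, a single artificial super-root stands in for the set of first-character root nodes, the names `Node`, `build_tree` and `find_keywords` are hypothetical, and the prototype dictionary `{"@": "a"}` is an invented example.

```python
from typing import Dict, List, Optional, Tuple


class Node:
    """One character of the keyword tree."""

    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        # True when this character ends a basic keyword (the "leaf" role in
        # the description; a node may also end one keyword and prefix another).
        self.ends_keyword = False


def build_tree(keywords: List[str]) -> Node:
    """Store keywords character by character; shared prefixes share nodes."""
    root = Node()  # artificial super-root (assumption of this sketch)
    for word in keywords:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.ends_keyword = True
    return root


def find_keywords(
    text: str, root: Node, prototypes: Optional[Dict[str, str]] = None
) -> List[Tuple[int, str]]:
    """Return (start index, keyword) for every basic keyword found in text."""
    prototypes = prototypes or {}
    # Role of sub-unit 424: map each character to its prototype, if any.
    chars = [prototypes.get(ch, ch) for ch in text]
    hits: List[Tuple[int, str]] = []
    for start in range(len(chars)):
        node = root
        for end in range(start, len(chars)):
            nxt = node.children.get(chars[end])
            if nxt is None:
                break  # no child (or sibling) matches: stop this path
            node = nxt
            if node.ends_keyword:
                # The path from the root to this node spells the keyword.
                hits.append((start, "".join(chars[start:end + 1])))
    return hits


tree = build_tree(["spam", "spa", "scam"])
print(find_keywords("sp@m scam", tree, {"@": "a"}))
```

A dictionary lookup replaces the "try the sibling node" step of the description: testing each sibling in turn and looking a character up among the children both answer the same question, namely whether any node at this level matches the current character.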
The semantic keyword may further include a filtering condition, in which case the semantic matching unit 430 further includes a category matching sub-unit 432 configured to match the attributes of the input text against the filtering condition, as shown in FIG. 7.
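As a rough illustration of the category matching step, the sketch below treats both the attributes of the input text and the filtering condition as key/value mappings. That representation, the function name `category_match`, and the attribute names `channel` and `lang` are assumptions made for the example; the text does not specify how attributes or conditions are encoded.

```python
from typing import Mapping


def category_match(attributes: Mapping[str, str], condition: Mapping[str, str]) -> bool:
    """Role of sub-unit 432: every attribute required by the condition must match.

    An empty condition places no restriction on the text, so it always matches.
    """
    return all(attributes.get(key) == value for key, value in condition.items())


post = {"channel": "forum", "lang": "en"}
print(category_match(post, {"channel": "forum"}))  # condition satisfied
print(category_match(post, {"channel": "mail"}))   # condition not satisfied
```

Under this reading, a semantic keyword would only be applied to texts whose attributes satisfy its condition, so the same basic keywords can be filtered in one context and allowed in another.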
The semantic keyword may further include a filtering behavior, in which case the filtering processing unit filters the found text content according to that filtering behavior.
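One way to picture a per-keyword filtering behavior is a dispatch table from a behavior name to a handler. The three behaviors shown here (`mask`, `delete`, `reject`) and all function names are hypothetical examples chosen for the sketch; the text states only that the filtering processing unit acts according to the filtering behavior carried by the semantic keyword.

```python
from typing import Callable, Dict, List, Optional


def mask(text: str, hits: List[str]) -> Optional[str]:
    """Replace each matched keyword with asterisks of the same length."""
    for hit in hits:
        text = text.replace(hit, "*" * len(hit))
    return text


def delete(text: str, hits: List[str]) -> Optional[str]:
    """Remove each matched keyword from the text."""
    for hit in hits:
        text = text.replace(hit, "")
    return text


def reject(text: str, hits: List[str]) -> Optional[str]:
    """Refuse the whole text (None) when anything matched."""
    return None if hits else text


# Hypothetical behavior names mapped to their handlers.
BEHAVIORS: Dict[str, Callable[[str, List[str]], Optional[str]]] = {
    "mask": mask,
    "delete": delete,
    "reject": reject,
}


def apply_behavior(text: str, hits: List[str], behavior: str) -> Optional[str]:
    """Role of unit 440: filter the matched content as the keyword prescribes."""
    return BEHAVIORS[behavior](text, hits)


print(apply_behavior("buy spam now", ["spam"], "mask"))
```

Keeping the behavior as data attached to the keyword, as the description suggests, means new behaviors can be added without touching the search or matching stages.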
For convenience of description, the above system is described in terms of units divided by function. When implementing the present application, the functions of the units may of course be implemented in one or more pieces of software and/or hardware.
From the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner; for the parts that are the same or similar, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is substantially similar to the method embodiment, it is described relatively briefly, and the relevant parts can be found in the description of the method embodiment. The system embodiment described above is merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
The present application can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
The present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The above are only specific embodiments of the present application. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present application, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present application.
410 ... Keyword storage unit
420 ... Basic search unit
430 ... Semantic matching unit
431 ... Logic matching sub-unit
440 ... Filtering processing unit
421 ... Text acquisition sub-unit
422 ... Character matching sub-unit
423 ... Determining sub-unit
424 ... Character conversion sub-unit
432 ... Category matching sub-unit
To describe the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a text filtering method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a tree storage structure for basic keywords according to an embodiment of the present application;
FIG. 3 is a flowchart of a basic keyword search method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a text filtering system according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a basic search unit according to an embodiment of the present application;
FIG. 6 is another schematic structural diagram of a basic search unit according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a semantic matching unit according to an embodiment of the present application.
Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW099113502A TWI486796B (en) | 2010-04-28 | 2010-04-28 | Text filtering method and text filtering system |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW099113502A TWI486796B (en) | 2010-04-28 | 2010-04-28 | Text filtering method and text filtering system |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TW201137635A TW201137635A (en) | 2011-11-01 |
| TWI486796B true TWI486796B (en) | 2015-06-01 |
Family
ID=46759587
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW099113502A TWI486796B (en) | 2010-04-28 | 2010-04-28 | Text filtering method and text filtering system |
Country Status (1)
| Country | Link |
|---|---|
| TW (1) | TWI486796B (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20030191847A1 (en) * | 2002-01-16 | 2003-10-09 | Xerox Corporation | Symmetrical structural pattern matching |
| US20060248067A1 (en) * | 2005-04-29 | 2006-11-02 | Brooks David A | Method and system for providing a shared search index in a peer to peer network |
| TW200939045A (en) * | 2008-03-06 | 2009-09-16 | Fu-Ren Lin | A method for comparing documents |
- 2010-04-28: TW application TW099113502A, patent TWI486796B (en); status: not active, IP right cessation
Also Published As
| Publication number | Publication date |
|---|---|
| TW201137635A (en) | 2011-11-01 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | MM4A | Annulment or lapse of patent due to non-payment of fees | |