CN103425639A - Similar information identifying method based on information fingerprints - Google Patents
- Publication number
- CN103425639A
- Authority
- CN
- China
- Prior art keywords
- information
- recognition methods
- information fingerprint
- word
- methods based
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Collating Specific Patterns (AREA)
Abstract
The invention discloses a similar information identifying method based on information fingerprints. According to the method, firstly, Chinese word segmenting is carried out on texts of each document, word frequency is calculated, and words with high word frequency are taken out to serve as feature values; according to the extracted feature values, an information fingerprint of each document is calculated, the information fingerprints of the two documents are compared, and if a comparison result is larger than a threshold value, then the result that the two documents are similar can be judged. According to the method, the situation that in the prior art, all information of two documents needs to be calculated and compared can be avoided, and calculation complexity is greatly reduced; due to the fact that the information fingerprint of each document is unique, when similarity of multiple documents is judged, what is needed is to compare the information fingerprints, and work efficiency can be effectively improved.
Description
Technical field
The present invention relates to a method for identifying similar information based on information fingerprints.
Background art
Existing methods for identifying duplicate information mainly compute an MD5 digest of each piece of information and then compare the two MD5 values: if the values are identical, the two pieces of information are treated as the same; if the values differ, they are treated as different. Existing methods for identifying similar information mainly split the two pieces of information into characters, compare the characters in order, and derive the similarity between the two pieces of information from the percentage of positions at which the characters match exactly.
The main defect of the existing duplicate-detection technique is that it can only recognize character strings that are exactly identical. If one of two otherwise identical pieces of information contains an extra space or other character, the program judges that they are not duplicates, so accuracy is low. The main defect of the existing similar-information method is that every comparison has to go through all of the split characters, so the amount of computation is large and performance is very poor in a big-data environment.
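The two prior-art approaches described above can be sketched roughly as follows. This is only an illustration and not code from the patent; the function names `md5_equal` and `positional_similarity` are hypothetical. It shows how one inserted space defeats both the exact MD5 comparison and the position-by-position character comparison.

```python
import hashlib


def md5_equal(a: str, b: str) -> bool:
    """Exact-duplicate check: compare the MD5 digests of two texts."""
    digest_a = hashlib.md5(a.encode("utf-8")).digest()
    digest_b = hashlib.md5(b.encode("utf-8")).digest()
    return digest_a == digest_b


def positional_similarity(a: str, b: str) -> float:
    """Fraction of positions at which both texts hold the same character."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / longest


original = "information fingerprint"
shifted = " " + original  # one inserted space shifts every later character

print(md5_equal(original, original))
print(md5_equal(original, shifted))
print(positional_similarity(original, shifted))
```

Because the inserted space moves every following character by one position, the positional score collapses even though the two texts are almost the same.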
Summary of the invention
The purpose of the present invention is to provide a method for identifying similar information that is highly accurate and suitable for big-data environments.
The similar information identifying method based on information fingerprints of the present invention comprises the following steps:
Performing Chinese word segmentation on the text of a document;
Counting word frequencies and taking the words with the highest frequencies as feature values;
Calculating the information fingerprint of the document from the extracted feature values;
Comparing the information fingerprints of two documents and, if the comparison result is greater than a threshold, judging the two documents to be similar.
The similar information identifying method based on information fingerprints of the present invention extracts the words with the highest frequencies as feature values, calculates the information fingerprint from them, and uses the fingerprint to judge whether documents are similar. Compared with the existing duplicate-information identifying method, adding a small number of characters to one of the documents does not affect the result, so the accuracy of the judgment is improved. In addition, because the information fingerprint of each document is unique, judging the similarity of multiple documents only requires comparing their information fingerprints with one another, which greatly alleviates the large computation cost and poor big-data performance of the existing similar-information identifying method.
Brief description of the drawings
Fig. 1 is a flowchart of the similar information identifying method based on information fingerprints of the present invention.
Fig. 2 is a flowchart of the steps for calculating the information fingerprint of a document according to the present invention.
Detailed description of the embodiments
As shown in Fig. 1, the similar information identifying method based on information fingerprints comprises the following steps:
Performing Chinese word segmentation on the text of a document;
Counting word frequencies and taking the words with the highest frequencies as feature values;
Calculating the information fingerprint of the document from the extracted feature values;
Comparing the information fingerprints of two documents and, if the comparison result is greater than a threshold, judging the two documents to be similar.
When 15 to 25 words are extracted as feature values, the performance requirements of the identifying method are essentially met; a large number of sampling tests and calculations found that 20 words is the optimal choice.
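The word-frequency step can be sketched as below. This is a minimal sketch under stated assumptions: the text has already been split into words by some Chinese word segmenter (not shown, and not specified in the source), the helper name `top_features` is hypothetical, and ties between equally frequent words are broken alphabetically, which the source does not address.

```python
from collections import Counter


def top_features(tokens, k=20):
    """Return the k most frequent words as (word, count) pairs."""
    counts = Counter(tokens)
    # Sort by descending count, then by word, so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Assumed output of a word segmenter for one short document.
tokens = ["信息", "指纹", "信息", "文档", "指纹", "信息"]
print(top_features(tokens, k=2))
```

The default of `k=20` follows the value the description reports as the best choice; any value from 15 to 25 is stated to be acceptable.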
As shown in Fig. 2, the steps for calculating the information fingerprint of a document comprise:
Performing a 64-bit polynomial hash operation on each extracted feature value to obtain a 64-bit hash value;
Processing each 64-bit hash value bit by bit: if the i-th bit of the hash value is 1, the i-th position takes the weight of the feature; if the i-th bit of the hash value is 0, the i-th position takes the negative of the feature weight;
The weight of a feature is numerically equal to the number of times the word occurs;
After all feature values have been processed, adding them position by position (column-wise) to obtain a sequence of 64 numbers; finally, each position holding a positive number is set to 1 and each position holding a negative number is set to 0, which yields a 64-bit array of 0s and 1s, namely the information fingerprint of the information.
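The fingerprint calculation can be sketched as follows. The source does not specify which polynomial hash is used, so `poly_hash64` with base 131 over UTF-8 bytes is a hypothetical stand-in; the source also does not say what happens when a column sums to exactly zero, and this sketch assumes the bit is then set to 0.

```python
MASK64 = (1 << 64) - 1


def poly_hash64(word: str, base: int = 131) -> int:
    """Hypothetical 64-bit polynomial hash over the UTF-8 bytes of a word."""
    h = 0
    for byte in word.encode("utf-8"):
        h = (h * base + byte) & MASK64
    return h


def fingerprint(features) -> int:
    """Compute a 64-bit fingerprint from (word, weight) pairs."""
    sums = [0] * 64
    for word, weight in features:
        h = poly_hash64(word)
        for i in range(64):
            # Bit i set: add the weight; bit i clear: subtract the weight.
            sums[i] += weight if (h >> i) & 1 else -weight
    fp = 0
    for i, total in enumerate(sums):
        if total > 0:  # positive sum gives 1; zero or negative gives 0 (assumption for zero)
            fp |= 1 << i
    return fp


print(format(fingerprint([("信息", 3), ("指纹", 2), ("文档", 1)]), "064b"))
```

Here the weight of each word is its occurrence count, as the description states, so the `(word, count)` pairs from the frequency step can be passed in directly.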
The 64-bit hash operation is chosen because 64 bits yield 2^64 possible combinations, which satisfies the repetition-rate requirement of the present invention. With 32 bits the repetition rate would still be rather high, while 128 bits is too long and would affect computing performance, so the 64-bit hash operation is chosen as a compromise.
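One way to get a feel for this trade-off is the birthday-bound approximation for the chance that at least two of n uniformly random b-bit values coincide, p ≈ 1 − exp(−n² / 2^(b+1)). Reading the "repetition rate" as this collision probability is an assumption, as are the helper name `collision_probability` and the sample size of one million.

```python
import math


def collision_probability(n_items: int, bits: int) -> float:
    """Birthday-bound approximation: 1 - exp(-n^2 / 2^(bits + 1))."""
    exponent = (n_items ** 2) / 2.0 ** (bits + 1)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-exponent)


for bits in (32, 64, 128):
    print(bits, collision_probability(1_000_000, bits))
```

Under these assumptions a collision among one million 32-bit values is almost certain, while for 64-bit values it is already very unlikely, which is in line with the compromise described in the text.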
When comparing the information fingerprints of two documents, an XNOR or XOR operation is applied, and the similarity of the two documents can be judged quickly from the number of 0s or 1s in the result.
When the XOR operation is used to compare the information fingerprints of two documents, the number of 1s in the result is counted. If the count is zero, the two pieces of information are identical. The more 1s occur, the more the two pieces of information differ. The threshold chosen for this method is 3: if the number of 1s is less than or equal to 3, the two are judged to be similar information.
When the XNOR operation is used to compare the information fingerprints of two documents, the number of 0s in the result is counted. If the count is zero, the two pieces of information are identical. The more 0s occur, the more the two pieces of information differ. The threshold chosen is again 3: if the number of 0s is less than or equal to 3, the two are judged to be similar information.
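Both comparison variants count the bit positions at which two fingerprints differ, so they give the same number. A minimal sketch, assuming fingerprints are held as 64-bit Python integers; the function names are hypothetical and the threshold of 3 is the value given in the description.

```python
MASK64 = (1 << 64) - 1
THRESHOLD = 3  # value stated in the description


def differing_bits_xor(fp_a: int, fp_b: int) -> int:
    """Count the 1 bits of the XOR, i.e. positions where the fingerprints differ."""
    return bin((fp_a ^ fp_b) & MASK64).count("1")


def differing_bits_xnor(fp_a: int, fp_b: int) -> int:
    """Count the 0 bits of the 64-bit XNOR, which marks the same positions."""
    xnor = ~(fp_a ^ fp_b) & MASK64
    return 64 - bin(xnor).count("1")


def is_similar(fp_a: int, fp_b: int) -> bool:
    """Two fingerprints are judged similar if at most THRESHOLD bits differ."""
    return differing_bits_xor(fp_a, fp_b) <= THRESHOLD


a = 0b1011
b = 0b1000  # differs from a in two bit positions
print(differing_bits_xor(a, b), differing_bits_xnor(a, b), is_similar(a, b))
```

The count of differing bits is the Hamming distance between the two fingerprints, so only one 64-bit operation and one bit count are needed per document pair.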
Claims (7)
1. A similar information identifying method based on information fingerprints, characterized in that the method comprises the following steps:
Performing Chinese word segmentation on the text of a document;
Counting word frequencies and taking the words with the highest frequencies as feature values;
Calculating the information fingerprint of the document from the extracted feature values;
Comparing the information fingerprints of two documents and, if the comparison result is greater than a threshold, judging the two documents to be similar.
2. The similar information identifying method based on information fingerprints according to claim 1, characterized in that calculating the information fingerprint of a document comprises the following steps:
Performing a 64-bit polynomial hash operation on each extracted feature value to obtain a 64-bit hash value;
Processing each 64-bit hash value bit by bit: if the i-th bit of the hash value is 1, the i-th position takes the weight of the feature; if the i-th bit of the hash value is 0, the i-th position takes the negative of the feature weight;
After all feature values have been processed, adding them position by position (column-wise) to obtain a sequence of 64 numbers; finally, each position holding a positive number is set to 1 and each position holding a negative number is set to 0, which yields a 64-bit array of 0s and 1s.
3. The similar information identifying method based on information fingerprints according to claim 2, characterized in that the weight of a feature is numerically equal to the number of times the word occurs.
4. The similar information identifying method based on information fingerprints according to claim 1, characterized in that the number of words extracted as feature values is 15 to 25.
5. The similar information identifying method based on information fingerprints according to claim 1, characterized in that the number of words extracted as feature values is 20.
6. The similar information identifying method based on information fingerprints according to claim 1, characterized in that an XNOR logical operation is used when comparing the information fingerprints of two documents.
7. The similar information identifying method based on information fingerprints according to claim 1, characterized in that an XOR operation is used when comparing the information fingerprints of two documents.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2013104024655A CN103425639A (en) | 2013-09-06 | 2013-09-06 | Similar information identifying method based on information fingerprints |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN103425639A true CN103425639A (en) | 2013-12-04 |
Family
ID=49650403
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN2013104024655A Pending CN103425639A (en) | 2013-09-06 | 2013-09-06 | Similar information identifying method based on information fingerprints |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN103425639A (en) |
Cited By (17)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105260878A (en) * | 2015-09-23 | 2016-01-20 | 成都网安科技发展有限公司 | Auxiliary secret-level setting method and device |
| CN105681046A (en) * | 2016-02-29 | 2016-06-15 | 郑州悉知信息科技股份有限公司 | UGC fingerprint signature determination method and device as well as UGC deduplication method and device |
| CN105844118A (en) * | 2016-04-15 | 2016-08-10 | 宝利九章(北京)数据技术有限公司 | Methods and system for preventing data leakage |
| CN105844214A (en) * | 2016-03-02 | 2016-08-10 | 华南理工大学 | Multi-path depth encoded information fingerprint extraction method based on bit space |
| CN105893859A (en) * | 2016-04-15 | 2016-08-24 | 宝利九章(北京)数据技术有限公司 | Data leakage prevention method and system |
| CN105956482A (en) * | 2016-04-15 | 2016-09-21 | 宝利九章(北京)数据技术有限公司 | Method and system for data leakage protection |
| CN105955978A (en) * | 2016-04-15 | 2016-09-21 | 宝利九章(北京)数据技术有限公司 | Method and system for data leakage protection |
| CN106649214A (en) * | 2016-10-21 | 2017-05-10 | 天津海量信息技术股份有限公司 | Internet information content similarity definition method |
| CN106649257A (en) * | 2016-09-21 | 2017-05-10 | 联动优势科技有限公司 | Semantic section conversion method and device |
| CN106873964A (en) * | 2016-12-23 | 2017-06-20 | 浙江工业大学 | A kind of improved SimHash detection method of code similarities |
| CN107368472A (en) * | 2017-07-26 | 2017-11-21 | 成都科来软件有限公司 | It is a kind of can iteration optimization document analysis result store method |
| CN108282328A (en) * | 2018-02-02 | 2018-07-13 | 沈阳航空航天大学 | A kind of ciphertext statistical method based on homomorphic cryptography |
| CN109145080A (en) * | 2018-07-26 | 2019-01-04 | 新华三信息安全技术有限公司 | A kind of text fingerprints preparation method and device |
| CN110019642A (en) * | 2017-08-06 | 2019-07-16 | 北京国双科技有限公司 | A kind of Similar Text detection method and device |
| CN110134761A (en) * | 2019-04-16 | 2019-08-16 | 深圳壹账通智能科技有限公司 | Judgment document information retrieval method, device, computer equipment and storage medium |
| CN110891010A (en) * | 2018-09-05 | 2020-03-17 | 百度在线网络技术(北京)有限公司 | Method and apparatus for transmitting information |
| CN112733523A (en) * | 2020-12-30 | 2021-04-30 | 深信服科技股份有限公司 | Document sending method, device, equipment and storage medium |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8140505B1 (en) * | 2005-03-31 | 2012-03-20 | Google Inc. | Near-duplicate document detection for web crawling |
| CN103123618A (en) * | 2011-11-21 | 2013-05-29 | 北京新媒传信科技有限公司 | Text similarity obtaining method and device |
-
2013
- 2013-09-06 CN CN2013104024655A patent/CN103425639A/en active Pending
Non-Patent Citations (4)
| Title |
|---|
| MIN LU等: "Rank hash similarity for fast similarity", 《INFORMATION PROCESSING & MANAGEMENT》, vol. 49, no. 1, 31 January 2013 (2013-01-31), pages 158 - 168 * |
| 段飞: "相似网页识别算法的研究与实现", 《中国优秀硕士学位论文全文数据库》, 15 September 2011 (2011-09-15) * |
| 胡可云等: "《数据挖掘理论与应用》", 30 April 2008, article "数据挖掘理论与应用", pages: 124-125 * |
| 董博等: "基于多SimHash指纹的近似文本检测", 《小型微型计算机系统》, vol. 32, no. 11, 30 November 2011 (2011-11-30), pages 2152 - 2157 * |
Cited By (25)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105260878A (en) * | 2015-09-23 | 2016-01-20 | 成都网安科技发展有限公司 | Auxiliary secret-level setting method and device |
| CN105681046A (en) * | 2016-02-29 | 2016-06-15 | 郑州悉知信息科技股份有限公司 | UGC fingerprint signature determination method and device as well as UGC deduplication method and device |
| CN105844214B (en) * | 2016-03-02 | 2019-06-21 | 华南理工大学 | An information fingerprint extraction method based on multi-path depth coding in bit space |
| CN105844214A (en) * | 2016-03-02 | 2016-08-10 | 华南理工大学 | Multi-path depth encoded information fingerprint extraction method based on bit space |
| CN105844118B (en) * | 2016-04-15 | 2020-02-21 | 量子创新(北京)信息技术有限公司 | Method and system for data leakage protection |
| CN105893859A (en) * | 2016-04-15 | 2016-08-24 | 宝利九章(北京)数据技术有限公司 | Data leakage prevention method and system |
| CN105955978A (en) * | 2016-04-15 | 2016-09-21 | 宝利九章(北京)数据技术有限公司 | Method and system for data leakage protection |
| CN105955978B (en) * | 2016-04-15 | 2019-07-02 | 宝利九章(北京)数据技术有限公司 | Method and system for leakage prevention |
| CN105956482A (en) * | 2016-04-15 | 2016-09-21 | 宝利九章(北京)数据技术有限公司 | Method and system for data leakage protection |
| CN105844118A (en) * | 2016-04-15 | 2016-08-10 | 宝利九章(北京)数据技术有限公司 | Methods and system for preventing data leakage |
| CN105956482B (en) * | 2016-04-15 | 2019-06-04 | 宝利九章(北京)数据技术有限公司 | Method and system for leakage prevention |
| CN105893859B (en) * | 2016-04-15 | 2019-05-03 | 宝利九章(北京)数据技术有限公司 | Method and system for leakage prevention |
| CN106649257A (en) * | 2016-09-21 | 2017-05-10 | 联动优势科技有限公司 | Semantic section conversion method and device |
| CN106649257B (en) * | 2016-09-21 | 2019-06-18 | 联动优势科技有限公司 | A semantic segment conversion method and device |
| CN106649214A (en) * | 2016-10-21 | 2017-05-10 | 天津海量信息技术股份有限公司 | Internet information content similarity definition method |
| CN106873964A (en) * | 2016-12-23 | 2017-06-20 | 浙江工业大学 | A kind of improved SimHash detection method of code similarities |
| CN107368472B (en) * | 2017-07-26 | 2021-01-05 | 成都科来软件有限公司 | Storage method of document analysis result capable of being iteratively optimized |
| CN107368472A (en) * | 2017-07-26 | 2017-11-21 | 成都科来软件有限公司 | It is a kind of can iteration optimization document analysis result store method |
| CN110019642A (en) * | 2017-08-06 | 2019-07-16 | 北京国双科技有限公司 | A kind of Similar Text detection method and device |
| CN108282328A (en) * | 2018-02-02 | 2018-07-13 | 沈阳航空航天大学 | A kind of ciphertext statistical method based on homomorphic cryptography |
| CN109145080A (en) * | 2018-07-26 | 2019-01-04 | 新华三信息安全技术有限公司 | A kind of text fingerprints preparation method and device |
| CN109145080B (en) * | 2018-07-26 | 2021-01-01 | 新华三信息安全技术有限公司 | Text fingerprint obtaining method and device |
| CN110891010A (en) * | 2018-09-05 | 2020-03-17 | 百度在线网络技术(北京)有限公司 | Method and apparatus for transmitting information |
| CN110134761A (en) * | 2019-04-16 | 2019-08-16 | 深圳壹账通智能科技有限公司 | Judgment document information retrieval method, device, computer equipment and storage medium |
| CN112733523A (en) * | 2020-12-30 | 2021-04-30 | 深信服科技股份有限公司 | Document sending method, device, equipment and storage medium |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN103425639A (en) | Similar information identifying method based on information fingerprints | |
| CN103279478B (en) | A kind of based on distributed mutual information file characteristics extracting method | |
| CN110046665A (en) | Based on isolated two abnormal classification point detecting method of forest, information data processing terminal | |
| CN103123618B (en) | Text similarity acquisition methods and device | |
| CN102081598B (en) | Method for detecting duplicated texts | |
| CN104217222B (en) | A kind of image matching method represented based on stochastical sampling Hash | |
| CN103744835A (en) | Text keyword extracting method based on subject model | |
| CN103679012A (en) | Clustering method and device of portable execute (PE) files | |
| CN105677661A (en) | Method for detecting repetition data of social media | |
| CN105574156B (en) | Text Clustering Method, device and calculating equipment | |
| CN106909575B (en) | Text clustering method and device | |
| CN104636319A (en) | Text duplicate removal method and device | |
| CN103336890A (en) | Method for quickly computing similarity of software | |
| CN104516862A (en) | Method and system for selecting and reading coded format of target document | |
| CN105824825A (en) | Sensitive data identifying method and apparatus | |
| CN105550253B (en) | Method and device for acquiring type relationship | |
| CN103049263A (en) | Document classification method based on similarity | |
| CN115941281A (en) | An abnormal network traffic detection method based on bidirectional temporal convolutional neural network and multi-head self-attention mechanism | |
| CN106202007B (en) | A Method for Evaluating the Similarity of MATLAB Program Files | |
| CN105045600A (en) | Parallel sorting method of multiple groups of ordered sequences | |
| CN104751459B (en) | Multi-dimensional feature similarity measuring optimizing method and image matching method | |
| CN104778202B (en) | The analysis method and system of event evolutionary process based on keyword | |
| CN103246640B (en) | A kind of method and device detecting repeated text | |
| CN112861505B (en) | Repeatability detection method, device and electronic equipment | |
| Tang et al. | An optimization algorithm of Chinese word segmentation based on dictionary |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20131204 |