WO2006122086A3 - Matching engine with signature generation and relevance detection - Google Patents
- Publication number
- WO2006122086A3 (PCT/US2006/017846)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- token
- document
- text
- signature generation
- matching engine
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A system and a method generate at least one signature associated with a document. In one embodiment, a document comprising text is received and parsed to generate a token set. The token set includes a plurality of tokens. Each token corresponds to text in the document that is separated by a predefined character characteristic. A score is calculated for each token in the token set based on the frequency and distribution of the text in the document. Each token is then ranked based on the calculated score. A subset of the ranked tokens is selected, and a signature is generated for each occurrence of the selected tokens. The selected list of signatures is then output.
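The pipeline described in the abstract (parse, score, rank, select, sign) can be illustrated with a minimal Python sketch. The abstract does not give the scoring formula, the separator rule, or the signature function, so everything concrete here is an assumption made for illustration: tokens are split on non-alphanumeric characters, the score multiplies frequency by a simple spread factor, and each signature is a SHA-1 digest of a fixed-width window of text starting at the token occurrence. The names `tokenize`, `score_tokens`, `generate_signatures`, `top_n`, and `window` are hypothetical and do not come from the patent.

```python
import hashlib
import re


def tokenize(text):
    """Return (token, offset) pairs; splitting on non-alphanumerics is an assumption."""
    return [(m.group().lower(), m.start()) for m in re.finditer(r"[A-Za-z0-9]+", text)]


def score_tokens(tokens, doc_length):
    """Score each token from its frequency and how widely it is spread (hypothetical formula)."""
    positions = {}
    for tok, pos in tokens:
        positions.setdefault(tok, []).append(pos)
    scores = {}
    for tok, pos_list in positions.items():
        freq = len(pos_list)
        # Fraction of the document between the first and last occurrence.
        spread = (pos_list[-1] - pos_list[0]) / doc_length if doc_length else 0.0
        scores[tok] = freq * (1.0 + spread)
    return scores, positions


def generate_signatures(text, top_n=2, window=16):
    """Rank tokens by score, keep the top_n, and sign every occurrence of each."""
    scores, positions = score_tokens(tokenize(text), len(text))
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    signatures = []
    for tok in ranked[:top_n]:
        for pos in positions[tok]:
            # Assumed signature: hash of the text window starting at the occurrence.
            context = text[pos:pos + window]
            digest = hashlib.sha1(context.encode("utf-8")).hexdigest()
            signatures.append((tok, pos, digest))
    return signatures


if __name__ == "__main__":
    for tok, pos, digest in generate_signatures("alpha beta alpha gamma beta alpha"):
        print(tok, pos, digest[:12])
```

Hashing a window of surrounding text, rather than the token alone, makes each occurrence yield a distinct signature, which matches the abstract's "a signature is generated for each occurrence" only under the assumptions stated above.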
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2006800227288A CN101248433B (en) | 2005-05-09 | 2006-05-08 | Matching engine with signature generation and relevance detection |
| JP2008511259A JP5072832B2 (en) | 2005-05-09 | 2006-05-08 | Signature generation and matching engine with relevance |
Applications Claiming Priority (6)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US67931405P | 2005-05-09 | 2005-05-09 | |
| US60/679,314 | 2005-05-09 | | |
| US11/361,340 US7516130B2 (en) | 2005-05-09 | 2006-02-24 | Matching engine with signature generation |
| US11/361,447 US7747642B2 (en) | 2005-05-09 | 2006-02-24 | Matching engine for querying relevant documents |
| US11/361,447 | 2006-02-24 | | |
| US11/361,340 | 2006-02-24 | | |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2006122086A2 (en) | 2006-11-16 |
| WO2006122086A3 (en) | 2007-03-29 |
Family
ID=37397221
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2006/017846 Ceased WO2006122086A2 (en) | 2005-05-09 | 2006-05-08 | Matching engine with signature generation and relevance detection |
Country Status (3)
| Country | Link |
|---|---|
| JP (1) | JP5072832B2 (en) |
| CN (1) | CN101248433B (en) |
| WO (1) | WO2006122086A2 (en) |
Families Citing this family (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7516130B2 (en) * | 2005-05-09 | 2009-04-07 | Trend Micro, Inc. | Matching engine with signature generation |
| US7860853B2 (en) * | 2007-02-14 | 2010-12-28 | Provilla, Inc. | Document matching engine using asymmetric signature generation |
| JP5372853B2 (en) | 2010-07-08 | 2013-12-18 | 株式会社日立製作所 | Digital sequence feature amount calculation method and digital sequence feature amount calculation apparatus |
| JP5617674B2 (en) * | 2011-02-14 | 2014-11-05 | 日本電気株式会社 | Inter-document similarity calculation apparatus, inter-document similarity calculation method, and inter-document similarity calculation program |
| CN107798637A (en) * | 2016-08-30 | 2018-03-13 | 北京国双科技有限公司 | The different acquisition methods and device for sentencing document of accomplice |
| CN112580108B (en) * | 2020-12-10 | 2024-04-19 | 深圳证券信息有限公司 | Signature and seal integrity verification method and computer equipment |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6493709B1 (en) * | 1998-07-31 | 2002-12-10 | The Regents Of The University Of California | Method and apparatus for digitally shredding similar documents within large document sets in a data processing environment |
| US6584470B2 (en) * | 2001-03-01 | 2003-06-24 | Intelliseek, Inc. | Multi-layered semiotic mechanism for answering natural language questions using document retrieval combined with information extraction |
| US20030172066A1 (en) * | 2002-01-22 | 2003-09-11 | International Business Machines Corporation | System and method for detecting duplicate and similar documents |
Family Cites Families (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5325091A (en) * | 1992-08-13 | 1994-06-28 | Xerox Corporation | Text-compression technique using frequency-ordered array of word-number mappers |
| JP2758826B2 (en) * | 1994-03-02 | 1998-05-28 | 株式会社リコー | Document search device |
| JPH09293079A (en) * | 1996-04-18 | 1997-11-11 | Internatl Business Mach Corp <Ibm> | Information retrieving method, information retrieving device and storage medium for storing information retrieving program |
| EP0961210A1 (en) * | 1998-05-29 | 1999-12-01 | Xerox Corporation | Signature file based semantic caching of queries |
| CN1369839A (en) * | 2001-02-16 | 2002-09-18 | 意蓝科技股份有限公司 | System and method for judging file relevance |
| JP2002269116A (en) * | 2001-03-13 | 2002-09-20 | Ricoh Co Ltd | Document search system and program |
| JP3719666B2 (en) * | 2001-07-12 | 2005-11-24 | 松下電器産業株式会社 | Document verification device |
2006
- 2006-05-08 JP JP2008511259A patent/JP5072832B2/en active Active
- 2006-05-08 CN CN2006800227288A patent/CN101248433B/en active Active
- 2006-05-08 WO PCT/US2006/017846 patent/WO2006122086A2/en not_active Ceased
Also Published As
| Publication number | Publication date |
|---|---|
| JP2008541272A (en) | 2008-11-20 |
| WO2006122086A2 (en) | 2006-11-16 |
| CN101248433B (en) | 2010-09-01 |
| CN101248433A (en) | 2008-08-20 |
| JP5072832B2 (en) | 2012-11-14 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Seidman | Authorship verification using the impostors method | |
| WO2007084836A3 (en) | Match-based employment system and method | |
| WO2007100916A3 (en) | Systems, methods, and media for outputting a dataset based upon anomaly detection | |
| WO2008101130A3 (en) | Music-based search engine | |
| WO2008033780A3 (en) | Recommending advertising key phrases | |
| WO2005070111A3 (en) | Content presentation and management system associating base content and relevant additional content | |
| WO2010019567A8 (en) | Signed digital documents | |
| WO2008027503A3 (en) | Semantic search engine | |
| WO2004086192A3 (en) | Systems and methods for interactive search query refinement | |
| WO2010008800A3 (en) | Query identification and association | |
| WO2012040674A3 (en) | Providing answers to questions including assembling answers from multiple document segments | |
| WO2011041205A3 (en) | A method and system for extraction | |
| WO2007033468A3 (en) | System and method configuring contextual based content with publisher content for display on a user interface | |
| WO2003091913A3 (en) | Optimisation of the design of a component | |
| WO2009136990A3 (en) | Algorithmically generated topic pages | |
| WO2009079274A3 (en) | Method and apparatus for processing a multi-step authentication sequence | |
| WO2006078794A3 (en) | Matching and ranking of sponsored search listings incorporating web search technology and web content | |
| WO2005013060A3 (en) | Method and apparatus for changing firmware in a gaming printer | |
| WO2008032169A3 (en) | Method and apparatus for improved text input | |
| EP1752906A3 (en) | Information processing apparatus and method | |
| WO2007131105A3 (en) | A method and system for spam, virus, and spyware scanning in a data network | |
| Na et al. | Improving opinion retrieval based on query-specific sentiment lexicon | |
| WO2006122086A3 (en) | Matching engine with signature generation and relevance detection | |
| Zhao et al. | Effective linguistic steganography detection | |
| WO2004088451A3 (en) | Apparatus and method for processing digital music files |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | WWE | Wipo information: entry into national phase | Ref document number: 200680022728.8; Country of ref document: CN |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
| | ENP | Entry into the national phase | Ref document number: 2008511259; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | NENP | Non-entry into the national phase | Ref country code: RU |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 06759366; Country of ref document: EP; Kind code of ref document: A2 |