
WO2000079426A1 - System and method for detecting text similarity over short passages - Google Patents

System and method for detecting text similarity over short passages

Info

Publication number
WO2000079426A1
WO2000079426A1 PCT/US2000/040238 US0040238W WO0079426A1 WO 2000079426 A1 WO2000079426 A1 WO 2000079426A1 US 0040238 W US0040238 W US 0040238W WO 0079426 A1 WO0079426 A1 WO 0079426A1
Authority
WO
WIPO (PCT)
Prior art keywords
primitive
features
common
normalizing
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2000/040238
Other languages
English (en)
Inventor
Judith L. Klavans
Eleazar Eskin
Vasileios Hatzivassiloglou
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Columbia University in the City of New York
Original Assignee
Columbia University in the City of New York
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Columbia University in the City of New York
Priority to EP00951059A (EP1203309A4)
Publication of WO2000079426A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing

Definitions

  • the present invention relates generally to natural language processing and more particularly relates to a system and method for determining the similarity of text in short passages.
  • a further problem with known techniques for detecting similarity is that the conventional notions of similarity which are applicable to large text samples, such as documents and large text segments, do not adequately measure similarity in small text segments.
  • Standard notions of similarity generally involve the creation of a vector or profile of characteristics of a text fragment and determine a conceptual distance between vectors on the basis of frequencies.
  • Features typically include stemmed words, although multi-word units and collocations also have been used.
  • Typological characteristics, such as thesaural features, have also been used to calculate features. The difference between vectors for one text unit (usually a query) and another text unit (usually a document) then determines closeness or similarity of the text units.
  • the text units are represented as vectors of sparse n-grams of word occurrences and learning is applied over those vectors. Though effective in the context of large document comparisons, a more fine-grained distinction for similarity measures is required to properly characterize the similarity of two small text segments.
  • a method for determining similarity in short text segments in accordance with the present invention includes the steps of determining common primitive features in the text segments, determining common composite features in the text segments and then calculating a similarity measure based upon the primitive and composite features.
  • the primitive features can be selected from the group including common single words, common noun phrases, synonyms, common semantic classes of verbs, and common proper nouns.
  • the composite features which represent relationships between and among the primitive features, can be selected from the group including primitive feature order restrictions, primitive feature distance restrictions, and primitive type restrictions.
  • the step of determining common primitive features can include the further steps of identifying common primitive features, assigning a value to the primitive features, and normalizing the feature values. Normalizing the values can include normalizing for text segment length and normalizing for the frequency of primitive feature occurrence. Similarly, determining composite features generally includes identifying the composite features, assigning a value to the composite features, and normalizing the feature values. Again, normalization of the feature values can include normalizing for text segment length and normalizing for the frequency of feature occurrence.
  • Figure 1 is a flow chart illustrating an overview of a present method for comparing small text segments
  • Figure 2 is a flow chart illustrating the step of defining similarity for small text segments in accordance with the present methods
  • Figure 3 is a flow chart illustrating the process of computing primitive features for use in detecting similarity in small text segments
  • Figure 4 is a flow chart illustrating the process of calculating composite features for use in detecting similarity of small text segments in accordance with the present methods
  • Figure 5 is a block diagram of a software system topology for determining similarity in small text segments in accordance with the present methods
  • Figure 6 is an illustration of exemplary short text segments
  • Figure 7 is a diagram illustrating a composite feature match between two of the short text segments provided in Figure 6 using a "same order" rule
  • Figure 8 is a diagram illustrating a composite feature match between two of the short text segments provided in Figure 6 using a "within distance" rule.
  • Figure 9 is a diagram illustrating a composite feature match between two of the short text segments provided in Figure 6 using a "primitive type" rule.
  • FIG. 1 is a flow chart illustrating an overview of the process used in the present invention for detecting similarity in small text segments.
  • a problem in the prior art is that the definition of similarity commonly used for large text segments, such as documents, is not sufficiently refined to provide an adequate measure of similarity when comparing small text segments.
  • small text segments refer to sentences, phrases and short paragraphs.
  • step 100 a definition of similarity for small text segments is provided. From this definition, the method proceeds to identify primitive features of the small text segments and determine feature values for the primitive features (step 105).
  • Primitive features are those which generally compare simple parts of speech and text, such as single words, word categories, or phrases such as noun phrases, synonyms, verb class and proper nouns.
  • the process can identify composite features of the short-text segments and determine composite feature values (step 110).
  • Composite features are those which compare relationships among two or more primitive features. Once primitive features and composite features have been identified and given an appropriate value, a machine learning algorithm is applied to classify small text segments as similar or not similar (step 115).
  • Figure 2 is a flow chart which illustrates the process of establishing an appropriate definition of similarity for small text segments.
  • two text units can be considered as similar if they share the same focus on a common concept, actor, object or action.
  • the common actor or object must perform or be subjected to the same action or be the subject of the same description. This is exemplified in the flow chart of Figure 2, where two small text segments are selected from a body of text and are analyzed. If the two text segments relate to a common concept (step 205), then further analysis is performed to see if the common concept relates to the same action (step 210) or relates to the same description (step 215).
  • Similar tests are performed to determine if the two text segments relate to a common actor (step 220) or to a common object (step 225). If there is no common concept, actor or object, the text segments are considered not similar (step 235). Similarly, for those text segments which do refer or relate to a common concept, actor or object, those segments will still be found not similar unless they also relate to a common action or involve the same description. Thus, for short text segments to be similar, they must contain a common concept, actor, or object which is also the subject of a common action or description.
  • the comparisons in steps 205, 220 and 225 can be the basis for primitive features 240. Those relationships between primitive features which are identified in steps 210, 215 can be referred to as composite features 245.
  • Although Figure 2 is illustrated as a sequential process, it represents a decision tree involved in a definition of similarity of two short text segments as applied in the present invention, which can also be performed in a largely parallel manner. For example, decisions 205, 220 and 225 can be performed concurrently, as can decisions 210 and 215. Using this definition of similarity for small text segments, a feature-based process can be employed which compares primitive and composite features of short text segments to determine if the definition is satisfied for two or more given input text segments.
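The decision tree of Figure 2 can be sketched as plain boolean logic. This is only an illustration: the `SharedFocus` record and the `segments_similar` function are hypothetical names, and the sketch assumes that some earlier analysis has already decided which concepts, actors or objects the two segments share.

```python
from dataclasses import dataclass


@dataclass
class SharedFocus:
    """One concept, actor or object that both segments refer to (hypothetical record)."""
    kind: str               # "concept", "actor" or "object" (steps 205, 220, 225)
    same_action: bool       # step 210: the focus is tied to the same action in both segments
    same_description: bool  # step 215: the focus is given the same description in both segments


def segments_similar(shared_foci):
    """Return True when at least one shared focus also shares an action or a description.

    With no shared focus at all the segments are not similar (step 235).
    """
    return any(f.same_action or f.same_description for f in shared_foci)
```

Because `any` evaluates each shared focus independently, the checks could equally be run concurrently, as the description notes.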
  • Figure 3 is a flow chart which illustrates a method for extracting and scaling primitive features in accordance with the present invention.
  • the text segments are compared for a level of commonality, including determining whether there is a common single word (step 305), a common noun phrase (step 310), whether two words in the phrases are synonyms (step 315), whether the phrases include verbs having a common semantic class (step 320), and whether a common proper noun can be found in the two phrases (step 325). If none of these conditions are satisfied for the applied small text segments, there is no primitive feature common to these two text segments (step 327). When a primitive feature has been identified, e.g., one of the conditions in steps 305 through 325 is satisfied, a feature value is assigned to that primitive feature.
  • the values which are assigned to the features are determined by a machine learning algorithm, such as RIPPER, which is trained using a suitable training corpus.
  • RIPPER is a widely used and effective rule induction system which is available from AT&T Laboratories and is described by Cohen in "Learning Trees and Rules with Set-Valued Features," Proceedings of the Fourteenth National Conference on Artificial Intelligence, American Association for Artificial Intelligence, 1996, which is incorporated by reference. It has been found that a subset of a corpus of 264 paragraphs which have been manually tagged by human readers as similar or not similar can be used to establish a feature rule set for RIPPER which is then suitable for assigning values to the features identified in the text segments.
  • the particular training corpus and learned rule set will generally vary depending on the desired application.
  • the values assigned will vary based on properties of the machine learning algorithm and training corpus.
  • these values can be normalized based on text length (step 335) and/or noted frequency of occurrence (step 340). Though normalization is optional, it is a desirable step to provide uniform and accurate results across varying types of text and length of text segments.
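As a minimal sketch of the simplest primitive feature of Figure 3, the common single word test of step 305 can be written as a set intersection over tokens. The tokenizer and the function names are assumptions made for illustration; noun phrases, synonyms, verb classes and proper nouns (steps 310 through 325) would need the linguistic tools described later and are not covered here.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (hyphenated tokens stay whole)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def common_words(segment_a, segment_b):
    """Return the words that occur in both segments (step 305), sorted for stable output."""
    return sorted(set(tokenize(segment_a)) & set(tokenize(segment_b)))


def has_primitive_feature(segment_a, segment_b):
    """False corresponds to step 327: no common primitive feature was found."""
    return bool(common_words(segment_a, segment_b))
```

In a fuller system each detected feature would then receive a learned value and, optionally, be normalized for text length and frequency of occurrence.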
  • Primitive features provide a baseline indication of similarity.
  • relationships among primitive features referred to as composite features, can also be evaluated. Referring to Figure 4, a method of evaluating composite features is illustrated.
  • Composite features are those features which identify relationships among primitive feature pairs.
  • composite features are defined by placing different forms of restrictions on participating primitive feature pairs.
  • the primitive features identified in each of the small text segments are applied to a test layer 400 where various feature relationships are evaluated.
  • the relationships illustrated in test layer 400 are exemplary in nature and are not intended to illustrate an exhaustive list of possible relationships.
  • a large number of relationships between and among primitive features can be used to establish composite features.
  • one type of feature relationship for composite features can be that the primitives occur in the same order in each of the text samples (step 405). This is illustrated by example in Figure 7.
  • Figure 6 provides three short text segments to be compared.
  • Figure 7 illustrates a match according to the "same order" composite feature rule.
  • primitive features are identified by shading and the relationships which form the composite features are illustrated by connecting lines.
  • the primitive features {two, contact} appear in the same order in text segments (a) and (b) of Figure 6.
  • Another possible relationship is that two pairs of primitive elements are required to occur within a certain distance in both text segments.
  • the maximum distance between the primitive elements which would satisfy the relationship can be a variable or a predetermined constant (step 410).
  • n is set to a value less than three.
  • Although the primitive features {contact, lost} do not appear in the same order, they occur within n words of each other (n < 3 in this case).
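The "same order" and "within distance" rules can be sketched over token lists as follows. This is an illustrative reading, not the patented implementation: it uses the first occurrence of each word, measures distance as the difference between token positions, and defaults to n = 2 as one value less than three. The example sentences in the usage note are invented and are not the text of Figure 6.

```python
def _first_index(tokens, word):
    """Position of the first occurrence of `word`, or None if it is absent."""
    return tokens.index(word) if word in tokens else None


def same_order(tokens_a, tokens_b, first, second):
    """True when the two primitives appear in the same relative order in both segments."""
    idx = [_first_index(t, w) for t in (tokens_a, tokens_b) for w in (first, second)]
    if None in idx:
        return False
    a_first, a_second, b_first, b_second = idx
    return (a_first < a_second) == (b_first < b_second)


def within_distance(tokens_a, tokens_b, first, second, n=2):
    """True when the two primitives lie at most n positions apart in both segments."""
    for tokens in (tokens_a, tokens_b):
        i, j = _first_index(tokens, first), _first_index(tokens, second)
        if i is None or j is None or abs(i - j) > n:
            return False
    return True
```

For the invented segments `"two crew lost radio contact".split()` and `"contact was lost by two pilots".split()`, the pair (contact, lost) fails `same_order` but satisfies `within_distance`, which mirrors the situation described for Figure 8.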
  • Yet another exemplary relationship can be that the two text segments include the same primitive feature types.
  • one primitive feature can be restricted to a simplex noun phrase while the other to a verb.
  • two noun phrases, one from each text unit, must match according to the rule for matching simplex noun phrases, and two verbs must match according to the applied rules of verb primitives (e.g., sharing the same semantic class).
  • This is illustrated in Figure 9, where the primitive feature "An OH-58 helicopter" is deemed a simplex noun phrase match with "the helicopter" and both phrases include a common verb, "lost".
  • feature values are assigned to those composite features identified (step 420).
  • the feature values are assigned by a machine learning algorithm, such as RIPPER, which has been trained on a suitable training corpus.
  • the feature values assigned to the composite feature can be normalized for text length and relative occurrence of the primitive feature or composite feature (steps 425, 430, respectively).
  • a machine learning algorithm is applied to determine a similarity value between the text segments (step 435).
  • the machine learning algorithm can perform a rule-based analysis to determine similarity. Alternatively, a simpler algorithm can be used to determine similarity by comparing the total feature value of the text segments being compared to a predetermined threshold value.
  • FIG. 5 is a block diagram of an exemplary software system for conducting the method described in connection with Figures 1-4.
  • the system is generally implemented in software for a general purpose computer, such as a personal computer or work station.
  • the system includes a main processing section 500.
  • One or more interface modules 510 are included for receiving text input for the text segments to be compared and for providing the text segments to the main processing section 500.
  • the text input can be provided by a number of sources, including but not limited to, computer readable memory, hard disks, optical disks, network databases, on-line sources, manual keyed input and the like. Based on the desired text source and input mechanism, one skilled in the art can provide appropriate text input interface module 510 hardware and software.
  • the main processing section 500 is also operatively coupled to a training corpus 515, which is generally stored in computer readable storage media.
  • the main processing section 500 is generally programmed in a structured manner which calls various subprograms, library routines, and the like to perform the various functions described in accordance with Figures 1-4.
  • the main processing section 500 can invoke the various subroutines sequentially (serial) or in a parallel, or batched, processing mode.
  • the received text is generally passed to a preprocessing routine 520.
  • the preprocessing routine cleans up the received text, such as by removing control characters from the text.
  • the preprocessing routine also performs part-of-speech (POS) tagging, using known techniques, such as are available in the ALEMBIC tool set, described by Aberdeen et al. in "MITRE: Description of the Alembic System as used for MUC-6," Proceedings of the Sixth Message
  • ALEMBIC provides a set of data and language processing tools which identify the various parts of speech present in the small text segments.
  • a noun phrase comparison subroutine 525 such as LinkIT
  • LinkIT can be employed to determine whether a common noun phrase is present in the applied text segments and for identifying simplex noun phrases and matching those that share the same noun head.
  • the LinkIT tool is described by N. Wacholder in "Simplex NPs Clustered by Head: A Method for Identifying Significant Topics in a Document," Proceedings of the Workshop on the Computational Treatment of Nominals, October 1998, which is hereby incorporated by reference in its entirety.
  • the noun comparison algorithm can also be used to match those nouns identified using the ALEMBIC toolset using various predetermined matching criteria. Variations on proper noun matching can include restricting the proper noun type to a person, place or organization. Such subcategories can also be extracted using ALEMBIC's named entity finder.
  • a word co-occurrence detection sub-routine 540 can be called by the main program 500. Variations of the word co-occurrence operation can restrict matching to cases where the parts of speech of the words also match, or relax the comparison to cases where only the word stems of the two words are identical.
  • a synonym detection algorithm 530 can be called by the main processing routine 500.
  • a lexical database such as WordNet®, as described by G. Miller in "WordNet, An On-Line Lexical Database," International Journal of Lexicography, Vol. 3, No.
  • WordNet provides sense information and places words in sets of synonyms (synsets). Words that appear in the same synset are generally considered matches. Variations on this feature can be used to restrict the words being compared to a specific part-of-speech class.
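The synset test can be illustrated with a toy lookup. The synonym sets below are hand-made placeholders, not WordNet data, and `are_synonyms` is a hypothetical helper; a real system would query the lexical database instead.

```python
# Hypothetical, hand-made synonym sets standing in for WordNet synsets.
TOY_SYNSETS = [
    {"helicopter", "chopper"},
    {"lose", "misplace"},
]


def are_synonyms(word_a, word_b, synsets=TOY_SYNSETS):
    """True when both words appear together in at least one synonym set."""
    return any(word_a in s and word_b in s for s in synsets)
```

Restricting the comparison to one part-of-speech class, as the description suggests, would amount to keeping a separate list of synonym sets per class.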
  • a verb classifier and comparator algorithm 535 can be operatively coupled to the main processing section 500 and called by the main program.
  • Semantic classes for verbs have been found to be useful for determining document types and text similarity. This is discussed, for example, in "The Role of Verbs in Document Analysis" by J. Klavans et al., Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics, 1998, which is hereby incorporated by reference in its entirety.
  • verbs which are found to have a common semantic class, e.g., communication, motion, agreement, argument, etc., are considered to match.
  • the program operating in main processing section 500 can also provide algorithms to normalize feature values for text lengths and relative occurrence of the primitive.
  • each feature value can be normalized by the size of the textual segments in the pair. For example, for a pair of textual segments A and B, the feature values assigned are divided by a normalization value, N:
  • N = Length(A) × Length(B) (1)
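A minimal sketch of the length normalization follows. It assumes that N is the product of the two segment lengths and that length is counted in whitespace-separated words; both are readings of Equation (1) rather than details confirmed by the text, and `length_normalize` is a hypothetical name.

```python
def length_normalize(feature_value, segment_a, segment_b):
    """Divide a feature value by N, taken here as Length(A) * Length(B) in words."""
    n = len(segment_a.split()) * len(segment_b.split())
    if n == 0:
        raise ValueError("both segments must contain at least one word")
    return feature_value / n
```

The effect is that the same raw match counts for less when it is found in a pair of long segments than in a pair of short ones.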
  • Normalization of feature values can also be based on the relative frequency of occurrence of each primitive feature. Such normalization is motivated by the general observation that infrequently matching primitive elements are likely to have a higher impact on similarity than primitives which match more frequently. Such normalization is similar to the document frequency component of the commonly employed TF*IDF calculation.
  • each primitive feature is associated with a value which is equal to the number of textual units in which the primitive appeared in the corpus. For a primitive element which compares single words, this is the number of text segments which contain that word in the corpus; for a noun phrase, this is the number of textual units that contain noun phrases that share the same head; and similarly for other primitive types. We multiply each feature's value by:
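The counting step described above, the number of textual units in which a primitive appears, can be sketched for single words as below. The multiplier itself does not survive in the text, so `rarity_weight` uses the conventional inverse document frequency form log(total units / units containing the word) purely as an assumption suggested by the TF*IDF comparison; it should not be read as the formula of the original.

```python
import math
from collections import Counter


def document_frequency(units):
    """Count, for each word, the number of textual units that contain it."""
    df = Counter()
    for unit in units:
        df.update(set(unit.lower().split()))  # a set, so each unit counts a word once
    return df


def rarity_weight(word, df, total_units):
    """ASSUMED multiplier: the usual IDF form, not taken from the source text."""
    if df[word] == 0:
        raise ValueError(f"{word!r} does not occur in the corpus")
    return math.log(total_units / df[word])
```

Under this assumed weighting a word that occurs in every unit gets weight zero, while a word seen in a single unit gets the largest weight, which matches the stated motivation that infrequently matching primitives matter more.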
  • the program in main processing section 500 generally employs a machine learning algorithm 545 to determine whether the text units match overall.
  • a suitable machine learning algorithm is RIPPER, as disclosed by Cohen in "Learning Trees and Rules with Set-Valued Features," Proceedings of the Fourteenth National Conference on Artificial Intelligence, American Association for Artificial Intelligence, 1996, which is incorporated by reference.
  • RIPPER is a widely-used and effective rule induction system. This RIPPER algorithm is trained over a corpus of manually marked pairs of text units contained in the training corpus 515.
  • a suitable corpus was constructed using a subset of the Topic Detection and Tracking (TDT) corpus developed by NIST and DARPA.
  • the TDT corpus is a collection of over 16,000 news articles from Reuters and CNN, where many of the articles have been manually grouped into 25 categories, each of which corresponds to a single event.
  • the selected corpus was formed using the Reuters articles in five of the twenty-five categories from randomly selected days.
  • the resulting training corpus 515 contained 30 related articles.
  • the 30 articles provided 264 paragraphs which were selected as the small text segments and resulted in 10,345 comparisons between segments.
  • a machine learning algorithm can add the total value of composite features found in the text segments and compare this value against a similarity threshold.
  • feature values can be predetermined based on human experience through the use of a look-up table.
  • all features can be given a binary value and the similarity comparison can be determined based on a simple accumulated count of detected primitive and composite features.
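The simpler alternatives to a learned rule set can be sketched as one function. The feature names, the look-up table and the choice of "greater than or equal to" for the threshold test are illustrative assumptions; the description only says that an accumulated value or count is compared against a threshold.

```python
def is_similar(matched_features, threshold, value_table=None):
    """Compare an accumulated feature score against a similarity threshold.

    matched_features: names of the primitive and composite features detected
    value_table: optional look-up table of predetermined feature values;
                 when omitted, every detected feature counts as 1 (binary values)
    """
    if value_table is None:
        score = len(matched_features)
    else:
        score = sum(value_table.get(name, 0.0) for name in matched_features)
    return score >= threshold
```

For example, `is_similar(["common_word", "same_order"], threshold=2)` uses the binary count, while passing a `value_table` switches to the predetermined values.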
  • the present methods, while evaluated on a corpus of English language documents, are not language specific and are generally applicable to any language. Of course, the individual subroutines may require some alteration to accommodate the varied constructions found in different languages.
  • the methods for determining similarity in small text segments described herein form an important component in larger systems, such as document archiving systems and multi-document summarization systems.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a system and method for determining similarity in short text segments. The method provides a definition of similarity that is appropriate for small text segments (100). Small text segments are compared to determine whether they share common primitive features such as words, noun phrases, synonyms, verbs with a common semantic class, proper nouns and the like (105). Once the primitive features have been identified, the small text segments are evaluated to determine whether composite features exist (110). These composite features are defined as predetermined relationships between primitive features. The common primitive and composite features are applied as inputs to a suitable machine learning algorithm, which is tested to determine a similarity measure from the primitive and composite features common to the text segments (115).
PCT/US2000/040238 1999-06-18 2000-06-19 System and method for detecting text similarity over short passages Ceased WO2000079426A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP00951059A EP1203309A4 (fr) 1999-06-18 2000-06-19 System and method for detecting text similarity over short passages

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US13993099P 1999-06-18 1999-06-18
US60/139,930 1999-06-18

Publications (1)

Publication Number Publication Date
WO2000079426A1 (fr) 2000-12-28

Family

ID=22488940

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/040238 Ceased WO2000079426A1 (fr) 1999-06-18 2000-06-19 System and method for detecting text similarity over short passages

Country Status (2)

Country Link
EP (1) EP1203309A4 (fr)
WO (1) WO2000079426A1 (fr)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1556784A4 (fr) * 2002-05-31 2006-06-07 Eli Abir Method and apparatus for word association
WO2009005492A1 (fr) * 2007-06-29 2009-01-08 United States Postal Service Systems and methods for validating an address
US7769778B2 (en) 2007-06-29 2010-08-03 United States Postal Service Systems and methods for validating an address
CN102279843A (zh) * 2010-06-13 2011-12-14 北京四维图新科技股份有限公司 处理短语数据的方法以及装置
CN103176962B (zh) * 2013-03-08 2015-11-04 深圳先进技术研究院 文本相似度的统计方法及系统
CN106649222A (zh) * 2016-12-13 2017-05-10 浙江网新恒天软件有限公司 基于语义分析与多重Simhash的文本近似重复检测方法
CN107562824A (zh) * 2017-08-21 2018-01-09 昆明理工大学 一种文本相似度检测方法
CN108846117A (zh) * 2018-06-26 2018-11-20 北京金堤科技有限公司 商业快讯的去重筛选方法及装置
US10282678B2 (en) 2015-11-18 2019-05-07 International Business Machines Corporation Automated similarity comparison of model answers versus question answering system output
US10628749B2 (en) 2015-11-17 2020-04-21 International Business Machines Corporation Automatically assessing question answering system performance across possible confidence values
US10657525B2 (en) 2017-06-27 2020-05-19 Kasisto, Inc. Method and apparatus for determining expense category distance between transactions via transaction signatures
CN111581947A (zh) * 2020-04-29 2020-08-25 华南理工大学 一种相似文本标定方法
WO2020197985A1 (fr) * 2019-03-22 2020-10-01 Servicenow, Inc. Determining semantic similarity of texts based on sub-sections thereof
US10943242B2 (en) 2008-02-21 2021-03-09 Micronotes, Inc. Interactive marketing system
KR102572106B1 (ko) * 2023-05-15 2023-08-29 (주) 애드캐리 마케팅 방법에 활용되는 문서 자동 변환 시스템

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5278980A (en) * 1991-08-16 1994-01-11 Xerox Corporation Iterative technique for phrase query formation and an information retrieval system employing same
US5675819A (en) * 1994-06-16 1997-10-07 Xerox Corporation Document information retrieval using global word co-occurrence patterns
US5794178A (en) * 1993-09-20 1998-08-11 Hnc Software, Inc. Visualization of information using graphical representations of context vector based relationships and attributes
US5893095A (en) * 1996-03-29 1999-04-06 Virage, Inc. Similarity engine for content-based retrieval of images
US5943669A (en) * 1996-11-25 1999-08-24 Fuji Xerox Co., Ltd. Document retrieval device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5835905A (en) * 1997-04-09 1998-11-10 Xerox Corporation System for predicting documents relevant to focus documents by spreading activation through network representations of a linked collection of documents

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP1203309A4 *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1556784A4 (fr) * 2002-05-31 2006-06-07 Eli Abir Method and apparatus for word association
WO2009005492A1 (fr) * 2007-06-29 2009-01-08 United States Postal Service Systems and methods for validating an address
US7769778B2 (en) 2007-06-29 2010-08-03 United States Postal Service Systems and methods for validating an address
US10943242B2 (en) 2008-02-21 2021-03-09 Micronotes, Inc. Interactive marketing system
CN102279843A (zh) * 2010-06-13 2011-12-14 北京四维图新科技股份有限公司 处理短语数据的方法以及装置
CN103176962B (zh) * 2013-03-08 2015-11-04 深圳先进技术研究院 文本相似度的统计方法及系统
US10628749B2 (en) 2015-11-17 2020-04-21 International Business Machines Corporation Automatically assessing question answering system performance across possible confidence values
US10282678B2 (en) 2015-11-18 2019-05-07 International Business Machines Corporation Automated similarity comparison of model answers versus question answering system output
CN106649222A (zh) * 2016-12-13 2017-05-10 浙江网新恒天软件有限公司 基于语义分析与多重Simhash的文本近似重复检测方法
US10657525B2 (en) 2017-06-27 2020-05-19 Kasisto, Inc. Method and apparatus for determining expense category distance between transactions via transaction signatures
CN107562824B (zh) * 2017-08-21 2020-10-27 昆明理工大学 一种文本相似度检测方法
CN107562824A (zh) * 2017-08-21 2018-01-09 昆明理工大学 一种文本相似度检测方法
CN108846117A (zh) * 2018-06-26 2018-11-20 北京金堤科技有限公司 商业快讯的去重筛选方法及装置
WO2020197985A1 (fr) * 2019-03-22 2020-10-01 Servicenow, Inc. Determining semantic similarity of texts based on sub-sections thereof
US11151325B2 (en) 2019-03-22 2021-10-19 Servicenow, Inc. Determining semantic similarity of texts based on sub-sections thereof
JP2022527060A (ja) * 2019-03-22 2022-05-30 サービスナウ, インコーポレイテッド テキストの意味的類似性をそのサブセクションに基づいて決定すること
AU2020248738B2 (en) * 2019-03-22 2023-09-14 Servicenow, Inc. Determining semantic similarity of texts based on sub-sections thereof
JP2024020653A (ja) * 2019-03-22 2024-02-14 サービスナウ, インコーポレイテッド テキストの意味的類似性をそのサブセクションに基づいて決定すること
US12299397B2 (en) 2019-03-22 2025-05-13 Servicenow, Inc. Determining semantic similarity of texts based on sub-sections thereof
JP7730880B2 (ja) 2019-03-22 2025-08-28 サービスナウ, インコーポレイテッド テキストの意味的類似性をそのサブセクションに基づいて決定すること
CN111581947A (zh) * 2020-04-29 2020-08-25 华南理工大学 一种相似文本标定方法
KR102572106B1 (ko) * 2023-05-15 2023-08-29 (주) 애드캐리 마케팅 방법에 활용되는 문서 자동 변환 시스템

Also Published As

Publication number Publication date
EP1203309A1 (fr) 2002-05-08
EP1203309A4 (fr) 2006-06-21

Similar Documents

Publication Publication Date Title
Munot et al. Comparative study of text summarization methods
US7707023B2 (en) Method of finding answers to questions
Al-Hashemi Text Summarization Extraction System (TSES) Using Extracted Keywords.
Kobayashi et al. Citation recommendation using distributed representation of discourse facets in scientific articles
EP2354967A1 (fr) Semantic textual analysis
EP1429258A1 (fr) Data processing method, data processing system, and program
Salvetti et al. Automatic opinion polarity classification of movie reviews
US20070129934A1 (en) Method and system of language detection
Muresan et al. Combining linguistic and machine learning techniques for email summarization
EP1203309A1 (fr) System and method for detecting text similarity over short passages
US20070016863A1 (en) Method and apparatus for extracting and structuring domain terms
KR20210119041A (ko) 군집 기반 중복문서 제거 장치 및 제거 방법
Dai et al. A new statistical formula for Chinese text segmentation incorporating contextual information
Takale et al. Measuring semantic similarity between words using web documents
Hussein Arabic document similarity analysis using n-grams and singular value decomposition
US7072827B1 (en) Morphological disambiguation
Mohemad et al. Performance analysis in text clustering using k-means and k-medoids algorithms for Malay crime documents
Agichtein et al. Predicting accuracy of extracting information from unstructured text collections
Shrestha Corpus-based methods for short text similarity
Selvaretnam et al. A linguistically driven framework for query expansion via grammatical constituent highlighting and role-based concept weighting
Alias et al. A Malay text corpus analysis for sentence compression using pattern-growth method
El-Shayeb et al. Comparative analysis of different text segmentation algorithms on arabic news stories
Sánchez et al. Discovering non-taxonomic relations from the Web
Omar Addressing the problem of coherence in automatic text summarization: A latent semantic analysis approach
Thambi et al. Graph based document model and its application in keyphrase extraction

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2000951059

Country of ref document: EP

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWP Wipo information: published in national office

Ref document number: 2000951059

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 10018108

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Ref document number: 2000951059

Country of ref document: EP