[go: up one dir, main page]

WO2008119297A1 - Method for matching character string based on characteristic parameters - Google Patents

Method for matching character string based on characteristic parameters Download PDF

Info

Publication number
WO2008119297A1
WO2008119297A1 PCT/CN2008/070603 CN2008070603W WO2008119297A1 WO 2008119297 A1 WO2008119297 A1 WO 2008119297A1 CN 2008070603 W CN2008070603 W CN 2008070603W WO 2008119297 A1 WO2008119297 A1 WO 2008119297A1
Authority
WO
WIPO (PCT)
Prior art keywords
array
text
stored
characters
characteristic
Prior art date
Application number
PCT/CN2008/070603
Other languages
French (fr)
Chinese (zh)
Inventor
Guangyao Ding
Original Assignee
Guangyao Ding
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangyao Ding filed Critical Guangyao Ding
Publication of WO2008119297A1 publication Critical patent/WO2008119297A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques

Definitions

  • the present invention relates to a string matching method, and in particular to a character string matching method based on a characteristic parameter. Background technique
  • Dictionary search is the most basic application of string matching technology.
  • the retrieval techniques of existing dictionary retrieval products fall into two categories: retrieval techniques based on exact matching, and retrieval techniques based on inexact matching.
  • Accurate matching retrieval techniques are not fault-tolerant; rather than exact matching retrieval techniques, which allow for a small number of errors in user input, and therefore have a relatively low application value.
  • Table 1 reflects the polymorphisms that describe the error phenomena of a particular problem based on error factors.
  • Text search terms are based on error factors ABCDEFGH ABCDEFGH exact match (ie substring matching):
  • the matching problem between the specific text and the search term has multiple qualitative representation methods based on the error factor, which reflects the polymorphism describing the same problem and is not convenient for classification processing.
  • An object of the present invention is to overcome the deficiencies in the prior art described above, and to provide a character string matching method based on characteristic parameters that is more reasonable in sorting retrieval results and has strong fault tolerance.
  • a string matching method based on characteristic parameters of the present invention given a text stored in a storage device, and a search term input by the input device, characterized in that the information processing device gives a given Text and search terms are matched based on string of characteristic parameters.
  • the steps are:
  • Step A calculating the matching relationship between the text and the characters in the search term
  • Step B) calculating a characteristic parameter according to the matching relationship between the text and the characters in the search term, the characteristic parameter includes a discrete number reflecting the number of discrete characters in which the characters of the search word appear in the text, and reflecting the intersection of each character of the search word in the text The number of intersections of the number of characters, reflecting the incomplete number of characters whose characters do not appear in the text;
  • Step C calculating the characteristic matching degree according to the characteristic parameter, the text and the length of the search term
  • Step D output characteristic matching degree.
  • ABCDEFGH ABCDEFGH exact match ie substring match: exact match (ie substring does not allow deletion, insertion, swapping):
  • ABCDEFGH CDEF exact match ie substring match: exact match (ie substrings only allow front and back deletes):
  • the three characteristic parameters used in the present invention namely, discreteness, cross-overness, incompleteness, and completely different concepts and properties, are mutually independent characteristics.
  • the error factor is the external manifestation of the three characteristics. Based on the string matching research ideas of the three characteristic parameters, the inherent law of the string matching problem is more scientifically revealed.
  • the three characteristic parameters adopted by the present invention are: discreteness, crossover, incompleteness, and mutual Different concepts and properties are mutually independent features.
  • the influence of the three characteristics on the characteristic matching degree can be separately considered, so that the calculated characteristic matching degree more accurately reflects the degree of similarity between the text and the search term. Therefore, according to the feature matching degree obtained by the invention, the sorting output of all the matching results of the dictionary words is more reasonable, the fault tolerance capability is greatly improved, and the defect of the comprehensive calculation based on the error factor distance calculation result is too general and is not conducive to the matching degree.
  • DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS The string matching method based on characteristic parameters of the present invention will be further described in detail below with reference to specific embodiments.
  • the electronic English dictionary is an electronic dictionary library composed of English words.
  • the electronic English dictionary search refers to: performing string matching operation on each word in the electronic English dictionary library, that is, the text T according to the input English word, that is, the search word P, and sorting the words satisfying the condition according to the matching result Output, user-friendly choice.
  • the core technology of the electronic English dictionary is string matching.
  • the matching result directly affects the final sorting position of all detected words, and is also an important indicator for measuring the retrieval effect of electronic English dictionary.
  • the array PT also stores the original position of each character in the search term, respectively Is the character sub-array PTc stored in the array PT and the position sub-array PTp stored in the array PT; b step), stable sorting text T
  • step f If the array WT comparison ends or the array PT comparison ends, then go to step f;
  • the value stored in the position W in the WTC is stored in the array POS, and the storage position is the value stored in the position PTp; if the position W in the WTp is stored The value > maximum position, the value stored in the position W in the WTp is stored in the maximum position; if the value stored in the position W in the WTp ⁇ the minimum position, the value stored in the position W in the WTp is stored in the minimum position; Increase by 1, position P increases by 1, and turns to step d;
  • This method of ordering text and search terms, and calculating the matching relationship of characters can improve the calculation speed.
  • the step of calculating the characteristic parameter according to the matching relationship between the text in the B step and the character in the search term is:
  • incomplete number (non-complete number + m-position P+1);
  • the foregoing step of determining the length of the maximum intervalable ascending sequence may be: (1), initialization
  • step (4) If the comparison data is the end marker, go to step (4);
  • comparison data is larger than the data at the LP position in the array LPOS, then LP is incremented by 1.
  • the comparison data is stored in the LP position of the array LPOS, and the next value in the temporary array is taken to the comparison data, and the step (2) is performed;
  • comparison data is smaller than the data of the LP position in the array LPOS, the second position in the array LPOS is searched backwards, the first data larger than the comparison data is searched, and the data is overwritten with the comparison data; the next one in the temporary array is taken.
  • the array POS is used to store the position of the matched characters in the search word in the matching process.
  • the characteristics of the data in the array P0S are: except for the value -1, other data are integers greater than 0, and are not mutually equal. Since the value -1 represents an unmatched character, the value -1 is not counted in the maximum separable ascending sequence.
  • the length of the maximum interval-associated sequence of the array P0S is: According to the size of the data in the array P0S, a maximum interval ascending sequence is found in the array POS, and the number of data of the sequence is the length of the maximum interval ascending sequence.
  • the subsequence a kl ak2 . . . is the maximum interval ascending sequence of the sequence ai a 2 . . . a n , and the number of elements m is its length.
  • the number of intersections when the text matches the search term can be found.
  • the role of the number of intersections is to match the degree of matching of discrete numbers and incomplete numbers to facilitate the sorting of texts to meet the user's retrieval requirements.
  • the time complexity of the algorithm is: O(mlo g2 (m .
  • the step of calculating the characteristic matching degree according to the characteristic parameter, the text, and the length of the search term of the C ⁇ is:
  • Step a) calculating the number of related characters of the actual matching characters of the search term and the text
  • Number of related characters 2 ⁇ (m-non-complete number);
  • Step b) calculate the influence factor of the characteristic parameters on the characteristic matching degree 1
  • Step c) calculate the influence factor of the characteristic parameters on the characteristic matching degree 2
  • Influence factor 2 qi x non-complete number + q 2 x cross number + q 3 x discrete number
  • Characteristic matching degree (related number of characters - impact factor 1) ⁇ (m+n+ influence factor 2);
  • qi , q 2 , and q 3 are the weight coefficients of the characteristic parameters in the characteristic matching degree, which are real numbers greater than or equal to zero and less than or equal to 2, and qi , q 2 , and q 3 are real numbers greater than or equal to zero, and the weight coefficient! ⁇ , qi , q 2 , q 3 , can choose different values according to different products and different applications, thus affecting the matching degree of the retrieved text and affecting the sorting of the retrieved text.
  • impact factors aims to comprehensively consider the influence weights of different characteristic parameters on the feature matching degree according to different products and different application environments, so that the ranking of search results is more in line with the user requirements of special environment applications.
  • the characteristic matching degree is a real number that satisfies zero or more and less than or equal to 1.
  • the target word is: "wonderful"
  • the target word is: "what"
  • the target word is: "error"
  • the input characters are: irror'
  • the target word is: "marvelous"
  • the input character is: "mvrilus"
  • the target word is: "marvelous"
  • the search results are as follows: 1 remarkable 2 muscular 3 marxist 4 marxism 5 luxurious characteristic matching degree: 0.366 0.317 0.312 0.312 0.302
  • a string matching method based on characteristic parameters of the present invention, the text T
  • step A the step of calculating the characteristic matching degree according to the characteristic parameter, the text and the length of the search term of the C ⁇ is:
  • Step a) calculating the number of related characters of the actual matching characters of the search term and the text
  • Number of related characters 2x (m-non-complete number);
  • Step b) calculate the influence factor of the characteristic parameters on the characteristic matching degree 1
  • Step c) calculate the influence factor of the characteristic parameters on the characteristic matching degree 2
  • Impact factor 2 incomplete number + q 2 x crossing number + q 3 x discrete number
  • Step d) calculate the sum of the cognitive weights of the matched characters in the text
  • characteristic matching degree [(related number of characters - influence factor 1) ⁇ (m + n + influence factor 2 ) ] X The sum of cognitive weights.
  • weight coefficients, qi , q 2 , q 3 are the weight coefficients of the characteristic parameters in the characteristic matching degree, which are real numbers greater than or equal to zero and less than or equal to 2, qi , q 2 , q 3 are real numbers greater than or equal to zero, weight coefficients, qi , q 2 , q 3 , can choose different values according to different products and different applications, thus affecting the matching degree of the retrieved text and affecting the sorting of the retrieved text.
  • the introduction of impact factors is based on different products and different application environments.
  • the characteristic parameters affect the feature matching degree, so that the sorting of the search results is more in line with the user requirements of the special environment application.
  • the above method of increasing the influence of cognitive weight on the degree of characteristic matching is an improvement on the calculation of the characteristic matching degree, which conforms to the actual cognitive difference of special symbols, and integrates psychology, behavioral science, linguistics, statistics, etc. Multidisciplinary cognitive thinking, especially suitable for dictionary retrieval.
  • the weight of characters that are easy to recognize is strengthened, and the weight of characters that are prone to errors is diluted, and the degree of matching of the characteristics is calculated, and the order of the detected results is more in line with the requirements of the user.
  • the main factors determining the cognitive weight are:
  • the electronic English dictionary is an electronic dictionary library consisting of English words and cognitive weights.
  • the method of the second embodiment is basically the same as the method of the first example. The only difference is that each character in the text T stores a cognitive weight correspondingly, which increases the influence of the cognitive weight on the matching degree of the computing characteristic.
  • the score is 0.4;
  • the score is 0.3;
  • the string matching method based on characteristic parameters of the invention has strong fault-tolerant retrieval capability and is suitable for dictionary retrieval.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for matching a character string based on characteristic parameters is provided. There are a text that is stored in a given storage device and a searching word. The characteristic parameters include a discrete number, a crossing number and a non-identical number. The method for matching the character string of the characteristic parameters comprises: calculating a matching relationship between the characters in the text and the searching word; calculating the characteristic parameters according to the matching relationship between the characters in the text and the searching word; calculating a characteristic matching degree according to the length of the characteristic parameters, the text, and the searching word; returning the characteristic matching degree.

Description

基于特性参数的字符串匹配方法 技术领域 本发明涉及一种字符串匹配方法, 具体来讲, 涉及一种基于特性参数的 字符串匹配方法。 背景技术  TECHNICAL FIELD The present invention relates to a string matching method, and in particular to a character string matching method based on a characteristic parameter. Background technique
 Say
词典检索是字符串匹配技术最为 1基本的应用。 现有的词典检索产品的检索 技术分为两类: 基于精确匹配的检索技术, 以及基于非精确匹配的检索技术。  Dictionary search is the most basic application of string matching technology. The retrieval techniques of existing dictionary retrieval products fall into two categories: retrieval techniques based on exact matching, and retrieval techniques based on inexact matching.
 Book
精确匹配检索技术不能容错; 而非精确匹配的检索技术, 允许用户输入中出现 少量的错误, 因此有较髙的应用价值。 Accurate matching retrieval techniques are not fault-tolerant; rather than exact matching retrieval techniques, which allow for a small number of errors in user input, and therefore have a relatively low application value.
四十年来, 国内外对非精确字符串匹配的方法研究一直采用基于错误因素 的距离计算, 最常用的是 Levenshtein距离以及 ED ( Edit Distance) 距离, 影 响距离结果的错误因素主要包括插入错误、 删除错误、 替换错误、 交换错误等。 这种基于错误因素距离计算方法, 存在一些固有问题, 造成了词典检索结果过 于笼统且容错能力有限, 问题主要体现在以下几点:  Forty years, domestic and foreign research on methods of inexact string matching has always used distance calculation based on error factors. The most commonly used distances are Levenshtein distance and ED (Edit Distance). The wrong factors affecting distance results mainly include insertion errors and deletion. Errors, replacement errors, exchange errors, etc. This method based on the error factor distance calculation has some inherent problems, which results in the dictionary search results being too general and limited in fault tolerance. The problems are mainly reflected in the following points:
1 )、 现有的基于错误因素距离计算的非精确字符串匹配研究思路, 是一种 基于问题现象的研究思路, 如插入、 删除、 替换、 交换、 反向错误等。 这些错 误现象并非完全独立、 可以严格界定, 呈现出多样化特征, 是典型的问题现象。 例如, 本质上可以用一个插入错误和一个删除错误定性地表示一个替换错误, 用一个插入错误和一个删除错误定性地表示一个交换错误。 因此, 某些错误因 素并非独立概念, 字符串匹配至今没有形成科学的分类体系, 这是其中的重要 原因之一。  1) The existing research method of inexact string matching based on error factor distance calculation is a research idea based on problem phenomenon, such as insertion, deletion, replacement, exchange, reverse error and so on. These errors are not completely independent, can be strictly defined, and exhibit diverse characteristics, which are typical problem phenomena. For example, it is essentially possible to qualitatively represent a replacement error with an insertion error and a deletion error, and qualitatively indicate an exchange error with an insertion error and a deletion error. Therefore, some error factors are not independent concepts, and string matching has not yet formed a scientific classification system, which is one of the important reasons.
2 )、 现有的基于错误因素描述字符串匹配问题的多态性, 直接影响到字符 串匹配结果以及检索结果的排序。 表 1 反映出基于错误因素对特定问题的错误 现象进行描述的多态性。 文本 检索词 基于错误因素描述 ABCDEFGH ABCDEFGH 精确匹配 (即子串匹配): 2), the existing polymorphism based on the error factor to describe the string matching problem directly affects the string matching result and the sorting of the retrieval result. Table 1 reflects the polymorphisms that describe the error phenomena of a particular problem based on error factors. Text search terms are based on error factors ABCDEFGH ABCDEFGH exact match (ie substring matching):
不允许删除、 插入、 交换等错误  Delete, insert, swap, etc. errors are not allowed
ABCDEFGH CDEF 精确匹配 (即子串匹配):  ABCDEFGH CDEF exact match (ie substring matching):
允许前面、 后面的删除  Allow front and back deletion
ABCDEFGH CDF 非精确匹配:  ABCDEFGH CDF non-exact match:
1、 存在一个删除 (E); 或  1. There is a deletion (E); or
2、 存在若干前面、 后面删除,  2, there are a number of front and back deletions,
并且存在一个中间删除 (E); 或  And there is an intermediate deletion (E); or
3、 存在一个插入 (F) ; 或  3. There is an insertion (F); or
4、 存在一个替换 (E,F);4, there is a replacement (E, F) ; etc.
ABCDEFGH CEDF 非精确匹配:  ABCDEFGH CEDF non-exact match:
1、 存在一个交换 (DE, ED); 或  1. There is an exchange (DE, ED); or
2、 存在两个替换 (D,E),(E,D);2. There are two substitutions (D, E), (E, D) ; or
3、 存在一个插入 (D)和一个删除 (D); 等  3. There is an insert (D) and a delete (D);
ABCDEFGH CEFD 非精确匹配:  ABCDEFGH CEFD non-exact match:
1、 存在一个删除 (D)和一个插入 (D); 或 1. There is a delete (D) and an insert (D); or
2、 存在两个插入 (; C D); 或 2, there are two inserts (; C D); or
3、 存在两个插入 (E F); 等  3. There are two insertions (E F); etc.
ABCDEFGH ACEFXD 非精确匹配:  ABCDEFGH ACEFXD Inexact Match:
1、 存在两个删除 (B) D)和两个  1, there are two deletes (B) D) and two
插入 (X) ,(D); 或  Insert (X), (D); or
2、 存在两个删除 (B) ,(D)和两个  2, there are two deletes (B), (D) and two
替换 (G,X),(H,D) ; 等  Replace (G, X), (H, D); etc.
表 1  Table 1
表 1 中, 忽略距离计算的量化影响, 基于错误因素对特定文本与检索词的 匹配问题, 存在多种基于错误因素的定性表示方法, 反映出描述同一问题的多 态性, 不便于分类处理。  In Table 1, the quantitative influence of the distance calculation is ignored. Based on the error factor, the matching problem between the specific text and the search term has multiple qualitative representation methods based on the error factor, which reflects the polymorphism describing the same problem and is not convenient for classification processing.
3 )、 基于错误因素距离计算的非精确字符串匹配方法, 由于通过距离计算 对各种错误因素进行统一量化处理, 如 ED(Edit Distance)距离计算, 使得匹配中 不同错误因素的性质模糊化, 距离反映出的匹配结果过于笼统。 例如距离为 2, 即可表示含有两个插入错误, 也可表示含有 2个删除错误, 或者两个替换错误, 或者 2 个混合的错误。 而错误因素概念的非独立性, 以及匹配中的错误性质的 不确定性, 又使得不能根据错误因素将匹配状况进一步细化表示。 因此, 词典 检索中, 在计算匹配度时, 缺少准确的匹配状况的更为细致的参数依据, 不利 于检出结果的合理排序。 3) Inaccurate string matching method based on error factor distance calculation, because the various error factors are uniformly quantized by distance calculation, such as ED (Edit Distance) distance calculation, so that matching The nature of different error factors is blurred, and the matching results reflected by distance are too general. For example, a distance of 2 means that there are two insertion errors, or two deletion errors, or two replacement errors, or two mixed errors. The non-independence of the concept of error factors and the uncertainty of the nature of the errors in the matching make it impossible to further refine the matching situation according to the wrong factors. Therefore, in the dictionary search, when calculating the matching degree, the more detailed parameter basis of the accurate matching condition is lacking, which is not conducive to the reasonable sorting of the detected results.
4)、 现有的词典检索, 很少从心理学、 认知学、 语言学、 行为学的角度, 讨 论对词典检索的影响。 实际上, 每一个字符, 根据在单词中的位置、 发音、 视 觉等因素, 存在不同程度的认知差异, 有些字符容易记住, 而有些字符则不易 记住, 或者随着时间的推移而逐渐淡化, 这也是造成非精确输入的主要原因。 因此检索中, 应考虑各字符在单词中的认知差异对词典检索结果的影响。  4) Existing dictionary searches rarely discuss the impact of dictionary search from the perspectives of psychology, cognition, linguistics, and behavior. In fact, each character has different levels of cognitive differences depending on factors such as position, pronunciation, and visual in the word. Some characters are easy to remember, while others are difficult to remember, or gradually evolve over time. Desalination, which is also the main cause of inaccurate input. Therefore, in the search, the influence of the cognitive difference of each character in the word on the dictionary search result should be considered.
以上问题, 呈现出字符串匹配基础体系的不健全、 不完善, 直接影响到词 典检索结果的合理排序, 容错能力有限, 匹配方法的多样性, 但却难以展开综 合应用, 亟待解决。 发明内容 本发明的目的在于克服上述现有技术中的不足, 提供一种检索结果排序更 为合理、 具有很强的容错能力的基于特性参数的字符串匹配方法。  The above problems show that the string matching basic system is not perfect and imperfect, which directly affects the reasonable ordering of the dictionary search results, the fault tolerance is limited, and the matching methods are diverse, but it is difficult to carry out comprehensive application, which needs to be solved urgently. SUMMARY OF THE INVENTION An object of the present invention is to overcome the deficiencies in the prior art described above, and to provide a character string matching method based on characteristic parameters that is more reasonable in sorting retrieval results and has strong fault tolerance.
为实现上述发明目的, 本发明的一种基于特性参数的字符串匹配方法, 给 定存储于存储设备中的一个文本, 以及输入设备输入的检索词, 其特征在于, 信息处理设备对给定的文本与检索词进行基于特性参数的字符串匹配, 步骤为: In order to achieve the above object, a string matching method based on characteristic parameters of the present invention, given a text stored in a storage device, and a search term input by the input device, characterized in that the information processing device gives a given Text and search terms are matched based on string of characteristic parameters. The steps are:
A步)、 计算文本与检索词中字符的匹配关系; Step A), calculating the matching relationship between the text and the characters in the search term;
B步)、 根据文本与检索词中字符的匹配关系计算特性参数, 特性参数包括 反映检索词各字符出现在文本中的离散的字符数的离散数, 反映检索词各字符 出现在文本中的交叉的字符数的交叉数, 反映检索词各字符未出现在文本中的 字符数的非完全数;  Step B), calculating a characteristic parameter according to the matching relationship between the text and the characters in the search term, the characteristic parameter includes a discrete number reflecting the number of discrete characters in which the characters of the search word appear in the text, and reflecting the intersection of each character of the search word in the text The number of intersections of the number of characters, reflecting the incomplete number of characters whose characters do not appear in the text;
C步)、 根据特性参数、 文本与检索词的长度计算特性匹配度;  Step C), calculating the characteristic matching degree according to the characteristic parameter, the text and the length of the search term;
D步)、 输出特性匹配度。  Step D), output characteristic matching degree.
与现有技术相比, 本发明的有益效果是: 一、通过表 2,现有基于错误因素及本发明基于三种特性的描述差异表的示 例比较, 得到清楚的说明。 Compared with the prior art, the beneficial effects of the present invention are: 1. Through Table 2, an example comparison based on the error factor and the description difference table based on the three characteristics of the present invention is clearly illustrated.
文本 检索词 现有基于错误因素的描 本发明基于三种特性 述 的描述 Text search term existing description based on error factors The invention is based on the description of three characteristics
ABCDEFGH ABCDEFGH 精确匹配(即子串匹配): 精确匹配 (即子串匹 不允许删除、插入、交换 配):  ABCDEFGH ABCDEFGH exact match (ie substring match): exact match (ie substring does not allow deletion, insertion, swapping):
等错误 不允许离散、 交叉、 非完全  Equal error does not allow discrete, cross, incomplete
ABCDEFGH CDEF 精确匹配(即子串匹配): 精确匹配 (即子串匹 只允许前面、后面的删除 配):  ABCDEFGH CDEF exact match (ie substring match): exact match (ie substrings only allow front and back deletes):
不允许离散、 交叉、 非完全  Discrete, cross, incomplete
ABCDEFGH CDF 非精确匹配: 离散匹配:  ABCDEFGH CDF Inexact Match: Discrete Match:
1、 存在一个删除 (E); 或 只存在一个离散 (E) 1. There is a deletion (E); or there is only one discrete (E)
2、 存在若干前面、 后面 2, there are a number of front, back
删除并且存在一个中间  Delete and exist in the middle
删除; 或  Delete; or
3、存在一个插入(F) ; 或  3. There is an insertion (F); or
4、 存在一个替换 (E,F);  4. There is a replacement (E, F);
ABCDEFGH CEDF 非精确匹配: 交叉匹配: ABCDEFGH CEDF Inexact Match: Cross Match:
1、 存在交换 (DE, ED) ; 只存在一个交叉 (D) 或  1. There is exchange (DE, ED); there is only one intersection (D) or
2 、 存在两个替换  2, there are two replacements
(D,E),(E,D);  (D, E), (E, D);
3、 存在插入 (D)和删除  3, there is insertion (D) and delete
(D); 等  (D); etc.
ABCDEFGH CEFD 非精确匹配: 交叉匹配:  ABCDEFGH CEFD Inexact Match: Cross Match:
1、 存在删除 (D)和插入 只存在一个交叉 (D) (D); 或 1. There is only one intersection (D) in the presence of deletion (D) and insertion. (D); or
2、存在两个插入 (C^D);  2. There are two insertions (C^D);
 Or
3、 存在两个插入 (Ε),0  3, there are two inserts (Ε), 0
 Inch
ABCDEFGH ACEFXD 非精确匹配: 离散交叉非完全匹  ABCDEFGH ACEFXD Inexact Match: Discrete Cross Non-Complete
1、 存在两个删除 (B UD) 配:  1, there are two deletes (B UD) with:
和两个插入 (X) ,(D); 或 存在一个离散 (B)、一 And two inserts (X), (D); or one discrete (B), one
2、 在两个删除 (B) ,(D) 个交叉 (D)、一个非完 和两个替换(G,X) , 全 (X) 2, delete in two (B), (D) cross (D), one non-finish and two replacement (G, X), all (X)
(H,D) ; 等 (H, D) ; etc.
Figure imgf000006_0001
Figure imgf000006_0001
可见,现有基于错误因素对特定文本与检索词的字符串匹配问题,存在多种 基于错误因素的描述方法,反映出描述同一问题的多态性,不便于细致的分类处 理。而采用本发明的三种特性对同一问题只存在一种描述方法,准确地反映了文 本与检索词的字符对应。  It can be seen that there are a variety of error-based description methods based on error factors for the matching of specific texts and search terms, which reflects the polymorphism describing the same problem, and is not convenient for detailed classification processing. However, using the three characteristics of the present invention, there is only one description method for the same problem, which accurately reflects the character correspondence between the text and the search term.
表 2中关于两个子串匹配问题, 现有基于错误因素有两种不同的描述方法; 本发明对两个子串匹配问题, 只存在一种描述方法, 更为符合子串的定义。  For the two substring matching problems in Table 2, there are two different description methods based on the error factors. In the present invention, there are only one description method for the two substring matching problems, which is more consistent with the definition of substrings.
通过上表对比,还可以清楚地了解到离散特性与删除错误的差异,交叉特性 与交换错误的差异, 非完全性与插入错误的差异。  From the comparison of the above table, the difference between discrete characteristics and deletion errors, the difference between cross characteristics and exchange errors, and the difference between incompleteness and insertion errors can be clearly understood.
本发明采用的三种特性参数, 即离散性、交叉性、 非完全性, 互相间具有完 全不同的概念和性质, 是互为独立的特性。错误因素是三种特性的外在表现, 基 于三种特性参数的字符串匹配研究思路,更为科学地揭示了字符串匹配问题的内 在规律。  The three characteristic parameters used in the present invention, namely, discreteness, cross-overness, incompleteness, and completely different concepts and properties, are mutually independent characteristics. The error factor is the external manifestation of the three characteristics. Based on the string matching research ideas of the three characteristic parameters, the inherent law of the string matching problem is more scientifically revealed.
二、 现有的错误因素概念的非独立性, 以及匹配中的错误性质的不确定性, 使得不能根据错误因素将匹配状况进一步细化表示。 因此, 词典检索中, 在计 算匹配度时, 缺少准确的匹配状况的更为细致的参数依据, 不利于检出结果的 合理排序。  Second, the non-independence of the existing concept of error factors, and the uncertainty of the nature of the errors in the matching, make it impossible to further refine the matching situation according to the wrong factors. Therefore, in the dictionary search, when calculating the matching degree, the more detailed parameter basis of the accurate matching condition is lacking, which is not conducive to the reasonable sorting of the detected results.
而本发明采用的三种特性参数: 离散性、 交叉性、 非完全性, 互相间具有 完全不同的概念和性质, 是互为独立的特性。 通过三种特性参数计算特性匹配 度, 可以分别考虑三种特性对特性匹配度的影响, 使得计算出的特性匹配度更 精确地反映文本与检索词的相似程度。 因此, 依据本发明得到的特性匹配度, 对词典单词所有的匹配结果的排序输出更为合理, 容错能力大大提高, 克服了 基于错误因素距离计算结果过于笼统、 不利于匹配度的综合计算的缺陷。 具体实施方式 下面结合具体实施方式, 对本发明基于特性参数的字符串匹配方法作进一 步详细的说明。 The three characteristic parameters adopted by the present invention are: discreteness, crossover, incompleteness, and mutual Different concepts and properties are mutually independent features. By calculating the characteristic matching degree through three characteristic parameters, the influence of the three characteristics on the characteristic matching degree can be separately considered, so that the calculated characteristic matching degree more accurately reflects the degree of similarity between the text and the search term. Therefore, according to the feature matching degree obtained by the invention, the sorting output of all the matching results of the dictionary words is more reasonable, the fault tolerance capability is greatly improved, and the defect of the comprehensive calculation based on the error factor distance calculation result is too general and is not conducive to the matching degree. . DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS The string matching method based on characteristic parameters of the present invention will be further described in detail below with reference to specific embodiments.
电子英文词典是由英文单词构成的电子词典库。 电子英文词典检索是指: 根据输入的英文单词, 即检索词 P, 对电子英文词典库中的每一个单词, 即文本 T, 进行字符串匹配运算, 并根据匹配结果, 对满足条件的单词排序输出, 方便 用户的选择。  The electronic English dictionary is an electronic dictionary library composed of English words. The electronic English dictionary search refers to: performing string matching operation on each word in the electronic English dictionary library, that is, the text T according to the input English word, that is, the search word P, and sorting the words satisfying the condition according to the matching result Output, user-friendly choice.
电子英文词典的核心技术就是字符串匹配, 其匹配结果直接影响到所有检 出单词的最终排序位置, 也是衡量电子英文词典检索效果的重要指标。  The core technology of the electronic English dictionary is string matching. The matching result directly affects the final sorting position of all detected words, and is also an important indicator for measuring the retrieval effect of electronic English dictionary.
在本实施方式中, 给定存储于存储设备中的一个文本为1= "^ ·· tn", 以 及输入设备输入的检索词为 Ρ= "ΡΦ2〜ρω", 其中 ti 、 Pj ( l^ i ^n, l^ j ^m) 为字符, m、 n均大于等于 1, 所述 A歩的计算文本与检索词中字符的匹配关系 的具体歩骤为: In the present embodiment, a text stored in the storage device is 1 = "^ ·· t n ", and the search term input by the input device is Ρ = "ΡΦ 2 ρ ρ ω ", where ti , Pj ( l^ i ^n, l^ j ^m) is a character, m, n are greater than or equal to 1, the specific steps of the matching relationship between the calculated text of the A歩 and the characters in the search term are:
a步)、 稳定排序检索词 P  a step), stable sorting search term P
对检索词?= ^2... ... pm "中的所有字符, 进行稳定的升序排序, 并存储在 内存中数组 PT中, 数组 PT中同时存储了各字符在检索词中原来的位置, 分别 称为数组 PT中存储的字符子数组 PTc以及数组 PT中存储的位置子数组 PTp; b步)、 稳定排序文本 T For the search term? = ^ 2 ... ... p m "all characters in the stable ascending order, and stored in the memory array PT, the array PT also stores the original position of each character in the search term, respectively Is the character sub-array PTc stored in the array PT and the position sub-array PTp stored in the array PT; b step), stable sorting text T
对文本 T="tlt2... ... tn "中的所有字符, 进行稳定的升序排序, 并存储在内存 中数组 WT中, 数组 WT中同时存储了各字符在文本中原有的位置, 分别称为 数组 WT中存储的字符子数组 WTc以及数组 WT中存储的位置子数组 WTp; c步)、 参数初始化 All characters in the text T=" tl t 2 ... ... t n " are sorted in a stable ascending order and stored in the in-memory array WT. The array WT also stores the original characters in the text. Position, respectively called the character sub-array WTc stored in the array WT and the position sub-array WTp stored in the array WT; c step), parameter initialization
内存中数组 POS用于存储字符对应位置, 全部初始化为 -1, 非完全数 =0, 数组 WT的位置 W=l, 数组 PT的位置 P=l, 最大位置 =0, 最小位置 =n; The in-memory array POS is used to store the corresponding position of the character, all initialized to -1, non-complete number = 0, The position of the array WT is W=l, the position of the array PT is P=l, the maximum position=0, the minimum position=n;
d步)、 循环是否结束  d step), whether the loop ends
若数组 WT比较结束或者数组 PT比较结束, 则转 f步;  If the array WT comparison ends or the array PT comparison ends, then go to step f;
e步)、 比较  Step e), compare
根据 WTc中位置 W存储的字符与 PTc中位置 P存储的字符的比较情况, 分别进行以下情况处理:  According to the comparison between the characters stored in the position W in the WTc and the characters stored in the position P in the PTc, the following cases are respectively processed:
若 WTc中位置 W存储的字符 < PTc中位置 P存储的字符,则位置 W增加 1, 转 d步;  If the character stored in the position P of the character < PTc stored in the position W in the WTc is stored, the position W is incremented by 1, and the step d is performed;
若 WTp中位置 W存储的字符 >PTp中位置 Ρ存储的字符,则位置 Ρ增加 1, 非完全数增加 1, 转 d歩;  If the character stored in the position W in the WTp > PTp is stored in the character, the position Ρ is increased by 1, and the non-complete number is increased by 1, and d 歩;
若 WTc中位置 W存储的字符 = PTc中位置 P存储的字符, 则将 WTp中位 置 W存储的数值存储到数组 POS中, 其存储位置为 PTp中位置 Ρ存储的数值; 若 WTp中位置 W存储的数值 >最大位置, 则将 WTp中位置 W存储的数值存入 最大位置中; 若 WTp中位置 W存储的数值 <最小位置, 则将 WTp中位置 W存 储的数值存入最小位置中; 位置 W增加 1, 位置 P增加 1, 转 d步;  If the character stored in the position W in the WTC = the character stored in the position P in the PTc, the value stored in the position W in the WTp is stored in the array POS, and the storage position is the value stored in the position PTp; if the position W in the WTp is stored The value > maximum position, the value stored in the position W in the WTp is stored in the maximum position; if the value stored in the position W in the WTp < the minimum position, the value stored in the position W in the WTp is stored in the minimum position; Increase by 1, position P increases by 1, and turns to step d;
f歩)、 结束  f歩), end
得到代表文本与检索词中字符的匹配关系的数组 POS、 最大位置、 最小位 置、 位置 P以及非完全数。  Get an array POS, maximum position, minimum position, position P, and incomplete number that represent the matching relationship between the text and the characters in the search term.
这种有序化文本与检索词, 计算字符匹配关系的方法, 可提高计算速度。 时间复杂度为: 0(kX log2k), k=Max(m,n);  This method of ordering text and search terms, and calculating the matching relationship of characters can improve the calculation speed. The time complexity is: 0(kX log2k), k=Max(m,n);
在另一具体实施方式中, 在前一具体实施方式的基础上, 所述 B步的根据 文本与检索词中字符的匹配关系计算特性参数的步骤为:  In another specific embodiment, based on the previous embodiment, the step of calculating the characteristic parameter according to the matching relationship between the text in the B step and the character in the search term is:
a步)、 非完全数= (非完全数 +m-位置 P+1);  a step), incomplete number = (non-complete number + m-position P+1);
b步)、 离散数 =(最大位置-最小位置 +l-m+非完全数  Step b), discrete number = (maximum position - minimum position + l-m + non-complete number
C步)、 交叉数 =根据数组 POS结果进行交叉数计算。  Step C), number of intersections = cross-count calculation based on the POS result of the array.
前述交叉数计算的步骤可以为:  The steps of the foregoing cross number calculation may be:
( 1 )、 求最大可间隔升序序列的长度;  (1) Find the length of the maximum intervalable ascending sequence;
( 2 )、 交叉数 =m-非完全数-最大可间隔升序序列的长度。  (2), number of intersections = m - incomplete number - the length of the maximum intervalable ascending sequence.
前述求最大可间隔升序序列的长度的步骤可以为: ( 1 )、 初始化 The foregoing step of determining the length of the maximum intervalable ascending sequence may be: (1), initialization
依次将数组 POS中大于零的全部数值存入内存中的临时数组中, 临时数组 最后添加一个结束标志; 若临时数组中的元素个数等于 0,直接返回最大可间隔 升序序列的长度等于 0的结果; 若临时数组中的元素个数等于 1,直接返回最大 可间隔升序序列的长度等于 1的结果; 否则, 用内存中数组 LPOS存放最大可 间隔升序序列, 且数组 LPOS 的第一个位置的数值初始化为临时数组中第一个 数值; LP用于指示当前数组 LPOS处理的位置, 并初始化为 1 ; 取临时数组中 第二个数值到比较数据中;  In turn, all the values in the array POS greater than zero are stored in the temporary array in the memory, and the temporary array is finally added with an end flag; if the number of elements in the temporary array is equal to 0, the maximum interval of the ascending sequence can be directly returned to be equal to 0. Result; if the number of elements in the temporary array is equal to 1, the result of directly returning the maximum intervalable ascending sequence length is equal to 1; otherwise, the maximum interval ascending sequence is stored in the in-memory array LPOS, and the first position of the array LPOS The value is initialized to the first value in the temporary array; LP is used to indicate the position of the current array LPOS processing, and initialized to 1; take the second value in the temporary array to the comparison data;
(2)、 判断是否结束  (2), judge whether to end
若比较数据为结束标记, 转 (4) 步;  If the comparison data is the end marker, go to step (4);
(3 )、 根据比较进行两种情况处理  (3), according to the comparison, two cases are handled
若比较数据大于数组 LPOS中 LP位置的数据, 则 LP增加 1, 将比较数据 存储到数组 LPOS中 LP位置处,取临时数组中下一个数值到比较数据中,转(2) 步;  If the comparison data is larger than the data at the LP position in the array LPOS, then LP is incremented by 1. The comparison data is stored in the LP position of the array LPOS, and the next value in the temporary array is taken to the comparison data, and the step (2) is performed;
若比较数据小于数组 LPOS中 LP位置的数据, 则从数组 LPOS中第一个位 置向后进行折半査找, 搜索第一个大于比较数据的数据, 并用比较数据改写该 数据; 取临时数组中下一个数值到比较数据中, 转 (2) 歩;  If the comparison data is smaller than the data of the LP position in the array LPOS, the second position in the array LPOS is searched backwards, the first data larger than the comparison data is searched, and the data is overwritten with the comparison data; the next one in the temporary array is taken. Value to the comparison data, turn (2) 歩;
(4)、 最大可间隔升序序列的长度 = LP。  (4) The length of the maximum intervalable ascending sequence = LP.
数组 POS在匹配过程中, 用于存放检索词中已经匹配的字符在文本中的出 现位置, 数组 P0S中数据的特点是: 除数值 -1以外, 其它数据均为大于 0的整 数, 且互不相等。 由于数值 -1代表了未匹配的字符, 因此数值 -1不计入最大可 间隔升序序列中。  The array POS is used to store the position of the matched characters in the search word in the matching process. The characteristics of the data in the array P0S are: except for the value -1, other data are integers greater than 0, and are not mutually equal. Since the value -1 represents an unmatched character, the value -1 is not counted in the maximum separable ascending sequence.
数组 P0S的最大可间隔升序序列的长度是指: 按数组 P0S中数据的大小, 在数组 POS寻找出一个最大可间隔升序序列, 该序列的数据个数即为最大可间 隔升序序列的长度。  The length of the maximum interval-associated sequence of the array P0S is: According to the size of the data in the array P0S, a maximum interval ascending sequence is found in the array POS, and the number of data of the sequence is the length of the maximum interval ascending sequence.
最大可间隔升序序列以及长度的严格定义如下:  The maximum separable ascending sequence and the strict definition of length are as follows:
定义: 设任意序列 aia2... ... an ( a aj) ,每个元素可比较, 若存在最大的子 序歹 ij aklak2…… 满足 Definition: Let any sequence a ia2 ... a n ( a aj ) , each element can be compared, if there is the largest sub-sequence 歹 ij a kl ak2......
1、 l<kl<k2< <km<n且 2、 aki < ak2 <…… <akm 1, l<kl<k2<<km<n and 2, a k i < ak2 <... <a km
则称子序列 aklak2...... 为序列 aia2... ... an的最大可间隔升序序列, 元数个 数 m为其长度。 Then, the subsequence a kl ak2 . . . is the maximum interval ascending sequence of the sequence ai a 2 . . . a n , and the number of elements m is its length.
例如 7, 8, 9, 1, 2, 6, 3, 4, 12  For example 7, 8, 9, 1, 2, 6, 3, 4, 12
最大可间隔升序序列: 1, 2, 3, 4,12;  Maximum interval ascending sequence: 1, 2, 3, 4, 12;
最大可间隔升序序列长度: 5  Maximum interval ascending sequence length: 5
通过最大可间隔升序序列的长度, 即可求出文本与检索词匹配时的交叉数。 交叉数的作用是配合离散数、 非完全数计算特性匹配度, 方便检出文本的排序, 满足用户的检索要求。  By the length of the maximum intervalable ascending sequence, the number of intersections when the text matches the search term can be found. The role of the number of intersections is to match the degree of matching of discrete numbers and incomplete numbers to facilitate the sorting of texts to meet the user's retrieval requirements.
该算法的时间复杂度为: O(mlog2(m 。 The time complexity of the algorithm is: O(mlo g2 (m .
在另一具体实施方式中, 在前述 A步、 B歩具体实施方式的基础上, 所述 C歩的根据特性参数、 文本与检索词的长度计算特性匹配度的步骤为:  In another specific embodiment, based on the foregoing A step, the specific embodiment, the step of calculating the characteristic matching degree according to the characteristic parameter, the text, and the length of the search term of the C歩 is:
a步)、 计算检索词与文本的实际匹配字符的相关字符数  Step a), calculating the number of related characters of the actual matching characters of the search term and the text
相关字符数 =2χ (m-非完全数);  Number of related characters = 2χ (m-non-complete number);
b步)、 计算特性参数对特性匹配度的影响因子 1  Step b), calculate the influence factor of the characteristic parameters on the characteristic matching degree 1
影响因子 l= kl X交叉数; Impact factor l = k l X crossing number;
c步)、 计算特性参数对特性匹配度的影响因子 2  Step c), calculate the influence factor of the characteristic parameters on the characteristic matching degree 2
影响因子 2= qi x非完全数 + q2x交叉数 + q3 x离散数; Influence factor 2 = qi x non-complete number + q 2 x cross number + q 3 x discrete number;
d步)、 计算特性匹配度  d step), calculation characteristic matching degree
特性匹配度 = (相关字符数-影响因子 1 ) ÷ (m+n+影响因子 2);  Characteristic matching degree = (related number of characters - impact factor 1) ÷ (m+n+ influence factor 2);
其中, qi、 q2、 q3是各特性参数在特性匹配度中的权重系数, 为大于 等于零且小于等于 2的实数, qi、 q2、 q3为大于等于零的实数, 权重系数!^、 qi、 q2、 q3, 可根据不同产品、 不同应用场合选择不同的数值, 从而影响检索出的文 本的特性匹配度, 并影响检索出的文本的排序。 在一种具体应用中, 权重系数 ki qi, q2、 q3的值为 1^=2/3、 q尸 1、 q2=2/3、 q3=l/3。 Where qi , q 2 , and q 3 are the weight coefficients of the characteristic parameters in the characteristic matching degree, which are real numbers greater than or equal to zero and less than or equal to 2, and qi , q 2 , and q 3 are real numbers greater than or equal to zero, and the weight coefficient! ^, qi , q 2 , q 3 , can choose different values according to different products and different applications, thus affecting the matching degree of the retrieved text and affecting the sorting of the retrieved text. In a specific application, the weight coefficients ki qi, q 2 , q 3 have values of 1^=2/3, q corpse 1, q 2 =2/3, q 3 =l/3.
影响因子的引入, 目的在于根据不同产品, 不同应用环境, 综合考虑不同 特性参数对特性匹配度的影响权重, 从而使得检索结果的排序更为符合特殊环 境应用的用户要求。  The introduction of impact factors aims to comprehensively consider the influence weights of different characteristic parameters on the feature matching degree according to different products and different application environments, so that the ranking of search results is more in line with the user requirements of special environment applications.
在本实施例中, 该特性匹配度为满足大于等于零且小于等于 1的实数。 实例一 In this embodiment, the characteristic matching degree is a real number that satisfies zero or more and less than or equal to 1. Example one
下面是依据上述八、 B、 C步具体实施方式的方法具体设计的电子英文词典 库检索实例的检索结果与特点。  The following is the search results and characteristics of the electronic English dictionary search example designed according to the specific implementation methods of the above eight, B, and C steps.
本实例的电子英文词典库, 选择了六千个常用英语单词; 权重系数 ki、 qi、 q2、 q3选择为: 1^=2/3、 qi=l、 q2=2/3、 q3=l/3; 检索结果按计算出的特性匹配 度降序排列, 只输出前五个单词。 In the electronic English dictionary library of this example, six thousand common English words are selected; the weight coefficients ki, qi , q 2 , q 3 are selected as: 1^=2/3, qi =l, q 2 =2/3, q 3 = l/3; The search results are sorted in descending order of the calculated characteristic matching, and only the first five words are output.
1、 离散检索  1, discrete search
输入英文单词时可以任意省略单词中的字符。  You can arbitrarily omit characters from words when you type in English words.
例如: 目标单词为: "wonderful"  For example: The target word is: "wonderful"
输入字符为: "wdfl"  Enter the characters as: "wdfl"
检索结果为: 1 wonderful 2 handful 3 unfold 4 wind 5 windy 特性匹配度: 0.546 0.487 0.444 0.375 0.343 Search results for: 1 wonderful 2 handful 3 unfold 4 wind 5 windy Matching degree: 0.546 0.487 0.444 0.375 0.343
2、 交叉检索 2, cross search
输入英文单词时可以任意交叉单词中的字符。  You can arbitrarily cross the characters in a word when you type in English words.
例如: 目标单词为: "what"  For example: The target word is: "what"
输入字符为: "whta"  Enter the characters as: "whta"
检索结果为: l what 2 wheat 3 watch 4 hat 5 white  Search results for: l what 2 wheat 3 watch 4 hat 5 white
特性匹配度: 0.846 0.733 0.625 0.615 0.581  Characteristic matching degree: 0.846 0.733 0.625 0.615 0.581
3、 允许非完全字符  3, allow non-complete characters
输入英文单词时, 允许出现错误字符。  When you type an English word, the wrong character is allowed.
例如:目标单词为: "error"  For example: the target word is: "error"
输入字符为: irror'  The input characters are: irror'
检索结果为: 1 2 error 3 terror 4 terrorist territory 特性匹配度: 0.909 0.727 0.667 0.636 0.622 Search results for: 1 2 error 3 terror 4 dancing territory Characteristic matching: 0.909 0.727 0.667 0.636 0.622
4、 综合示例 4, a comprehensive example
输入英文单词时, 可以离散、 交叉、 非完全混合。  When you type English words, you can discrete, cross, and not completely mix.
例如:目标单词为: "marvelous"  For example: the target word is: "marvelous"
输入字符为: "mvrilus"  The input character is: "mvrilus"
检索结果为: 1 marvelous 2 various 3 survival 4 minus 5 visual 特性匹配度: 0.607 0.600 0.536 0.522 0.520 其中输入时, 省略了 a、 e、 o, 有一个错误字符 , 有一个交叉(v,r)。 Search results for: 1 marvelous 2 various 3 survival 4 minus 5 visual Characteristic matching degree: 0.607 0.600 0.536 0.522 0.520 When inputting, a, e, o are omitted, there is an error character, and there is an intersection (v, r).
5、 特殊示例  5, special examples
例如:目标单词为: "marvelous"  For example: the target word is: "marvelous"
输入字符为: "mrxxxxxxlus"  Enter the characters as: "mrxxxxxxlus"
检索结果为: 1 marvelous 2 muscular 3 marxist 4 marxism 5 luxurious 特性匹配度: 0.366 0.317 0.312 0.312 0.302 在另一具体实施方式中, 本发明的一种基于特性参数的字符串匹配方法, 所述的文本 T="W2...tn "中的每一个字符, 对应存储了一个认知权值 w, 组成了 文本的认知权值系列 W="wlW2...wn", 且满足 Wi+wz+. - .+w^l , 认知权值 w^ ^n)代表了字符在文本" ...tn "中被认知的概率; The search results are as follows: 1 marvelous 2 muscular 3 marxist 4 marxism 5 luxurious characteristic matching degree: 0.366 0.317 0.312 0.312 0.302 In another specific embodiment, a string matching method based on characteristic parameters of the present invention, the text T Each character in ="W 2 ... t n " stores a cognitive weight w, which constitutes the cognitive weight series W="w lW2 ...w n " of the text, and satisfies Wi +wz+. - .+w^l , the cognitive weight w^ ^n) represents the probability that the character is recognized in the text "...t n ";
前述在 A步、 B步具体实施方式的基础上, 所述的所述 C歩的根据特性参 数、 文本与检索词的长度计算特性匹配度的步骤为:  Based on the foregoing specific embodiments of step A and step B, the step of calculating the characteristic matching degree according to the characteristic parameter, the text and the length of the search term of the C歩 is:
a步)、 计算检索词与文本的实际匹配字符的相关字符数  Step a), calculating the number of related characters of the actual matching characters of the search term and the text
相关字符数 =2x (m-非完全数);  Number of related characters = 2x (m-non-complete number);
b步)、 计算特性参数对特性匹配度的影响因子 1  Step b), calculate the influence factor of the characteristic parameters on the characteristic matching degree 1
影响因子 l= kl X交叉数; Impact factor l = k l X crossing number;
c步)、 计算特性参数对特性匹配度的影响因子 2  Step c), calculate the influence factor of the characteristic parameters on the characteristic matching degree 2
影响因子 2= 非完全数 + q2x交叉数 + q3x离散数; Impact factor 2 = incomplete number + q 2 x crossing number + q 3 x discrete number;
d步)、 计算文本中已匹配字符的认知权值之和  Step d), calculate the sum of the cognitive weights of the matched characters in the text
根据数组 P0S 中已匹配文本字符的位置, 求出所有已匹配字符的认知权值 之和;  Find the sum of the cognitive weights of all matched characters based on the position of the matched text characters in the array P0S;
e歩)、 特性匹配度 =[ (相关字符数-影响因子 1 ) ÷ (m+n+影响因子 2 ) ] X认知权值之和。  e歩), characteristic matching degree = [(related number of characters - influence factor 1) ÷ (m + n + influence factor 2 ) ] X The sum of cognitive weights.
其中, qi、 q2、 q3是各特性参数在特性匹配度中的权重系数, 为大于 等于零且小于等于 2的实数, qi、 q2、 q3为大于等于零的实数, 权重系数 、 qi、 q2、 q3, 可根据不同产品、 不同应用场合选择不同的数值, 从而影响检索出的文 本的特性匹配度, 并影响检索出的文本的排序。 Where qi , q 2 , q 3 are the weight coefficients of the characteristic parameters in the characteristic matching degree, which are real numbers greater than or equal to zero and less than or equal to 2, qi , q 2 , q 3 are real numbers greater than or equal to zero, weight coefficients, qi , q 2 , q 3 , can choose different values according to different products and different applications, thus affecting the matching degree of the retrieved text and affecting the sorting of the retrieved text.
影响因子的引入, 目的在于根据不同产品, 不同应用环境, 综合考虑不同 特性参数对特性匹配度的影响权重, 从而使得检索结果的排序更为符合特殊环 境应用的用户要求。 The introduction of impact factors is based on different products and different application environments. The characteristic parameters affect the feature matching degree, so that the sorting of the search results is more in line with the user requirements of the special environment application.
上述增加认知权值对特性匹配度的影响方法, 是对特性匹配度计算的一种 改进, 符合人们对特殊符号的实际认知差异, 综合了心理学、 行为学、 语言学、 统计学等多学科的认知思想, 尤其适应于词典检索。 强化了容易认知的字符的 权重, 淡化了容易出错的字符的权重, 由此计算出的特性匹配度, 其检出结果 的排序更符合用户的要求。  The above method of increasing the influence of cognitive weight on the degree of characteristic matching is an improvement on the calculation of the characteristic matching degree, which conforms to the actual cognitive difference of special symbols, and integrates psychology, behavioral science, linguistics, statistics, etc. Multidisciplinary cognitive thinking, especially suitable for dictionary retrieval. The weight of characters that are easy to recognize is strengthened, and the weight of characters that are prone to errors is diluted, and the degree of matching of the characteristics is calculated, and the order of the detected results is more in line with the requirements of the user.
决定认知权值的主要因素有:  The main factors determining the cognitive weight are:
1 ) 字符是否单词的第一个位置;  1) whether the character is the first position of the word;
2 ) 字符是否为发音各音节的首字符;  2) Whether the character is the first character of each syllable of the pronunciation;
3 ) 字符在音节的发音中是否规范, 或者是否作用明显;  3) Whether the character is normalized in the pronunciation of the syllable, or whether the effect is obvious;
4 ) 字符在单词中视觉是否明显;  4) Whether the character is visually obvious in the word;
5 ) 字符是否为单词中的字符等。  5) Whether the character is a character or the like in the word.
实例二 Example two
根据前述有认知权值的基于特性参数的字符串匹配方法的具体实施方式, 本实例增加为有认知权值的电子英文词典检索。 电子英文词典是由英文单词以 及认知权值构成的电子词典库。  According to the specific implementation method of the character string matching method based on the characteristic parameter with the cognitive weight, the present example is added to the electronic English dictionary search with the cognitive weight. The electronic English dictionary is an electronic dictionary library consisting of English words and cognitive weights.
本实例二的方法与实例一的方法基本相同, 不同的仅仅是文本 T中的每一 个字符, 对应存储了一个认知权值, 增加了认知权值对计算特性匹配度的影响。  The method of the second embodiment is basically the same as the method of the first example. The only difference is that each character in the text T stores a cognitive weight correspondingly, which increases the influence of the cognitive weight on the matching degree of the computing characteristic.
下面是计算认知权值的一种方法:  Here's a way to calculate cognitive weights:
1、 字符是否单词的第一个位置, 分值 0.4;  1. Whether the character is the first position of the word, the score is 0.4;
2、 字符是否为发音各音节的首字符, 分值 0.3 ;  2. Whether the character is the first character of each syllable of the pronunciation, the score is 0.3;
3、 字符在音节的发音中是否规范或者是否作用明显, 分值 0.1 ;  3. Whether the character is normalized in the pronunciation of the syllable or whether it has a significant effect, the score is 0.1;
4、 字符在单词中视觉是否明显, 分值 0.2;  4. Whether the character is visually obvious in the word, the score is 0.2;
5、 是单词中的字符, 分值 1。  5, is the character in the word, the score is 1.
例如: 考虑英语单词 "what"。  For example: Consider the English word "what".
字符 w满足 1、 2、 3、 4、 5, 字符 w分值 =2;  The character w satisfies 1, 2, 3, 4, 5, character w score = 2;
字符 h满足 4、 5, 字符 h分值 =1.2;  The character h satisfies 4, 5, and the character h score = 1.2;
字符 a满足 3、 5, 字符 a分值 =1.1 ; 字符 t满足 2、 3、 4、 5, 字符 w分值 =1.6。 The character a satisfies 3, 5, and the character a score = 1.1; The character t satisfies 2, 3, 4, 5, and the character w score = 1.6.
英语单词" what"的总分值为 5.9, 各字符的认知权值为:  The English word "what" has a total score of 5.9, and the cognitive weight of each character is:
w的认知权值 =w分值 /"what"总分值 =2/5.9;  The cognitive weight of w = w score / "what" total score = 2 / 5.9;
h的认知权值 =h分值 /"what"总分值 =1.2/5.9;  The cognitive weight of h =h score / "what" total score = 1.2/5.9;
a的认知权值 =a分值 /"what"总分值 =1.1/5.9;  The cognitive weight of a = a score / "what" total score = 1.1 / 5.9;
t的认知权值 =t分值 /"what"总分值 =1.6/5.9;  The cognitive weight of t = t score / "what" total score = 1.6 / 5.9;
最后得到英语单词" what"的认知权值序列: 2/5.9 , 1.2/5.9, 1.1/5.9, 1.6/5.9。 通过上述具体实施方式, 我们可以看出, 与现有基于错误因素的距离计算 相比较, 本发明计算特性匹配度的每一个特性参数的概念独立, 计算出的特性 参数, 更为细致地反映了文本与检索词在各特性参数上的差异。 因此根据三个 特性参数计算出的特性匹配度, 能更为合理地反映匹配状况, 在词典检索结果 的排序上更加符合用户的实际需求。  Finally, the cognitive weight sequence of the English word "what" is obtained: 2/5.9, 1.2/5.9, 1.1/5.9, 1.6/5.9. Through the above specific implementation manners, we can see that compared with the existing error calculation based on error factors, the concept of each characteristic parameter of the calculation characteristic matching degree of the present invention is independent, and the calculated characteristic parameters are more carefully reflected. The difference between text and search terms in each characteristic parameter. Therefore, according to the characteristic matching degree calculated by the three characteristic parameters, the matching condition can be more reasonably reflected, and the order of the dictionary search results is more in line with the actual needs of the user.
同时, 我们还可以看出, 本发明的基于特性参数的字符串匹配方法, 具有 极强的容错检索能力, 适应于词典检索。  At the same time, we can also see that the string matching method based on characteristic parameters of the invention has strong fault-tolerant retrieval capability and is suitable for dictionary retrieval.
尽管上面对本发明说明性的具体实施方式进行了描述, 但应当清楚, 本发 明不限于具体实施方式的范围, 对本技术领域的普通技术人员来讲, 只要各种 变化在所附的权利要求限定和确定的本发明的精神和范围内, 这些变化是显而 易见的, 一切利用本发明构思的发明创造均在保护之列。  While the invention has been described with respect to the preferred embodiments of the present invention, it should be understood that These variations are obvious within the spirit and scope of the invention, and all inventions that utilize the inventive concept are protected.

Claims

权 利 要 求 书 Claim
1、 一种基于特性参数的字符串匹配方法, 给定存储于存储设备中的一个文 本, 以及输入设备输入的检索词, 其特征在于, 信息处理设备对给定的文本与 检索词进行基于特性参数的字符串匹配, 步骤为: A string matching method based on a characteristic parameter, given a text stored in a storage device, and a search term input by the input device, wherein the information processing device performs a characteristic based on the given text and the search term The string matching of the parameters, the steps are:
A步)、 计算文本与检索词中字符的匹配关系;  Step A), calculating the matching relationship between the text and the characters in the search term;
B步)、 根据文本与检索词中字符的匹配关系计算特性参数, 特性参数包括 反映检索词各字符出现在文本中的离散的字符数的离散数, 反映检索词各字符 出现在文本中的交叉的字符数的交叉数, 反映检索词各字符未出现在文本中的 字符数的非完全数;  Step B), calculating a characteristic parameter according to the matching relationship between the text and the characters in the search term, the characteristic parameter includes a discrete number reflecting the number of discrete characters in which the characters of the search word appear in the text, and reflecting the intersection of each character of the search word in the text The number of intersections of the number of characters, reflecting the incomplete number of characters whose characters do not appear in the text;
C步)、 根据特性参数、 文本与检索词的长度计算特性匹配度;  Step C), calculating the characteristic matching degree according to the characteristic parameter, the text and the length of the search term;
D歩)、 输出特性匹配度。  D歩), output characteristic matching degree.
2、 根据权利要求 1所述的一种基于特性参数的字符串匹配方法, 给定存储 于存储设备中的一个文本为 T= "t^— tn", 以及输入设备输入的检索词为 P=2. A string matching method based on a characteristic parameter according to claim 1, wherein a text stored in the storage device is T="t^ -tn ", and the search term input by the input device is P =
" P —Pm", 其中 ti 、 Pj ( l^ i^n, l ^j^m) 为字符, m、 n均大于等于 1, 其特征在于, 所述 A步的计算文本与检索词中字符的匹配关系的具体步骤为: a步)、 稳定排序检索词 P "P - Pm", where ti , Pj ( l^ i^n, l ^j^m) are characters, m and n are both greater than or equal to 1, characterized in that the calculated text of the A step and the character in the search term The specific steps of the matching relationship are: a step), stable sorting search term P
对检索词?=' ^2... ... pm "中的所有字符, 进行稳定的升序排序, 并存储在 内存中数组 PT中, 数组 PT中同时存储了各字符在检索词中原来的位置, 分别 称为数组 PT中存储的字符子数组 PTc以及数组 PT中存储的位置子数组 PTp; b步)、 稳定排序文本 T For the search term? All characters in =' ^ 2 ... ... p m " are sorted in a stable ascending order and stored in the in-memory array PT. The array PT also stores the original position of each character in the search term, respectively It is called the character sub-array PTc stored in the array PT and the position sub-array PTp stored in the array PT; b step), stable sorting text T
对文本 T="W2... ... tn "中的所有字符, 进行稳定的升序排序, 并存储在内存 中数组 WT中, 数组 WT中同时存储了各字符在文本中原有的位置, 分别称为 数组 WT中存储的字符子数组 WTc以及数组 WT中存储的位置子数组 WTp; c步)、 参数初始化 All characters in the text T="W 2 ... t n " are sorted in a stable ascending order and stored in the in-memory array WT. The array WT also stores the original position of each character in the text. , respectively called the character sub-array WTc stored in the array WT and the position sub-array WTp stored in the array WT; c step), parameter initialization
内存中数组 POS用于存储字符对应位置, 全部初始化为 -1, 非完全数 =0, 数组 WT的位置 W=l, 数组 PT的位置 P=l, 最大位置 =0, 最小位置 =n;  In-memory array POS is used to store the corresponding position of the character, all initialized to -1, non-complete number =0, array WT position W=l, array PT position P=l, maximum position =0, minimum position =n;
d步)、 循环是否结束  d step), whether the loop ends
若数组 WT比较结束或者数组 PT比较结束, 则转 f步;  If the array WT comparison ends or the array PT comparison ends, then go to step f;
e步)、 比较 根据 WTc中位置 W存储的字符与 PTc中位置 P存储的字符的比较情况, 分别进行以下情况处理: Step e), compare According to the comparison between the characters stored in the position W in the WTc and the characters stored in the position P in the PTc, the following cases are respectively processed:
若 WTc中位置 W存储的字符 < PTc中位置 P存储的字符,则位置 W增加 1, 转 d步;  If the character stored in the position P of the character < PTc stored in the position W in the WTc is stored, the position W is incremented by 1, and the step d is performed;
若 WTp中位置 W存储的字符 >PTp中位置 Ρ存储的字符,则位置 Ρ增加 1, 非完全数增加 1, 转 d歩;  If the character stored in the position W in the WTp > PTp is stored in the character, the position Ρ is increased by 1, and the non-complete number is increased by 1, and d 歩;
若 WTc中位置 W存储的字符 = PTc中位置 P存储的字符, 则将 WTp中位 置 W存储的数值存储到数组 POS中, 其存储位置为 PTp中位置 Ρ存储的数值; 若 WTp中位置 W存储的数值 >最大位置, 则将 WTp中位置 W存储的数值存入 最大位置中; 若 WTp中位置 W存储的数值 <最小位置, 则将 WTp中位置 W存 储的数值存入最小位置中; 位置 W增加 1, 位置 P增加 1, 转 d步;  If the character stored in the position W in the WTC = the character stored in the position P in the PTc, the value stored in the position W in the WTp is stored in the array POS, and the storage position is the value stored in the position PTp; if the position W in the WTp is stored The value > maximum position, the value stored in the position W in the WTp is stored in the maximum position; if the value stored in the position W in the WTp < the minimum position, the value stored in the position W in the WTp is stored in the minimum position; Increase by 1, position P increases by 1, and turns to step d;
f歩)、 结束  f歩), end
得到代表文本与检索词中字符的匹配关系的数组 POS、 最大位置、 最小位 置、 位置 P以及非完全数。  Get an array POS, maximum position, minimum position, position P, and incomplete number that represent the matching relationship between the text and the characters in the search term.
3、 根据权利要求 2所述的一种基于特性参数的字符串匹配方法, 其特征在 于, 所述 B步的根据文本与检索词中字符的匹配关系计算特性参数的步骤为: a步)、 非完全数= (非完全数 +m-位置 P+1);  The character string matching method based on the characteristic parameter according to claim 2, wherein the step of calculating the characteristic parameter according to the matching relationship between the text and the character in the search word in the step B is: a step) Incomplete number = (non-complete number + m - position P + 1);
b步)、 离散数 =(最大位置-最小位置 +l-m+非完全数 );  Step b), discrete number = (maximum position - minimum position + l-m + non-complete number);
c步)、 交叉数 =根据数组 POS结果进行交叉数计算。  Step c), number of intersections = cross-count calculation based on the POS result of the array.
4、 根据权利要求 3所述的一种基于特性参数的字符串匹配方法, 其特征在 于, 所述 C歩的根据特性参数、文本与检索词的长度计算特性匹配度的步骤为: a步)、 计算检索词与文本的实际匹配字符的相关字符数  The character string matching method based on the characteristic parameter according to claim 3, wherein the step of calculating the characteristic matching degree according to the characteristic parameter, the text and the length of the search term is: a step) , calculating the number of related characters of the actual matching characters of the search term and the text
相关字符数 =2χ (m-非完全数);  Number of related characters = 2χ (m-non-complete number);
b步)、 计算特性参数对特性匹配度的影响因子 1  Step b), calculate the influence factor of the characteristic parameters on the characteristic matching degree 1
影响因子 l= kl X交叉数; Impact factor l = k l X crossing number;
c步)、 计算特性参数对特性匹配度的影响因子 2  Step c), calculate the influence factor of the characteristic parameters on the characteristic matching degree 2
影响因子 2= ql X非完全数 + q2x交叉数 + q3 x离散数; Influence factor 2 = q l X incomplete number + q 2 x cross number + q 3 x discrete number;
d步)、 计算特性匹配度  d step), calculation characteristic matching degree
特性匹配度 = (相关字符数-影响因子 1 ) ÷ (m+n+影响因子 2 ) ; 其中, qi、 q2、 q3是各特性参数在特性匹配度中的权重系数, 为大于 等于零且小于等于 2的实数, qi、 q2、 q3为大于等于零的实数, 权重系数 ld、 qi、 q2、 q3, 可根据不同产品、 不同应用场合选择不同的数值, 从而影响检索出的文 本的特性匹配度, 并影响检索出的文本的排序。 Characteristic matching degree = (number of related characters - influence factor 1) ÷ (m+n+ influence factor 2); Where qi , q 2 , q 3 are the weight coefficients of the characteristic parameters in the characteristic matching degree, which are real numbers greater than or equal to zero and less than or equal to 2, qi , q 2 , q 3 are real numbers greater than or equal to zero, weight coefficients ld, qi , q 2 , q 3 , can choose different values according to different products and different applications, thus affecting the matching degree of the retrieved text and affecting the sorting of the retrieved text.
5、 根据权利要求 4所述的一种基于特性参数的字符串匹配方法, 其特征在 于, 所述的权重系数 qi、 q2、 q3的值为 1^=2/3、 q尸 1、 q2=2/3、 q3=l/3。 The character string matching method based on the characteristic parameter according to claim 4, wherein the value of the weight coefficients qi , q 2 , q 3 is 1^=2/3, q corpse 1, q 2 = 2/3, q 3 = l/3.
6、 根据权利要求 3所述的一种基于特性参数的字符串匹配方法, 其特征在 于, 交叉数计算的步骤为:  6. A character string matching method based on a characteristic parameter according to claim 3, wherein the step of calculating the number of intersections is:
( 1 )、 求最大可间隔升序序列的长度;  (1) Find the length of the maximum intervalable ascending sequence;
( 2 )、 交叉数 =m-非完全数-最大可间隔升序序列的长度。  (2), number of intersections = m - incomplete number - the length of the maximum intervalable ascending sequence.
7、 根据权利要求 6所述的一种基于特性参数的字符串匹配方法, 其特征在 于, 求最大可间隔升序序列的长度的歩骤为:  7. The character string matching method based on a characteristic parameter according to claim 6, wherein the step of finding the length of the maximum interval ascending sequence is:
( 1 )、 初始化  (1), initialization
依次将数组 POS中大于零的全部数值存入内存中的临时数组中, 临时数组 最后添加一个结束标志; 若临时数组中的元素个数等于 0,直接返回最大可间隔 升序序列的长度等于 0的结果; 若临时数组中的元素个数等于 1,直接返回最大 可间隔升序序列的长度等于 1的结果; 否则, 用内存中数组 LPOS存放最大可 间隔升序序列, 且数组 LPOS 的第一个位置的数值初始化为临时数组中第一个 数值; LP用于指示当前数组 LPOS处理的位置, 并初始化为 1 ; 取临时数组中 第二个数值到比较数据中;  In turn, all the values in the array POS greater than zero are stored in the temporary array in the memory, and the temporary array is finally added with an end flag; if the number of elements in the temporary array is equal to 0, the maximum interval of the ascending sequence can be directly returned to be equal to 0. Result; if the number of elements in the temporary array is equal to 1, the result of directly returning the maximum intervalable ascending sequence length is equal to 1; otherwise, the maximum interval ascending sequence is stored in the in-memory array LPOS, and the first position of the array LPOS The value is initialized to the first value in the temporary array; LP is used to indicate the position of the current array LPOS processing, and initialized to 1; take the second value in the temporary array to the comparison data;
(2)、 判断是否结束  (2), judge whether to end
若比较数据为结束标记, 转 (4) 步;  If the comparison data is the end marker, go to step (4);
(3 )、 根据比较进行两种情况处理  (3), according to the comparison, two cases are handled
若比较数据大于数组 LPOS中 LP位置的数据, 则 LP增加 1, 将比较数据 存储到数组 LPOS中 LP位置处,取临时数组中下一个数值到比较数据中,转(2) 步;  If the comparison data is larger than the data at the LP position in the array LPOS, then LP is incremented by 1. The comparison data is stored in the LP position of the array LPOS, and the next value in the temporary array is taken to the comparison data, and the step (2) is performed;
若比较数据小于数组 LPOS中 LP位置的数据, 则从数组 LPOS中第一个位 置向后进行折半查找, 搜索第一个大于比较数据的数据, 并用比较数据改写该 数据; 取临时数组中下一个数值到比较数据中, 转 (2) 歩; (4)、 最大可间隔升序序列的长度 =LP。 If the comparison data is smaller than the data of the LP position in the array LPOS, the second position in the array LPOS is searched backwards, the first data larger than the comparison data is searched, and the data is overwritten with the comparison data; the next one in the temporary array is taken. Value to the comparison data, turn (2) 歩; (4) The length of the maximum intervalable ascending sequence = LP.
8、 根据权利要求 3所述的一种基于特性参数的字符串匹配方法, 其特征在 于, 所述的文本 Τ="Μ2...ίη "中的每一个字符, 对应存储了一个认知权值 w, 组 成了文本的认知权值系列 =、 且满足 Wl+w2+...+wn=l, 认知权值 w^^n)代表了字符在文本" W2...tn "中被认知的概率; 8. The character string matching method based on a characteristic parameter according to claim 3, wherein each of the characters Τ="Μ 2 ... ί η " stores a recognition correspondingly The knowledge weight w, which constitutes the cognitive weight series of the text =, and satisfies Wl + w 2 +... + w n = l, the cognitive weight w^^n) represents the character in the text " W 2 . The probability of being recognized in ..t n ";
所述的 C步的根据特性参数、 文本与检索词的长度计算特性匹配度的步骤 为:  The step of calculating the characteristic matching degree according to the characteristic parameter, the text and the length of the search term in the step C is:
a步)、 计算检索词与文本的实际匹配字符的相关字符数  Step a), calculating the number of related characters of the actual matching characters of the search term and the text
相关字符数 =2x (m-非完全数);  Number of related characters = 2x (m-non-complete number);
b步)、 计算特性参数对特性匹配度的影响因子 1  Step b), calculate the influence factor of the characteristic parameters on the characteristic matching degree 1
影响因子 l=klX交叉数; Influence factor l=k lX cross number;
c步)、 计算特性参数对特性匹配度的影响因子 2  Step c), calculate the influence factor of the characteristic parameters on the characteristic matching degree 2
影响因子 2= qix非完全数+ q2x交叉数 + q3x离散数; Influence factor 2 = qi x non-complete number + q 2 x cross number + q 3 x discrete number;
d步)、 计算文本中已匹配字符的认知权值之和  Step d), calculate the sum of the cognitive weights of the matched characters in the text
根据数组 P0S 中已匹配文本字符的位置, 求出所有已匹配字符的认知权值 之和;  Find the sum of the cognitive weights of all matched characters based on the position of the matched text characters in the array P0S;
e歩)、 特性匹配度 =[ (相关字符数-影响因子 1) ÷ (m+n+影响因子 2) ] X认知权值之和。  e歩), characteristic matching degree = [(related number of characters - influence factor 1) ÷ (m + n + influence factor 2) ] X The sum of cognitive weights.
其中, qi、 q2、 q3是各特性参数在特性匹配度中的权重系数, 为大于 等于零且小于等于 2的实数, qi、 q2、 q3为大于等于零的实数, 权重系数 、 qi、 q2、 q3, 可根据不同产品、 不同应用场合选择不同的数值, 从而影响检索出的文 本的特性匹配度, 并影响检索出的文本的排序。 Where qi , q 2 , q 3 are the weight coefficients of the characteristic parameters in the characteristic matching degree, which are real numbers greater than or equal to zero and less than or equal to 2, qi , q 2 , q 3 are real numbers greater than or equal to zero, weight coefficients, qi , q 2 , q 3 , can choose different values according to different products and different applications, thus affecting the matching degree of the retrieved text and affecting the sorting of the retrieved text.
PCT/CN2008/070603 2007-04-02 2008-03-27 Method for matching character string based on characteristic parameters WO2008119297A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200710048790.0 2007-04-02
CNA2007100487900A CN101030216A (en) 2007-04-02 2007-04-02 Method for matching text string based on parameter characteristics

Publications (1)

Publication Number Publication Date
WO2008119297A1 true WO2008119297A1 (en) 2008-10-09

Family

ID=38715562

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2008/070603 WO2008119297A1 (en) 2007-04-02 2008-03-27 Method for matching character string based on characteristic parameters

Country Status (2)

Country Link
CN (1) CN101030216A (en)
WO (1) WO2008119297A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101030216A (en) * 2007-04-02 2007-09-05 丁光耀 Method for matching text string based on parameter characteristics
CN101345957B (en) * 2008-08-20 2013-01-09 宇龙计算机通信科技(深圳)有限公司 Recognition method, system and mobile terminal for login cipher
CN102682033A (en) * 2011-03-17 2012-09-19 环达电脑(上海)有限公司 Method for querying words by matching binary characteristic values
CN102184195B (en) * 2011-04-20 2014-01-08 北京百度网讯科技有限公司 Method, device and equipment for obtaining similarity between character strings
CN107239500A (en) * 2017-05-03 2017-10-10 成都国腾实业集团有限公司 A kind of character string matching method and system
CN112182313A (en) * 2020-09-30 2021-01-05 国网青海省电力公司 Relay protection setting value name matching method and system
CN112508845A (en) * 2020-10-15 2021-03-16 福州大学 Depth learning-based automatic osd menu language detection method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1158460A (en) * 1996-12-31 1997-09-03 复旦大学 A method for automatic classification and retrieval of cross-lingual corpus
JPH11238068A (en) * 1998-02-20 1999-08-31 Fuji Xerox Co Ltd Text retrieval device
CN1392497A (en) * 2002-07-24 2003-01-22 彭泉 Matching method for large character string
CN1916896A (en) * 2006-09-08 2007-02-21 丁光耀 Matching method based on discrete, intersectional, not complete character string mode
CN101030216A (en) * 2007-04-02 2007-09-05 丁光耀 Method for matching text string based on parameter characteristics

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1158460A (en) * 1996-12-31 1997-09-03 复旦大学 A method for automatic classification and retrieval of cross-lingual corpus
JPH11238068A (en) * 1998-02-20 1999-08-31 Fuji Xerox Co Ltd Text retrieval device
CN1392497A (en) * 2002-07-24 2003-01-22 彭泉 Matching method for large character string
CN1916896A (en) * 2006-09-08 2007-02-21 丁光耀 Matching method based on discrete, intersectional, not complete character string mode
CN101030216A (en) * 2007-04-02 2007-09-05 丁光耀 Method for matching text string based on parameter characteristics

Also Published As

Publication number Publication date
CN101030216A (en) 2007-09-05

Similar Documents

Publication Publication Date Title
CN106202153B (en) A kind of the spelling error correction method and system of ES search engine
CN109829104B (en) Semantic similarity based pseudo-correlation feedback model information retrieval method and system
CN103198079B (en) The implementation method of relevant search and device
CN109960724B (en) Text summarization method based on TF-IDF
CN105653706B (en) A kind of multilayer quotation based on literature content knowledge mapping recommends method
CN106096066B (en) Text Clustering Method Based on Random Neighbor Embedding
CN111104794A (en) Text similarity matching method based on subject words
WO2008119297A1 (en) Method for matching character string based on characteristic parameters
Yerra et al. A sentence-based copy detection approach for web documents
CN105631009A (en) Retrieval method and system based on word vector similarity
JP2012524314A (en) Method and apparatus for data retrieval and indexing
WO2011109251A2 (en) Semantic object characterization and search
WO2008031306A1 (en) A method of characteristic character string matching based on discreteness, cross and non-identical
CN111104488B (en) Method, device and storage medium for integrating retrieval and similarity analysis
EP2162838B1 (en) Phonetic search using normalized string
Clinchant et al. Xrce’s participation in wikipedia retrieval, medical image modality classification and ad-hoc retrieval tasks of imageclef 2010
CN113673252B (en) Automatic join recommendation method for data table based on field semantics
Chen et al. Fine-grained product categorization in e-commerce
CN111611372A (en) Search result sorting method and device and music searching method and device
CN115794998B (en) A method for mining professional terminology based on contrastive learning
CN105354264A (en) Locality-sensitive-hashing-based subject label fast endowing method
CN102722526B (en) Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method
CN120234386A (en) A retrieval joint optimization method for retrieval enhancement generation system
CN106919565B (en) MapReduce-based document retrieval method and system
CN104615590A (en) Project name extraction method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08715339

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08715339

Country of ref document: EP

Kind code of ref document: A1