WO2008119297A1

WO2008119297A1 - Method for matching character string based on characteristic parameters

Info

Publication number: WO2008119297A1
Application number: PCT/CN2008/070603
Authority: WO
Inventors: Guangyao Ding
Original assignee: Guangyao Ding
Priority date: 2007-04-02
Filing date: 2008-03-27
Publication date: 2008-10-09
Also published as: CN101030216A

Abstract

A method for matching a character string based on characteristic parameters is provided. A text stored in a storage device and a searching word are given. The characteristic parameters include a discrete number, a crossing number and a non-identical number. The method comprises: calculating a matching relationship between the characters in the text and in the searching word; calculating the characteristic parameters according to that matching relationship; calculating a characteristic matching degree according to the characteristic parameters and the lengths of the text and the searching word; and returning the characteristic matching degree.

Description

TECHNICAL FIELD

The present invention relates to a string matching method, and in particular to a string matching method based on characteristic parameters.

BACKGROUND ART

Dictionary search is the most basic application of string matching technology. The retrieval techniques of existing dictionary retrieval products fall into two categories: retrieval techniques based on exact matching, and retrieval techniques based on inexact matching.

Exact-matching retrieval techniques have no fault tolerance, while inexact-matching retrieval techniques tolerate only a small number of errors in user input and therefore have relatively low application value.

For the past forty years, domestic and foreign research on inexact string matching has relied on distance calculation based on error factors. The most commonly used distances are the Levenshtein distance and the edit distance (ED). The error factors affecting the distance mainly include insertion errors, deletion errors, replacement errors, and exchange errors. Distance calculation based on error factors has some inherent problems, which make dictionary search results too coarse and limit fault tolerance. The problems are mainly reflected in the following points:

1) Existing research on inexact string matching based on error-factor distance calculation starts from observed problem phenomena, such as insertion, deletion, replacement, exchange, and reversal errors. These errors are neither fully independent nor strictly definable, and they take diverse forms; they are typical problem phenomena. For example, a replacement error can essentially be expressed qualitatively as an insertion error plus a deletion error, and so can an exchange error. Some error factors are therefore not independent concepts, which is one of the important reasons why string matching has not yet formed a scientific classification system.

2) The polymorphism of error-factor-based descriptions directly affects the string matching result and the ordering of the retrieval results. Table 1 shows, for specific matching problems, how the same error phenomenon can be described in several ways in terms of error factors.

Text: ABCDEFGH. Search term: ABCDEFGH.
Description based on error factors: exact match (i.e., substring match); deletion, insertion, exchange, and other errors are not allowed.

Text: ABCDEFGH. Search term: CDEF.
Description based on error factors: exact match (i.e., substring match); leading and trailing deletions are allowed.

Text: ABCDEFGH. Search term: CDF.
Description based on error factors: inexact match:
1. there is one deletion (E); or
2. there are several leading and trailing deletions and one intermediate deletion (E); or
3. there is one insertion (F); or
4. there is one replacement (E, F); etc.

Text: ABCDEFGH. Search term: CEDF.
Description based on error factors: inexact match:
1. there is one exchange (DE, ED); or
2. there are two replacements (D, E), (E, D); or
3. there is one insertion (D) and one deletion (D); etc.

Text: ABCDEFGH. Search term: CEFD.
Description based on error factors: inexact match:
1. there is one deletion (D) and one insertion (D); or
2. there are two insertions (C), (D); or
3. there are two insertions (E), (F); etc.

Text: ABCDEFGH. Search term: ACEFXD.
Description based on error factors: inexact match:
1. there are two deletions (B), (D) and two insertions (X), (D); or
2. there are two deletions (B), (D) and two replacements (G, X), (H, D); etc.

Table 1

Table 1 ignores the quantitative effect of the distance calculation. Even so, the matching problem between a specific text and a search term has several qualitative descriptions in terms of error factors. The same problem can thus be described in many ways, which makes classification difficult.

3) In inexact string matching based on error-factor distance calculation, the various error factors are quantized uniformly by the distance calculation, for example the edit distance (ED). This blurs the nature of the different error factors in a match, and the matching result expressed by the distance is too coarse. For example, a distance of 2 may mean two insertion errors, two deletion errors, two replacement errors, or two mixed errors. Because the error factors are not independent concepts and the nature of the errors in a match is uncertain, the matching situation cannot be refined further in terms of error factors. Therefore, when the matching degree is calculated in a dictionary search, there is no more detailed parameter basis that accurately reflects the matching situation, which hinders reasonable ordering of the retrieved results.

4) Existing dictionary search research rarely considers the influence of psychology, cognition, linguistics, and behavior on dictionary search. In fact, each character is recognized to a different degree depending on factors such as its position in the word, its pronunciation, and its visual prominence. Some characters are easy to remember, while others are hard to remember or fade gradually over time, which is also the main cause of inaccurate input. Therefore, a search should take into account how the cognitive differences among the characters of a word affect the dictionary search result.

The above problems show that the basic framework of string matching is still imperfect. This directly affects the reasonable ordering of dictionary search results and limits fault tolerance; matching methods are diverse but difficult to apply in combination. These problems urgently need to be solved.

SUMMARY OF THE INVENTION

An object of the present invention is to overcome the above deficiencies of the prior art and to provide a string matching method based on characteristic parameters that orders retrieval results more reasonably and has strong fault tolerance.

To achieve the above object, the present invention provides a string matching method based on characteristic parameters. Given a text stored in a storage device and a search term input by an input device, the method is characterized in that an information processing device performs string matching based on characteristic parameters on the given text and search term. The steps are:

Step A), calculating the matching relationship between the characters in the text and in the search term;

Step B), calculating the characteristic parameters according to that matching relationship, the characteristic parameters including a discrete number, which reflects how many characters separate the search term's characters where they appear in the text; a crossing number, which reflects how many of the search term's characters appear in the text in crossed order; and a non-complete number, which reflects how many of the search term's characters do not appear in the text;

Step C), calculating the characteristic matching degree according to the characteristic parameters and the lengths of the text and the search term;

Step D), outputting the characteristic matching degree.

Compared with the prior art, the beneficial effects of the present invention are as follows.

First, Table 2 compares, example by example, the existing description based on error factors with the description based on the three characteristics of the present invention.

Text: ABCDEFGH. Search term: ABCDEFGH.
Existing description based on error factors: exact match (i.e., substring match); deletion, insertion, exchange, and other errors are not allowed.
Description based on the three characteristics: exact match (i.e., substring match); discreteness, crossing, and non-completeness are not allowed.

Text: ABCDEFGH. Search term: CDEF.
Existing description based on error factors: exact match (i.e., substring match); only leading and trailing deletions are allowed.
Description based on the three characteristics: exact match (i.e., substring match); discreteness, crossing, and non-completeness are not allowed.

Text: ABCDEFGH. Search term: CDF.
Existing description based on error factors: inexact match:
1. there is one deletion (E); or
2. there are several leading and trailing deletions and one intermediate deletion; or
3. there is one insertion (F); or
4. there is one replacement (E, F); etc.
Description based on the three characteristics: discrete match; there is only one discrete character (E).

Text: ABCDEFGH. Search term: CEDF.
Existing description based on error factors: inexact match:
1. there is one exchange (DE, ED); or
2. there are two replacements (D, E), (E, D); or
3. there is one insertion (D) and one deletion (D); etc.
Description based on the three characteristics: cross match; there is only one crossing (D).

Text: ABCDEFGH. Search term: CEFD.
Existing description based on error factors: inexact match:
1. there is one deletion (D) and one insertion (D); or
2. there are two insertions (C), (D); or
3. there are two insertions (E), (F); etc.
Description based on the three characteristics: cross match; there is only one crossing (D).

Text: ABCDEFGH. Search term: ACEFXD.
Existing description based on error factors: inexact match:
1. there are two deletions (B), (D) and two insertions (X), (D); or
2. there are two deletions (B), (D) and two replacements (G, X), (H, D); etc.
Description based on the three characteristics: discrete, cross, and non-complete match, with one discrete character (B), one crossing (D), and one non-complete character (X).

Table 2

It can be seen that the matching of a specific text and search term has several descriptions in terms of error factors. The same problem can thus be described in many ways, which makes detailed classification difficult. With the three characteristics of the present invention, however, each problem has only one description, which accurately reflects the character correspondence between the text and the search term.

The two substring matching problems in Table 2 have two different descriptions in terms of error factors. In the present invention they share a single description, which agrees better with the definition of a substring.

The comparison in the table makes clear the difference between the discrete characteristic and deletion errors, between the crossing characteristic and exchange errors, and between non-completeness and insertion errors.

The three characteristics used in the present invention, namely discreteness, crossing, and non-completeness, differ completely in concept and nature and are mutually independent. Error factors are the external manifestation of the three characteristics. An approach to string matching based on the three characteristic parameters reveals the inherent laws of the string matching problem more scientifically.

Second, because the existing error factors are not independent concepts and the nature of the errors in a match is uncertain, the matching situation cannot be refined further in terms of error factors. Therefore, when the matching degree is calculated in a dictionary search, there is no more detailed parameter basis that accurately reflects the matching situation, which hinders reasonable ordering of the retrieved results.

The three characteristics adopted by the present invention, namely discreteness, crossing, and non-completeness, differ in concept and nature and are mutually independent. Because the characteristic matching degree is calculated from three characteristic parameters, the influence of each characteristic on the matching degree can be considered separately, so that the calculated characteristic matching degree reflects the similarity between the text and the search term more accurately. As a result, when all matching dictionary words are output in order of the characteristic matching degree obtained by the invention, the ordering is more reasonable and fault tolerance is greatly improved. This overcomes the defect that results of error-factor distance calculation are too coarse to support a comprehensive calculation of the matching degree.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The string matching method based on characteristic parameters of the present invention is described in further detail below with reference to specific embodiments.

The electronic English dictionary is an electronic dictionary library composed of English words. An electronic English dictionary search performs a string matching operation between the input English word, i.e., the search term P, and each word in the dictionary library, i.e., the text T, and outputs the words that satisfy the condition in order of their matching results for the user to choose from.

The core technology of the electronic English dictionary is string matching. The matching result directly determines the final position of each retrieved word in the ordering, and is also an important indicator of the retrieval quality of an electronic English dictionary.

In the present embodiment, the text stored in the storage device is $T = t_1 t_2 \ldots t_n$ and the search term input by the input device is $P = p_1 p_2 \ldots p_m$, where $t_i$ and $p_j$ ($1 \le i \le n$, $1 \le j \le m$) are characters and $m$ and $n$ are greater than or equal to 1. The specific steps of step A, calculating the matching relationship between the characters in the text and in the search term, are:

Step a), stable sorting of the search term P

All characters of the search term $P = p_1 p_2 \ldots p_m$ are sorted in stable ascending order and stored in the in-memory array PT. The array PT also stores the original position of each character in the search term. The two parts are called the character sub-array PTc and the position sub-array PTp of the array PT.

Step b), stable sorting of the text T

All characters of the text $T = t_1 t_2 \ldots t_n$ are sorted in stable ascending order and stored in the in-memory array WT. The array WT also stores the original position of each character in the text. The two parts are called the character sub-array WTc and the position sub-array WTp of the array WT.

Step c), parameter initialization

The in-memory array POS stores the corresponding position of each character and is initialized entirely to -1; non-complete number = 0; the position in the array WT is W = 1; the position in the array PT is P = 1; maximum position = 0; minimum position = n.

Step d), loop termination check

If the comparison of the array WT has ended or the comparison of the array PT has ended, go to step f).

Step e), comparison

The character stored at position W in WTc is compared with the character stored at position P in PTc, and the following cases are handled:

If the character at position W in WTc < the character at position P in PTc, position W is increased by 1; go to step d).

If the character at position W in WTc > the character at position P in PTc, position P is increased by 1 and the non-complete number is increased by 1; go to step d).

If the character at position W in WTc = the character at position P in PTc, the value stored at position W in WTp is stored in the array POS, at the index given by the value stored at position P in PTp. If the value stored at position W in WTp > maximum position, that value becomes the maximum position; if the value stored at position W in WTp < minimum position, that value becomes the minimum position. Position W is increased by 1 and position P is increased by 1; go to step d).

Step f), end

The results are the array POS, the maximum position, the minimum position, the position P, and the non-complete number, which together represent the matching relationship between the characters in the text and in the search term.

Sorting the text and the search term before calculating the character matching relationship improves the calculation speed. The time complexity is O(k × log2 k), where k = max(m, n).
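Steps a) to f) can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `match_relationship` and the tuple it returns are assumptions introduced here, positions are kept 1-based as in the description, and Python's built-in `sorted` (which is stable) stands in for the stable sort.

```python
def match_relationship(text, term):
    """Hypothetical sketch of step A; returns (POS, max, min, P, non-complete)."""
    n, m = len(text), len(term)
    # Steps a) and b): sort (character, 1-based position) pairs by character.
    # sorted() is stable, so equal characters keep their original order.
    pt = sorted(((c, j) for j, c in enumerate(term, start=1)), key=lambda x: x[0])
    wt = sorted(((c, i) for i, c in enumerate(text, start=1)), key=lambda x: x[0])
    # Step c): initialization.
    pos = [-1] * m            # POS[j-1] = text position matched to term char j
    non_complete = 0
    w, p = 1, 1               # 1-based positions in WT and PT
    max_pos, min_pos = 0, n
    # Steps d) and e): merge-style scan over the two sorted arrays.
    while w <= n and p <= m:
        wc, wp = wt[w - 1]
        pc, pp = pt[p - 1]
        if wc < pc:
            w += 1
        elif wc > pc:
            p += 1
            non_complete += 1
        else:
            pos[pp - 1] = wp
            max_pos = max(max_pos, wp)
            min_pos = min(min_pos, wp)
            w += 1
            p += 1
    # Step f): hand the raw results to step B.
    return pos, max_pos, min_pos, p, non_complete


print(match_relationship("ABCDEFGH", "ACEFXD"))
```

For the text ABCDEFGH and the search term ACEFXD, the unmatched character X keeps the value -1 in POS, and the scan stops with P = 6 because the text is exhausted first.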

In another specific embodiment, based on the previous embodiment, the steps of step B, calculating the characteristic parameters according to the matching relationship between the characters in the text and in the search term, are:

Step a), non-complete number = (non-complete number + m − position P + 1);

Step b), discrete number = (maximum position − minimum position + 1 − m + non-complete number);

Step c), crossing number = result of the crossing number calculation based on the array POS.
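Steps a) and b) are plain arithmetic on the outputs of step A and can be sketched as below. The function name and argument order are assumptions made for illustration; the sample values are the ones step A yields for the text ABCDEFGH with the search term ACEFXD, and for the text "wonderful" with the search term "wdfl".

```python
def non_complete_and_discrete(m, max_pos, min_pos, p, non_complete):
    """Hypothetical helper for steps a) and b) of step B (1-based position P)."""
    # Step a): search-term characters never reached by the scan are also unmatched.
    non_complete = non_complete + m - p + 1
    # Step b): size of the matched span in the text, less the matched characters.
    discrete = max_pos - min_pos + 1 - m + non_complete
    return non_complete, discrete


# Text ABCDEFGH, search term ACEFXD: one non-complete (X), one discrete (B).
print(non_complete_and_discrete(6, 6, 1, 6, 0))
# Text "wonderful", search term "wdfl": nothing unmatched, five skipped characters.
print(non_complete_and_discrete(4, 9, 1, 5, 0))
```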

The steps of the foregoing crossing number calculation may be:

(1) find the length of the maximum intervalable ascending sequence (i.e., the longest increasing subsequence, whose elements need not be adjacent);

(2) crossing number = m − non-complete number − length of the maximum intervalable ascending sequence.

The foregoing step of finding the length of the maximum intervalable ascending sequence may be:

(1) Initialization

All values in the array POS that are greater than zero are stored, in order, in a temporary in-memory array, and an end marker is appended to the temporary array. If the temporary array contains 0 elements, 0 is returned directly as the length of the maximum intervalable ascending sequence; if it contains 1 element, 1 is returned directly. Otherwise, the maximum intervalable ascending sequence is stored in the in-memory array LPOS, and the first position of LPOS is initialized to the first value in the temporary array. LP denotes the position currently being processed in LPOS and is initialized to 1. The second value in the temporary array is taken as the comparison data.

(2) End check

If the comparison data is the end marker, go to step (4).

(3) Comparison, with two cases

If the comparison data is larger than the value at position LP in the array LPOS, LP is increased by 1, the comparison data is stored at position LP of LPOS, the next value in the temporary array is taken as the comparison data, and processing returns to step (2).

If the comparison data is smaller than the value at position LP in the array LPOS, the array LPOS is searched from its first position onward for the first value larger than the comparison data, and that value is overwritten with the comparison data; the next value in the temporary array is taken as the comparison data, and processing returns to step (2).

(4) Length of the maximum intervalable ascending sequence = LP.
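The procedure above corresponds to the well-known longest-increasing-subsequence length computation. The sketch below follows the same idea, with two stated assumptions: the function names are illustrative, and the search for "the first value larger than the comparison data" is done with a binary search from the standard `bisect` module (the description does not say how the search is performed).

```python
from bisect import bisect_left


def max_ascending_length(pos):
    """Length of the longest increasing subsequence of the matched values in POS."""
    lpos = []                          # plays the role of the array LPOS
    for value in pos:
        if value <= 0:                 # -1 marks an unmatched character; skip it
            continue
        if not lpos or value > lpos[-1]:
            lpos.append(value)         # extend the sequence (LP increases by 1)
        else:
            # Overwrite the first stored value that is not smaller than `value`.
            lpos[bisect_left(lpos, value)] = value
    return len(lpos)


def crossing_number(pos, m, non_complete):
    """Crossing number = m - non-complete number - ascending sequence length."""
    return m - non_complete - max_ascending_length(pos)


print(max_ascending_length([7, 8, 9, 1, 2, 6, 3, 4, 12]))   # 5
print(crossing_number([1, 3, 5, 6, -1, 4], 6, 1))           # 1
```

Because LPOS stays sorted, each overwrite needs only a binary search, so the whole pass over m values takes on the order of m × log2 m steps.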

During matching, the array POS stores the text position matched to each character of the search term. Apart from the value -1, all data in the array POS are integers greater than 0 and are mutually distinct. Since the value -1 represents an unmatched character, it is not counted in the maximum intervalable ascending sequence.

The length of the maximum intervalable ascending sequence of the array POS is obtained as follows: according to the sizes of the data in the array POS, a maximum intervalable ascending sequence is found in POS, and the number of data items in that sequence is its length.

The maximum intervalable ascending sequence and its length are strictly defined as follows:

Definition: Let $a_1 a_2 \ldots a_n$ ($a_i \ne a_j$ for $i \ne j$) be any sequence whose elements can be compared with one another. If there is a longest subsequence $a_{k_1} a_{k_2} \ldots a_{k_m}$ satisfying

1. $1 \le k_1 < k_2 < \cdots < k_m \le n$, and

2. $a_{k_1} < a_{k_2} < \cdots < a_{k_m}$,

then the subsequence $a_{k_1} a_{k_2} \ldots a_{k_m}$ is the maximum intervalable ascending sequence of the sequence $a_1 a_2 \ldots a_n$, and the number of elements $m$ is its length.

For example, for the sequence 7, 8, 9, 1, 2, 6, 3, 4, 12:

Maximum intervalable ascending sequence: 1, 2, 3, 4, 12;

Length of the maximum intervalable ascending sequence: 5.

From the length of the maximum intervalable ascending sequence, the crossing number of a match between the text and the search term can be found. The crossing number complements the discrete number and the non-complete number in the matching degree, so that texts can be ordered to meet the user's retrieval requirements.

The time complexity of the algorithm is O(m × log2 m).

In another specific embodiment, based on the foregoing specific embodiments of step A and step B, the steps of step C, calculating the characteristic matching degree according to the characteristic parameters and the lengths of the text and the search term, are:

Step a), calculate the number of related characters, i.e., the characters of the search term and the text that actually match

Number of related characters = 2 × (m − non-complete number);

Step b), calculate influence factor 1 of the characteristic parameters on the characteristic matching degree

Influence factor 1 = $k_1$ × crossing number;

Step c), calculate influence factor 2 of the characteristic parameters on the characteristic matching degree

Influence factor 2 = $q_1$ × non-complete number + $q_2$ × crossing number + $q_3$ × discrete number;

Step d), calculate the characteristic matching degree

Characteristic matching degree = (number of related characters − influence factor 1) ÷ (m + n + influence factor 2);

where $k_1$, $q_1$, $q_2$, and $q_3$ are the weight coefficients of the characteristic parameters in the characteristic matching degree: $k_1$ is a real number greater than or equal to zero and less than or equal to 2, and $q_1$, $q_2$, and $q_3$ are real numbers greater than or equal to zero. Different values of $k_1$, $q_1$, $q_2$, and $q_3$ can be chosen for different products and applications, which changes the matching degree of the retrieved texts and therefore their ordering. In one specific application, the weight coefficients are $k_1 = 2/3$, $q_1 = 1$, $q_2 = 2/3$, $q_3 = 1/3$.

The introduction of impact factors aims to comprehensively consider the influence weights of different characteristic parameters on the feature matching degree according to different products and different application environments, so that the ranking of search results is more in line with the user requirements of special environment applications.
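The formula of step C can be sketched as a single function. The function name and keyword defaults are assumptions for illustration; the default weights are the values k1 = 2/3, q1 = 1, q2 = 2/3, q3 = 1/3 quoted for one specific application.

```python
def characteristic_matching_degree(m, n, non_complete, crossing, discrete,
                                   k1=2/3, q1=1.0, q2=2/3, q3=1/3):
    """Hypothetical sketch of step C for a search term of length m and text of length n."""
    related = 2 * (m - non_complete)            # step a): related characters
    factor1 = k1 * crossing                     # step b): influence factor 1
    factor2 = (q1 * non_complete                # step c): influence factor 2
               + q2 * crossing
               + q3 * discrete)
    return (related - factor1) / (m + n + factor2)   # step d)


# Search term "whta" against text "what": one crossing, nothing else.
print(round(characteristic_matching_degree(4, 4, 0, 1, 0), 3))
# Search term "irror" against text "error": one non-complete character.
print(round(characteristic_matching_degree(5, 5, 1, 0, 0), 3))
```

With all three parameters equal to zero and m = n, the expression reduces to 2m ÷ 2m = 1, the value for an identical text and search term.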

In this embodiment, the characteristic matching degree is a real number greater than or equal to zero and less than or equal to 1.

Example one

The following are the search results and characteristic matching degrees of an electronic English dictionary search example designed according to the specific embodiments of steps A, B, and C above.

In the electronic English dictionary library of this example, six thousand common English words are selected; the weight coefficients are $k_1 = 2/3$, $q_1 = 1$, $q_2 = 2/3$, $q_3 = 1/3$; the search results are sorted in descending order of the calculated characteristic matching degree, and only the first five words are output.

1. Discrete search

When typing an English word, you can omit any of its characters.

For example: the target word is: "wonderful"

The input characters are: "wdfl"

Search results: 1 wonderful 2 handful 3 unfold 4 wind 5 windy

Characteristic matching degree: 0.546 0.487 0.444 0.375 0.343

2. Cross search

When typing an English word, you can cross (transpose) any of its characters.

For example: the target word is: "what"

The input characters are: "whta"

Search results: 1 what 2 wheat 3 watch 4 hat 5 white

Characteristic matching degree: 0.846 0.733 0.625 0.615 0.581

3. Non-complete characters allowed

When typing an English word, wrong characters are allowed.

For example: the target word is: "error"

The input characters are: "irror"

Search results: 1 2 error 3 terror 4 dancing territory

Characteristic matching degree: 0.909 0.727 0.667 0.636 0.622

4. A comprehensive example

When typing an English word, omitted, crossed, and wrong characters may be mixed.

For example: the target word is: "marvelous"

The input characters are: "mvrilus"

Search results: 1 marvelous 2 various 3 survival 4 minus 5 visual

Characteristic matching degree: 0.607 0.600 0.536 0.522 0.520

In this input, a, e, and o are omitted, there is one wrong character (i), and there is one crossing (v, r).
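As a check, the matching degree of the first result can be worked through by hand for the text "marvelous" (n = 9) and the input "mvrilus" (m = 7), using the example weights $k_1 = 2/3$, $q_1 = 1$, $q_2 = 2/3$, $q_3 = 1/3$. The character correspondence used here (m→1, v→4, r→3, l→6, u→8, s→9, with i unmatched) is derived from the step A procedure and is not stated in the source; its longest ascending run 1, 3, 6, 8, 9 has length 5.

```latex
\begin{align*}
\text{non-complete number} &= 1 && (\text{i})\\
\text{discrete number} &= 9 - 1 + 1 - 7 + 1 = 3 && (\text{a, e, o})\\
\text{crossing number} &= 7 - 1 - 5 = 1 && (\text{v, r})\\
\text{number of related characters} &= 2 \times (7 - 1) = 12\\
\text{influence factor 1} &= \tfrac{2}{3} \times 1 = \tfrac{2}{3}\\
\text{influence factor 2} &= 1 \times 1 + \tfrac{2}{3} \times 1 + \tfrac{1}{3} \times 3 = \tfrac{8}{3}\\
\text{characteristic matching degree} &= \frac{12 - \tfrac{2}{3}}{7 + 9 + \tfrac{8}{3}} = \frac{34}{56} \approx 0.607
\end{align*}
```

The result agrees with the value 0.607 listed for "marvelous".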

5. A special example

For example: the target word is: "marvelous"

The input characters are: "mrxxxxxxlus"

Search results: 1 marvelous 2 muscular 3 marxist 4 marxism 5 luxurious

Characteristic matching degree: 0.366 0.317 0.312 0.312 0.302

In another specific embodiment of the string matching method based on characteristic parameters of the present invention, each character in the text $T = t_1 t_2 \ldots t_n$ stores a corresponding cognitive weight. These weights constitute the cognitive weight sequence $W = w_1 w_2 \ldots w_n$ of the text and satisfy $w_1 + w_2 + \cdots + w_n = 1$. The cognitive weight $w_i$ ($1 \le i \le n$) represents the probability that the character $t_i$ is recognized in the text $t_1 t_2 \ldots t_n$.

Based on the foregoing specific embodiments of step A and step B, the steps of step C, calculating the characteristic matching degree according to the characteristic parameters and the lengths of the text and the search term, are:

Step a), number of related characters = 2 × (m − non-complete number);

Step b), influence factor 1 = $k_1$ × crossing number;

Step c), influence factor 2 = $q_1$ × non-complete number + $q_2$ × crossing number + $q_3$ × discrete number;

Step d), calculate the sum of the cognitive weights of the matched characters in the text

The sum of the cognitive weights of all matched characters is found from the positions of the matched text characters stored in the array POS;

Step e), characteristic matching degree = [(number of related characters − influence factor 1) ÷ (m + n + influence factor 2)] × sum of cognitive weights;

where $k_1$, $q_1$, $q_2$, and $q_3$ are the weight coefficients of the characteristic parameters in the characteristic matching degree: $k_1$ is a real number greater than or equal to zero and less than or equal to 2, and $q_1$, $q_2$, and $q_3$ are real numbers greater than or equal to zero. Different values of $k_1$, $q_1$, $q_2$, and $q_3$ can be chosen for different products and applications, which changes the matching degree of the retrieved texts and therefore their ordering.
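The cognitive-weight variant only multiplies the earlier matching degree by the summed weights of the matched text characters. A minimal sketch follows, assuming a hypothetical function name, 1-based text positions in POS with -1 for unmatched characters, and the example weights k1 = 2/3, q1 = 1, q2 = 2/3, q3 = 1/3 as defaults.

```python
def weighted_matching_degree(m, n, non_complete, crossing, discrete,
                             pos, weights,
                             k1=2/3, q1=1.0, q2=2/3, q3=1/3):
    """Hypothetical sketch: matching degree scaled by matched cognitive weights."""
    related = 2 * (m - non_complete)
    factor1 = k1 * crossing
    factor2 = q1 * non_complete + q2 * crossing + q3 * discrete
    base = (related - factor1) / (m + n + factor2)
    # Step d): sum the weights of the text characters recorded in POS.
    weight_sum = sum(weights[p - 1] for p in pos if p > 0)
    # Step e): scale the matching degree by that sum.
    return base * weight_sum


# Cognitive weights of "what" as derived in Example two of the description.
what_weights = [2 / 5.9, 1.2 / 5.9, 1.1 / 5.9, 1.6 / 5.9]
# Search term "hat" against text "what": POS = [2, 3, 4], no errors of any kind.
print(weighted_matching_degree(3, 4, 0, 0, 0, [2, 3, 4], what_weights))
```

Here the input "hat" misses the heavily weighted first character w, so its degree is scaled by 3.9/5.9 rather than by 1.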

The influence factors are introduced so that the effect of the characteristic parameters on the characteristic matching degree can be adjusted for different products and application environments, making the ordering of the search results better suited to the user requirements of a particular application.

Adding the influence of cognitive weights to the characteristic matching degree improves its calculation. It reflects the actual cognitive differences among characters and draws on psychology, behavioral science, linguistics, statistics, and other disciplines, which makes it especially suitable for dictionary retrieval. Characters that are easy to recognize are given more weight and characters that are prone to error are given less, so that the ordering of the retrieved results based on the calculated characteristic matching degree better meets the user's requirements.

The main factors determining the cognitive weight are:

1) whether the character is in the first position of the word;

2) whether the character is the first character of a syllable of the pronunciation;

3) whether the character is pronounced regularly in its syllable, or has an obvious effect on the pronunciation;

4) whether the character is visually prominent in the word;

5) whether the character is a character of the word; and so on.

Example two

Following the specific embodiment of the string matching method based on characteristic parameters with cognitive weights, this example adds cognitive weights to the electronic English dictionary search. The electronic English dictionary is an electronic dictionary library consisting of English words and their cognitive weights.

The method of Example two is basically the same as that of Example one. The only difference is that each character in the text T stores a corresponding cognitive weight, which adds the influence of the cognitive weights to the calculation of the characteristic matching degree.

One way to calculate the cognitive weights is as follows:

1. if the character is in the first position of the word, it scores 0.4;

2. if the character is the first character of a syllable of the pronunciation, it scores 0.3;

3. if the character is pronounced regularly in its syllable or has an obvious effect on the pronunciation, it scores 0.1;

4. if the character is visually prominent in the word, it scores 0.2;

5. if the character is a character of the word, it scores 1.

For example, consider the English word "what".

The character w satisfies 1, 2, 3, 4, 5, so the character w score = 2;

The character h satisfies 4, 5, so the character h score = 1.2;

The character a satisfies 3, 5, so the character a score = 1.1;

The character t satisfies 2, 3, 4, 5, so the character t score = 1.6.

The English word "what" has a total score of 5.9, and the cognitive weight of each character is:

The cognitive weight of w = w score / "what" total score = 2 / 5.9;

The cognitive weight of h =h score / "what" total score = 1.2/5.9;

The cognitive weight of a = a score / "what" total score = 1.1 / 5.9;

The cognitive weight of t = t score / "what" total score = 1.6 / 5.9;
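The scoring scheme can be sketched as a small normalization routine. The names `CRITERION_SCORES` and `cognitive_weights`, and the representation of each character by the tuple of criteria it satisfies, are assumptions made for this illustration; deciding which criteria a character satisfies is left outside the sketch.

```python
# Scores of the five criteria listed above, keyed by criterion number.
CRITERION_SCORES = {1: 0.4, 2: 0.3, 3: 0.1, 4: 0.2, 5: 1.0}


def cognitive_weights(criteria_per_char):
    """Turn per-character criterion lists into weights that sum to 1."""
    scores = [sum(CRITERION_SCORES[c] for c in criteria)
              for criteria in criteria_per_char]
    total = sum(scores)
    return [score / total for score in scores]


# "what": w satisfies 1-5, h satisfies 4 and 5, a satisfies 3 and 5,
# t satisfies 2-5, as in the worked example.
weights = cognitive_weights([(1, 2, 3, 4, 5), (4, 5), (3, 5), (2, 3, 4, 5)])
print([round(w, 3) for w in weights])
```

Dividing by the total score makes the weights of a word sum to 1, as the definition of the cognitive weight sequence requires.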

Finally, the cognitive weight sequence of the English word "what" is obtained: 2/5.9, 1.2/5.9, 1.1/5.9, 1.6/5.9.

The above specific embodiments show that, compared with the existing distance calculation based on error factors, the characteristic parameters used by the present invention to calculate the characteristic matching degree are conceptually independent, and the calculated parameters reflect in finer detail how the text and the search term differ in each characteristic. Therefore, the characteristic matching degree calculated from the three characteristic parameters reflects the matching situation more reasonably, and the ordering of the dictionary search results better meets the actual needs of the user.

At the same time, we can also see that the string matching method based on characteristic parameters of the invention has strong fault-tolerant retrieval capability and is suitable for dictionary retrieval.

While the invention has been described with respect to its preferred embodiments, it should be understood that variations that are obvious within the spirit and scope of the invention, and all inventions that make use of the inventive concept, fall within the scope of protection.

Claims


1. A string matching method based on characteristic parameters, given a text stored in a storage device and a search term input by an input device, characterized in that an information processing device performs string matching based on characteristic parameters on the given text and search term, the steps being:

Step D), outputting the characteristic matching degree.

2. A string matching method based on a characteristic parameter according to claim 1, wherein the text stored in the storage device is T = "t₁t₂…tₙ" and the search term input by the input device is P = "p₁p₂…pₘ", where tᵢ, pⱼ (1 ≤ i ≤ n, 1 ≤ j ≤ m) are characters and m and n are both greater than or equal to 1, characterized in that the specific steps of calculating the matching relationship between the characters in the text and the search term in step A are:

Step a), stable sorting of the search term P

All characters in the search term P = "p₁p₂…pₘ" are sorted in stable ascending order and stored in the in-memory array PT. The array PT also stores the original position of each character in the search term; the two parts are called the character sub-array PTc and the position sub-array PTp of the array PT, respectively;

Step b), stable sorting of the text T

All characters in the text T = "t₁t₂…tₙ" are sorted in stable ascending order and stored in the in-memory array WT. The array WT also stores the original position of each character in the text; the two parts are called the character sub-array WTc and the position sub-array WTp of the array WT, respectively;

Step c), parameter initialization

The in-memory array POS is used to store the corresponding positions of the characters, with all entries initialized to -1; non-complete number = 0, array WT position W = 1, array PT position P = 1, maximum position = 0, minimum position = n;

Step d), determine whether the loop ends

Step e), comparison

According to the comparison between the character stored at position W in WTc and the character stored at position P in PTc, the following cases are handled respectively:

Step f), end
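Steps a) to c) of this claim can be illustrated as follows. This is a minimal Python sketch under stated assumptions, not the claimed implementation: the helper names `stable_sort_with_positions` and `init_state` are hypothetical, positions are 1-based as in the claim, and the length of POS (one entry per search-term character) is an assumption, since the claim does not state it. The sketch covers only the sorting and the initialization; the comparison loop of steps d) to f) is outside its scope.

```python
def stable_sort_with_positions(s):
    """Sort the characters of `s` in stable ascending order.

    Returns (chars, positions): the sorted characters and, for each one,
    its original 1-based position in `s`. Python's sorted() is stable,
    so equal characters keep their original relative order.
    """
    pairs = sorted(((ch, i) for i, ch in enumerate(s, start=1)),
                   key=lambda pair: pair[0])
    chars = [ch for ch, _ in pairs]
    positions = [pos for _, pos in pairs]
    return chars, positions


def init_state(m, n):
    """Initial values from step c). The POS length of m is an assumption."""
    return {
        "POS": [-1] * m,    # corresponding positions, all initialized to -1
        "non_complete": 0,  # non-complete number
        "W": 1,             # current position in array WT
        "P": 1,             # current position in array PT
        "max_pos": 0,       # maximum position
        "min_pos": n,       # minimum position
    }


# Search term "waht" against the text "what".
PTc, PTp = stable_sort_with_positions("waht")  # PTc = a,h,t,w  PTp = 2,3,4,1
WTc, WTp = stable_sort_with_positions("what")  # WTc = a,h,t,w  WTp = 3,2,4,1
state = init_state(m=4, n=4)
```

Keeping the original positions next to the sorted characters is what lets a later merge-style pass over WTc and PTc map each match back to a position in the unsorted strings.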

3. The character string matching method based on the characteristic parameter according to claim 2, wherein the steps of calculating the characteristic parameters according to the matching relationship between the characters in the text and the search term in step B are:

Step a), non-complete number = (non-complete number + m - position P + 1);

Step b), discrete number = (maximum position - minimum position + 1 - m + non-complete number);
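The two formulas of this claim can be checked with a short worked example. The following Python sketch is illustrative only: the function name `finalize_parameters` is hypothetical, and the loop-state values passed in the examples are invented for illustration rather than taken from the source.

```python
def finalize_parameters(non_complete, m, p, max_pos, min_pos):
    """Apply steps a) and b) of the claim to the state left by the loop.

    non_complete: non-complete number accumulated during the loop
    m:            length of the search term
    p:            position P in array PT when the loop ended (1-based)
    max_pos:      maximum matched position in the text
    min_pos:      minimum matched position in the text
    """
    # Step a): add the search-term characters that were never processed.
    non_complete = non_complete + m - p + 1
    # Step b): span of the matched region minus the search-term length,
    # corrected by the non-complete number.
    discrete = max_pos - min_pos + 1 - m + non_complete
    return non_complete, discrete


# Assumed state for "what" vs. "what": all 4 characters matched, P ended at 5.
exact = finalize_parameters(non_complete=0, m=4, p=5, max_pos=4, min_pos=1)

# Assumed state for "wht" vs. "what": matches at text positions 1, 2 and 4.
gapped = finalize_parameters(non_complete=0, m=3, p=4, max_pos=4, min_pos=1)
```

Under these assumed inputs the exact match yields a discrete number of 0, while the one-character gap in the second case yields a discrete number of 1.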

4. The character string matching method based on the characteristic parameter according to claim 3, wherein the steps of calculating the characteristic matching degree according to the characteristic parameters and the lengths of the text and the search term are:

Step a), calculate the number of related characters, that is, the characters of the search term and the text that actually match

Number of related characters = 2 × (m - non-complete number);

Influence factor 1 = k₁ × crossing number;

Influence factor 2 = q₁ × non-complete number + q₂ × crossing number + q₃ × discrete number;

Step d), calculate the characteristic matching degree

Characteristic matching degree = (number of related characters - influence factor 1) ÷ (m + n + influence factor 2);

where k₁, q₁, q₂, q₃ are the weight coefficients of the characteristic parameters in the characteristic matching degree; k₁ is a real number greater than or equal to zero and less than or equal to 2, and q₁, q₂, q₃ are real numbers greater than or equal to zero. The weight coefficients k₁, q₁, q₂, q₃ can take different values for different products and different applications, thereby affecting the matching degree of the retrieved text and the sorting of the retrieved texts.

5. The character string matching method based on the characteristic parameter according to claim 4, wherein the values of the weight coefficients k₁, q₁, q₂, q₃ are k₁ = 2/3, q₁ = 1, q₂ = 2/3, q₃ = 1/3.
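The characteristic matching degree formula of claim 4, with the coefficient values of claim 5 as defaults, can be sketched as follows. This is a minimal Python illustration, not the claimed implementation; the function and parameter names are hypothetical, and the parameter values in the examples are assumed inputs rather than results computed by the matching steps.

```python
def characteristic_matching_degree(m, n, non_complete, crossing, discrete,
                                   k1=2 / 3, q1=1.0, q2=2 / 3, q3=1 / 3):
    """Characteristic matching degree from the three characteristic parameters.

    m, n:         lengths of the search term and of the text
    non_complete: non-complete number
    crossing:     crossing number
    discrete:     discrete number
    k1, q1..q3:   weight coefficients (defaults are the claim 5 values)
    """
    related = 2 * (m - non_complete)  # number of related characters
    factor1 = k1 * crossing           # influence factor 1
    factor2 = q1 * non_complete + q2 * crossing + q3 * discrete  # factor 2
    return (related - factor1) / (m + n + factor2)


# Identical strings of length 4: every parameter is 0, so the degree is 1.
exact = characteristic_matching_degree(4, 4, 0, 0, 0)

# Assumed parameters for "wht" against "what": one gap, so discrete = 1.
gapped = characteristic_matching_degree(3, 4, 0, 0, 1)  # 6 / (7 + 1/3) = 9/11
```

Because the crossing number appears both in the numerator (through k₁) and in the denominator (through q₂), raising either coefficient penalizes transposed characters more strongly, which is how different products can tune the ranking.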

6. A character string matching method based on a characteristic parameter according to claim 3, wherein the steps of calculating the crossing number are:

(1) Find the length of the maximum intervalable ascending sequence;

7. The character string matching method based on a characteristic parameter according to claim 6, wherein the steps of finding the length of the maximum intervalable ascending sequence are:

(1), initialization

(2), judge whether to end

If the comparison data is the end marker, go to step (4);

(3), according to the comparison, two cases are handled

If the comparison data is smaller than the data at position LP in the array LPOS, the array LPOS is searched from its second position onward for the first data larger than the comparison data, and that data is overwritten with the comparison data; the next value in the temporary array is then taken as the comparison data, and the procedure returns to step (2);

(4), the length of the maximum intervalable ascending sequence = LP.
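The overwrite rule in step (3) matches the standard technique for finding the length of a longest strictly ascending subsequence whose elements need not be adjacent. The following Python sketch shows that standard technique, not the claimed procedure itself: the list `tails` plays the role of the array LPOS and its length plays the role of LP, the "append" branch is assumed to be the other of the two cases in step (3), and the binary search from the standard `bisect` module stands in for the linear search described in the claim.

```python
from bisect import bisect_left


def longest_ascending_length(values):
    """Length of the longest strictly ascending subsequence of `values`.

    tails[k] holds the smallest possible last element of an ascending
    subsequence of length k + 1 seen so far; tails stays sorted.
    """
    tails = []
    for x in values:
        if not tails or x > tails[-1]:
            # Assumed first case: x extends the longest sequence found so far.
            tails.append(x)
        else:
            # Case described in the claim: overwrite the first entry that is
            # not smaller than x. For distinct values this is the first
            # entry larger than x.
            tails[bisect_left(tails, x)] = x
    return len(tails)


# Position sequences from a search term such as "waht" contain no duplicates.
length_a = longest_ascending_length([2, 3, 4, 1])  # 2, 3, 4 is ascending
length_b = longest_ascending_length([3, 2, 4, 1])  # e.g. 2, 4
```

Overwriting never changes the number of entries, so the final length depends only on how often the append branch is taken.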

8. The character string matching method based on a characteristic parameter according to claim 3, wherein each character of the text T = "t₁t₂…tₙ" has a correspondingly stored cognitive weight w, the weights w₁, w₂, …, wₙ together constituting the cognitive weight sequence of the text and satisfying w₁ + w₂ + … + wₙ = 1, where the cognitive weight wᵢ (1 ≤ i ≤ n) represents the probability that the character tᵢ is recognized in the text "t₁t₂…tₙ";

The steps of calculating the characteristic matching degree according to the characteristic parameters and the lengths of the text and the search term in step C are:

Number of related characters = 2 × (m - non-complete number);

Influence factor 1 = k₁ × crossing number;

Step e), characteristic matching degree = [(number of related characters - influence factor 1) ÷ (m + n + influence factor 2)] × the sum of the cognitive weights.