AU2018102145A4 - Method of establishing English geographical name index and querying method and apparatus thereof - Google Patents
Method of establishing English geographical name index and querying method and apparatus thereof Download PDFInfo
- Publication number
- AU2018102145A4 AU2018102145A4 AU2018102145A AU2018102145A AU2018102145A4 AU 2018102145 A4 AU2018102145 A4 AU 2018102145A4 AU 2018102145 A AU2018102145 A AU 2018102145A AU 2018102145 A AU2018102145 A AU 2018102145A AU 2018102145 A4 AU2018102145 A4 AU 2018102145A4
- Authority
- AU
- Australia
- Prior art keywords
- geographical name
- english
- refers
- geographical
- index
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/907—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/909—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Library & Information Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Disclosed is a method of querying an English geographical name dictionary, which belongs to the field of natural language processing. The method of querying an English geographical name dictionary based on an inverted index of feature statistics is obtained by performing 5 geographical name query with text features such as the total number of letters, the number of letter radicals, the total number of words and word initial codes in a geographical name according to a main line of "statistics of multi-dimensional features - generation of an inverted index - query of candidate geographical names - sorting of similarities". This method not only maintains a high operating efficiency in a large-scale data environment, but also obtains a target 0 geographical name accurately in the case of an inaccurate description of a queried geographical name. In this way, the users will have better experiences.
Description
) METHOD OF ESTABLISHING ENGLISH GEOGRAPHICAL NAME INDEX AND > QUERYING METHOD AND APPARATUS THEREOF
I •
) TECHNICAL FIELD
I [0001] The present disclosure relates to the field of natural language processing, and in particular to a method of querying an English geographical name dictionary oriented to largeI scale geographical name data.
I BACKGROUND ; [0002] Query of a geographical name dictionary is a basic operation of applications such as > geographical name spelling checking, fuzzy matching and optical recognition, and provides ' 0 knowledge support for geographical names. With the accelerated globalization process, geographical names are transmitted internationally in a increasingly high speed and used in a increasingly high frequency. English, as one of widely used languages in the world, is usually used as a standard for translation, storage and management of geographical names between different languages or characters. Meanwhile, with explosive growth of data and rapid development of information storage technology, a large-scale geographical name data set becomes increasingly popular. Therefore, how to efficiently query an English geographical name dictionary in a large-scale data environment becomes an important technical challenge to improve numerous geographical name services and applications.
[0003] In a conventional method of querying a dictionary, a query record is generally obtained by a sequential traversal or binary search method. However, it is very difficult for the method to satisfy an actual need when applied in massive data since its operation efficiency has a linear relationship with a data size. An inverted file, as a simple and efficient document data indexing manner, is a basic technology to realize a modern search engine retrieval system, and therefore is gradually introduced into a dictionary query mechanism. Word-level indexing is a general 25 organization manner for the inverted file to realize a phrase or proximity query, where an Ngram index is the commonest word-level indexing structure. Although the N-gram structure improves a recall rate of the query to a certain extent, lemmas generated by the N-gram usually greatly increases spatial resource occupation, resulting in reduced speeds of construction processing and query processing. In addition, similarity calculation is required when the fuzzy 30 query is performed with an index term formed in a morpheme form, and similarity comparison is performed for each index term and a query condition. Such query mode greatly increases complexity of an operation mechanism, and application requirements of the large-scale data environment are very difficult to satisfy.
) [0004] Therefore, to satisfy the actual application requirements of different scenarios, how to > efficiently return a completely accurate or the closest query result in a case of an inaccurate and
I > incomplete input during query of an English geographical name is to be studied and solved by ) those skilled in the art at present.
I * SUMMARY [0005] Technical problems )
[0006] The technical problems to be solved by the present disclosure include: how to I efficiently return a completely accurate or the closest query result in a case of an inaccurate and
J incomplete input during query of an English geographical name, and other technical problems * relevant to the technical problem.
' 0 [0007] Summary [0008] The summary of technical contribution contents is as follows: the key of improving query performance is to extract text features contained in the English geographical name and combine the text features with the dictionary query mechanism. According to the present disclosure, a method of querying an English geographical name dictionary based on an inverted 5 index of feature statistics is disclosed by performing geographical name query with the text features such as the total number of letters, the number of letter radicals, the total number of words and word initial codes of the geographical name according to the main line of “statistics of multi-dimensional features - generation of an inverted index - query of candidate geographical names - sorting of similarities”.
[0009] Technical solutions [0010] According to a first aspect, the present disclosure provides a method of establishing an English geographical name index. The method may be applied to a user device, including: amounting a plurality of feature values of all English geographical name phrases stored in text of English geographical name dictionary, wherein the feature value comprises the total number 25 of letters, the number of letter radicals, the total number of words and word initial codes;
b) generating a group of corresponding multi-dimensional feature statistical vectors according to the different feature values of the English geographical name phrases; and
c) establishing an inverted index file by using the multi-dimensional feature statistical vectors of different English geographical name phrases and their position mapping information in an inverted list as index terms, wherein each of the index terms corresponds to one inverted chain respectively.
[0011] The process and principle of the above method of establishing an English geographical name index are detailed below.
2018102145 12 Oct 2018 [0012] Firstly, for statistics of feature values, the feature values of all geographical name phrases stored in the English geographical name dictionary are sequentially counted. The statistics of the feature values include: the total number of letters, the number of letter radicals, the total number of words and word initial codes. (1) The total number of letters refers to a sum of all letters included in a geographical name phrase. (2) The letter radicals refer to that each
English letter is assumed to be comprised of partial radicals of six radicals, i.e., “|” , “— ” , “I” , , “ (” and “)” according to a hieroglyphic thought of Chinese characters, where radical expressions of different letters are shown in the following Table 1. Obviously, two character strings are more similar if the two characters have more same 0 characters. However, if there are more English letters, an appearance frequency of each letter recorded in the index term will occupy excessive storage space and is not conducive to comparison of character strings. Therefore, when each letter is expressed by fixed radicals, comparison complexity of query may be simplified under the precondition of implicitly recording features of the appearance frequency of the letter. (3) The total number of words 5 refers to a sum of all words included in the geographical name phrase. (4) The word initial codes refer to that the initial letters of words in the geographical name phrase are converted into digital codes. A conversion rule is as follows: mapping the codes of “01” to “26” in a sequence of A to Z respectively, that is, the code of A is “01”, the code of B is “02”, and so on. In a code conversion process, the initial letters are converted into capital letters in a unified way.
| letter | radical | letter | radical | letter | radical | letter | radical |
| A | 423 | B | 166 | C | 5 | D | 16 |
| E | 1222 | F | 122 | G | 51 | H | 121 |
| I | 212 | J | 213 | K | 143 | L | 12 |
| M | 1313 | N | 131 | 0 | 56 | P | 116 |
| Q | 563 | R | 1163 | s | 526 | T | 21 |
| u | 151 | V | 34 | w | 1414 | X | 643 |
| Y | 341 | Z | 242 | a | 55 | b | 16 |
| c | 5 | d | 51 | e | 255 | f | 5162 |
| g | 516 | h | 126 | i | 2212 | j | 2216 |
| k | 243 | 1 | 125 | m | 1313 | n | 131 |
| 0 | 56 | P | 116 | q | 514 | r | 126 |
| s | 526 | t | 215 | u | 1515 | v | 34 |
| w | 1414 | x | 643 | y | 3415 | z | 242 |
Table 1 [0013] The radical ‘ j” is indicated by the number 1, the radical “—” is indicated by the number
2, the radical is indicated by the number 3, the radical “\” is indicated by the number 4, the radical “(” is indicated by the number 5, and the radical “)” is indicated by the number 6. [0014] Further, for the composition of the index term, the total number of letters, the number of letter radicals, the total number of words, the world initial codes and inverted list position information are sequentially recorded in each index term in an index dictionary respectively. In the formula, fcn refers to the total number of letters for recording a 1-dimensional vector; far refers to the number of letter radicals for recording a 6-dimensional vector, where there is number information of a total of six radicals;/;,,, refers to the total number of words for recording 5 the 1-dimensional vector; f™refers to the word initial codes for recording a 4-dimensional vector, where information of the word initial codes of the first four words of the geographical name phrase is recorded in this method, and the vacancy resulting from less than four words is complemented with the codes “00”. These vectors are combined according to the formulas (1), (2) and (3) to establish a 12-dimensional vector di. The vector di, as an index term, fully 0 represents the text features of the character strings of the English geographical name, and this is taken as an entrance of querying the English geographical name.
| di ~ \.fcn>far>fwn>ftw\ | (1) |
| ar \farl’ far2> | (2) |
| iw fiw2> > fiw4·] | (3) |
[0015] Further, for construction of an inverted chain file, each index term appearing in the dictionary corresponds to one inverted chain, and the inverted chain records hit information of the index term in the geographical name dictionary by using a data structure (tf, <pl, p2, ..., pf>) of one document hit record. In the structure, tf refers to the number of appearances of the index term in the geographical name dictionary, and pi refers to position offset information of each appearance in the geographical name dictionary. All pieces of hit information are orderly arranged to form their corresponding inverted chains.
[0016] According to a second aspect, the present disclosure further provides a method of querying an English geographical name. The method is applied to a user device, including: obtaining a search keyword input by a user on the user device; querying a candidate geographical name set relating to the search keyword according to an index file pre-established in an English geographical name database, wherein the index file stored on the user device is constructed according to the method of establishing an English geographical name index of first 25 aspect; and returning the candidate geographical name set to the user device for display.
[0017] Process and principle of the above method of querying an English geographical name are detailed below.
[0018] A process of selecting a candidate geographical name set is described below.
[0019] Firstly, standardization processing is performed for a submitted queried geographical 30 name, that is, words in the geographical name phrase are converted into a form with initial letters being capitalized.
[0020] Secondly, according to a feature statistical rule during index construction, different feature values of the queried geographical name are counted, and organized into a vector form, ) which is denoted as Q=[qfCn, qfar,qfi™, qfiw].
> [0021] Thirdly, Q and the index term in the index dictionary are compared. The index term di
I > is a candidate term when the formula (4) is satisfied.
• ) ί II fen [fen * (1 — kcn),fcn X (1 + kcn)] | I qfar [far X (1 — kar),far X (1 + kar)] | ifwn [fwn X (1 — kwn)> fwn X (1 + ^wn)]
Qfiw [fiw X (1 - kiw),fiw X (1 + /tiw)] ! [0022] In the above formula, fcn refers to the total number of letters, far refers to the number of j 5 letter radicals, fwn refers to the total number of words, and fiw refers to the initial codes; kcn refers ( to a dimensional threshold of the total number of letters, kar refers to a dimensional threshold of
J the number of letter radicals, kwn refers to a dimensional threshold of the total number of words,
I and kiw refers to a dimensional threshold of the initial codes.
[0023] Fourthly, index information in di is reversely analyzed, and geographical name data at 0 a relevant storage position in the geographical name dictionary is obtained according to the corresponding position offset information <pi, p2, ..., pf> in the inverted chain. Result combination is performed for all of the obtained geographical name data to form a candidate geographical name set.
[0024] In some preferred solutions, the above method of querying an English geographical name may further include the following blocks: after the candidate geographical name set is obtained, calculating a similarity value of each English geographical name in the candidate geographical name set and the search keyword; and sorting the English geographical names in the candidate geographical name set in a descending sequence of the similarity values, and returning a sorting result to the user device for display.
[0025] A process of sorting sequence similarities of the candidate geographical name sets is described below.
[0026] Firstly, the sequence similarities are calculated, that is, the sequence similarities of all geographical name phrases in the candidate geographical name set and the queried geographical name are calculated. It is assumed that there are two geographical name character strings, i.e., 25 P=pip2.. .pn and W=wiw2.. .wm, N refers to characters with a same sequence between P and W.
The same sequence of N is determined based on two principles. (1) A principle of a locally same sequence. N is formed by a locally similar term Isi, and a plurality of Isi may exist between P and W. If a sub-string qi=PjPj+i...pk exists in P, and is completely same as a sub-string WsWs+i...wt in W, Isi conforms to the principle of the locally same sequence, and is set to one 30 locally similar term. (2) A principle of a totally same sequence. N is formed by Isi with the same sequence between P and W. The geographical name similarity of P and W is calculated based on the following formula (5).
(5) j len(lV) ; Sim(P-M/) = max(len(P),lenm) 1 [0027] In the above formula, sim (P, W) refers to a sequence similarity value between P and
J W, and len (N), len (P) and len (W) refer to character string length values of N, P and W
I respectively.
[0028] Secondly, the sequence similarities are sorted. The candidate geographical names are ( 5 sorted based on the sequence similarities, and a sorting result is returned as a final query result ( to the user.
| [0029] Technical effects [0030] In the present disclosure, the geographical name is queried according to a main line of > “ statistics of multi-dimensional features - generation of an inverted index - query of candidate ' 0 geographical names - sorting of similarities” by summarizing the multi-dimensional text statistical features such as the number of words and the number of letters in the geographical name. In the process of index generation, the features including the total number of letters, the number of letter radicals, the total number of words and word initial codes in each geographical name record are extracted, and a corresponding inverted index structure is constructed with a 5 vector formed by multi-dimensional features as the index term. In the process of searching candidate geographical names and sorting sequence similarities, standardization processing and multi-dimensional feature extraction are performed for a query request, the candidate geographical name sets are obtained by querying the inverted index according to the generated feature vector, and the candidate sets are returned to the user in a descending sequence of the 0 similarities. It is proved by experiments that the method of querying an English geographical name dictionary based on inverted index of feature statistics according to the examples of the present disclosure not only maintains a high operating efficiency in a large-scale data environment, but also obtains a target geographical name accurately in the case of an inaccurate description of a queried geographical name. In this way, the users will have better experiences.
BRIEF DESCRIPTION OF THE DRAWINGS [0031] FIG. 1 is a flowchart illustrating a method of establishing an English geographical name index according to an example of the present disclosure.
[0032] FIG. 2 is a flowchart illustrating a method of querying an English geographical name according to an example of the present disclosure.
[0033] FIG. 3 is a flowchart illustrating a method of querying an English geographical name according to a preferred example of the present disclosure.
[0034] FIG. 4 is a principle diagram illustrating an apparatus for querying an English > geographical name according to an example of the present disclosure.
> [0035] FIG. 5 is a principle diagram illustrating an apparatus for querying an English
I > geographical name according to a preferred example of the present disclosure.
• ) [0036] FIG. 6 is a graphical flowchart illustrating a method of querying an English
I 5 geographical name dictionary according to an example of the present disclosure.
[0037] FIG. 7 is a schematic diagram illustrating an inverted index structure in a method of » establishing an English geographical name index according to an example of the present ( disclosure.
I [ DETAILED DESCRIPTION OF THE EMBODIMENTS > 0 [0038] Implementations of the present disclosure will be described by specific examples.
' Those skilled in the art may easily understand other advantages and effects of the present disclosure based on contents disclosed in the specification.
[0039] Explanations of technical names [0040] Letter radicals refer to that the composition of capital or small letters is described by using six characters (i.e., radicals) of “—”, “\”, “(” and “)”, that is, any capital or small letter is comprised of partial characters of the six characters. If the six characters of “—”, “\”, “(” and “)” are numbered as 1-6 in sequence respectively, the radicals of any letter may be indicated by a string of numbers. For example, “L” may be comprised of and “—”, so that the radicals of the “L” are indicated by the number “12”.
[0041] The number of letter radicals refers to that the number of all radicals are counted after the radicals of all letters in the English geographical name are indicated by numbers corresponding to the radicals (it is required that the initial letter of each word in the English geographical name is a capital letter).
[0042] The letter codes refer to that the initial letters of words in the geographical name phrase 25 are converted into digital codes. The conversion rule is to map codes from “01” to “26” in a sequence of A to Z, that is, the code of A is “01”, the code of B is “02”, and so on.
[0043] Example 1 [0044] As shown in FIG. 1, this example provides a method of establishing an English geographical name index. The method is applied to a user device, and the method includes the 30 following blocks:
[0045] SH, counting a plurality of feature values of all English geographical name phrases stored in text of English geographical name dictionary, wherein the feature value comprises the total number of letters, the number of letter radicals, the total number of words and word initial codes;
2018102145 12 Oct 2018 [0046] S12, generating a group of corresponding multi-dimensional feature statistical vectors according to the different feature values of the English geographical name phrases; and [0047] SI3, establishing an inverted index file by using the multi-dimensional feature statistical vectors of different English geographical name phrases and their position mapping 5 information in an inverted list as index terms, wherein each of the index terms corresponds to one inverted chain respectively.
[0048] Specifically, the multi-dimensional feature statistical vector is:
d-i = [fen> far> fwn> fiwl/herein di refers to the multi-dimensional feature statistical vector of the English geographical name, fcn refers to the total number of letters, far refers to the number 0 of letter radicals, fwn refers to the total number of words, fiw refers to initial codes, the far comprises number information of six radicals, and the fiw comprises initial code information of the first four words in the English geographical name phrase.
[0049] Specifically, the method of establishing an English geographical name index may further include: when an English geographical name is queried according to the index term, 5 comparing a search keyword with the index term; when the index term satisfies the following condition, taking the index term as a candidate term of query, [0050] wherein the condition comprises:
hfen [fcn X (1 — ^cn)> fcn X (1 + ^cn)]
I far θ [far X (1 — kar)>far X (1 3 kar)] fwn [fwn X (1 - ^wn)> fwn X (1 + ^wn)] qfiw [fiw X (1 - kiw)>fiw X (1 + kjw)] [0052] wherein qfcn refers to the total number of letters in the search keyword, qfar refers to 20 the number of letter radicals in the search keyword, qfwn refers to the total number of words in the search keyword, qfiw refers to the initial codes in the search keyword, kcn refers to a dimension threshold of the total number of letters, kar refers to a dimension threshold of the number of letter radicals, kwn refers to a dimension threshold of the total number of words, and kiw refers to a dimension threshold of the initial codes.
[0053] Example 2 [0054] As shown in FIG. 2, this example provides a method of querying an English geographical name. The method is applied to a user device, and the method includes the following blocks:
[0055] S21, obtaining a search keyword input by a user on the user device;
[0051] ) [0056] S22, querying a candidate geographical name set relating to the search keyword > according to an index file pre-established in an English geographical name database, wherein
I > the index file stored on the user device is constructed according to the method of establishing •
) an English geographical name index of example 1; and
I 5 [0057] S23, returning the candidate geographical name set to the user device for display.
[0058] As a preferred example, as shown in FIG. 3, after the candidate geographical name set ! is obtained, the method of querying an English geographical name may further include:
I [0059] S31, calculating a similarity value of each English geographical name in the candidate J geographical name set and the search keyword; and ) 0 [0060] S32, sorting the English geographical names in the candidate geographical name set
I in a descending sequence of the similarity values, and returning a sorting result to the user device for display.
[0061] Specifically, the similarity value of each English geographical name in the candidate geographical name set and the search keyword is calculated based on the following formula:
[0062] sim(P, VF) =--- 'θ'.
L J v 7 max(len(P),len(lV)) [0063] wherein P refers to a character string of the search keyword, W refers to a character string of the English geographical name, sim (P, W) refers to a sequence similarity value between P and W, len (N), len (P) and len (W) refer to character string length values of N, P and W respectively, and N refers to characters with a same sequence between P and W.
[0064] Example 3 [0065] As shown in FIG. 4, this example provides an apparatus 300 for querying an English geographical name. The apparatus 300 is applied to a user device, and the apparatus 300 specifically includes a receiving module 310, a querying module 320 and a displaying module 330. The receiving module 310 is configured to obtain a search keyword input by a user on the 25 user device. The querying module 320 is configured to search a candidate geographical name set relating to the search keyword according to an index file pre-established in an English geographical name database, where the index file stored on the user device is constructed according to the method of establishing an English geographical name index of claim 1 or 2.
The displaying module 330 is configured to display the candidate geographical name set 30 returned to the user device.
[0066] In a preferred solution, as shown in FIG. 5, the apparatus for querying an English geographical name further includes a similarity calculating module 410 and a sorting module 420. The similarity calculating module 410 is configured to calculate a similarity value of each
English geographical name in the candidate geographical name set and the search keyword after obtaining the candidate geographical name set. The sorting module 420 is configured to sort
English geographical names in the candidate geographical name set in a descending sequence of the similarity values, and return a sorting result to the user device. The displaying module displays the sorting result.
[0067] Specifically, a formula for calculating the similarity value of each English geographical name in the candidate geographical name set and the search keyword in the similarity calculating module comprises:
2018102145 12 Oct 2018 sim(P, VF) len(N) max(len(P),len(lV))’ wherein P refers to a character string of the search keyword, W refers to a character string of the English geographical name, sim (P, W) refers to a sequence similarity value between P and W, len (N), len (P) and len (W) refer to character string length values of N, P and W respectively, and N refers to characters with a same sequence between P and W.
[0068] To enable those skilled in the art to understand the present disclosure more clearly, contents of the above examples will be described in detail with a geographical name “Aalders
Lang Brook” in combination with FIG. 1. For convenience of description and understanding, descriptions will be made in a logical sequence of a process of generating an index — a process of searching candidate geographical names - a process of sorting sequence similarities. [0069] (I ) The process of generating an index includes the following blocks.
[0070] At block 11, feature values including the total number of letters, the number of letter radicals, the total number of words and word initial codes of all geographical name phrases stored in the English geographical name dictionary are sequentially counted. The geographical name “Aalders Lang Brook” is taken as an example, and the total number of letters is 16. For the number of letter radicals, the numbers of appearances of six radicals, i.e., “—”, “\”, “(” and “)” are 9, 8, 3, 2, 12 and 9 respectively. The total number of words is 3. The word initial codes are 1, 12, 2 and 0 respectively.
[0071] At block 12, an index dictionary file is constructed. In the index dictionary, the total number of letters, the number of letter radicals, the total number of words, the word initial codes and inverted list position information are sequentially recorded in each index term respectively.
The geographical name “Aalders Lang Brook” is taken as an example. The total number of letters is 16, the numbers of letter radicals are 9, 8, 3, 2, 12 and 9, the total number of words is 3, and the word initial codes are 1, 12, 2 and 0. Therefore, the multi-dimensional feature vector is expressed as [16, [9, 8, 3, 2, 12, 9], 3, [1, 12, 2, 0]]. Further, because of the inverted list position mapping information <1001>, the index term structure in the index dictionary file is ([16, [9, 8, 3, 2, 12, 9], 3, [1, 12, 2, 0]], <1001>).
[0072] At block 13, an inverted chain file is constructed. Each index term appearing in the dictionary corresponds to one inverted chain, and the inverted chain records hit information of the index term in the geographical name dictionary by using a data structure (tf, <pi,p2, ·,Pf>) of one document hit record. The multi-dimensional feature vector [16, [9, 8, 3, 2, 12, 9], 3, [1, 12, 2, 0]] is taken as an example, and its corresponding inverted list position mapping information is <1001>, that is, storage position information of phrases with all multidimensional feature vectors in the English geographical name dictionary is stored at the position 1001 of the inverted chain file. For example, if the information recorded at the position 1001 of the inverted chain file is (<5>, <7>, ..., < 125>, ...), it indicates that the storage positions of the relevant geographical name phrases in the English geographical name dictionary are 5, 7, ..., 125,... respectively.
[0073] (II) The process of searching candidate geographical names includes the following blocks.
[0074] At block 21, standardization processing is performed for a submitted queried geographical name, that is, words in the geographical name phrase are converted into a form with initial letters being capitalized. The queried geographical name “Alders langbrook” is taken as an example, the phrase is required to be converted into “Alders Lang Brook”.
[0075] At block 22, according to a feature statistical rule during index construction, different feature values of the queried geographical name are counted, and organized into a vector form, which is denoted as Q=[qfcn, qfar, qfwn, qfiw]. The queried geographical name “Alders langbrook” is taken as an example, the total number of letters is 15, the numbers of letter radicals are 9, 8, 3, 2, 10 and 9, the total number of words is 3, the word initial codes are 1, 12, 2 and 0, and the multi-dimensional statistical vector is [15, [9, 8, 3, 2, 10, 9], 3, [1, 12, 2, 0]].
[0076] At block 23, Q and the index term in the index dictionary are compared. The index term di is a candidate term when the formula (4) is satisfied.
f Qfcn [fen X (1 — kcn)> fen X (1 + ^cn)]
I Qfar [far X (1 — karffar X (1 + kar)] | Qfwn [fwn X (1 - kwn)> fwn X (1 + ^wn)]
Qfiw [fiw X (1 - X (1 + kfiv)] [0077] In the above formula, fcn refers to the total number of letters, far refers to the number of letter radicals, fwn refers to the total number of words, and fjw refers to the initial codes; kcn refers to a dimensional threshold of the total number of letters, kar refers to a dimensional threshold of the number of letter radicals, kwn refers to a dimensional threshold of the total number of words, and kiw refers to a dimensional threshold of the initial codes.
[0078] At block 24, index information in di is reversely analyzed, and geographical name data
2018102145 12 Oct 2018 at a relevant storage position in the geographical name dictionary is obtained according to the corresponding position offset information <pi, p2, ..., pf> in the inverted chain. Result combination is performed for all of the obtained geographical name data to form a candidate geographical name set. The queried geographical name “Alders langbrook” is taken as an example. The index term ([16, [9, 8, 3, 2, 12, 9], 3, [1, 12, 2, 0]], <1001>) obtained at block 23 is a candidate term qdi, all inverted chain mapping position information in qdi is analyzed, and the relevant record <1001> is queried in the inverted chain. Then, the relevant geographical name phrases are queried in the English geographical name dictionary file by use of the dictionary storage position information (<5>, <7>,..., < 125>,...) included in the record <1001 >, 0 and all geographical names form a candidate geographical name set C.
[0079] (III) The process of sorting sequence similarities includes the following blocks.
[0080] At block 31, similarities between geographical names are determined by counting a quantity proportion of characters with the same sequence between two character strings. It is assumed that there are two geographical name character strings, i.e., P=pip2...pn and 5 W=wiw2...wm , N refers to characters with a same sequence between P and W. The same sequence of N is determined based on two principles. (1) A principle of a locally same sequence. N is formed by a locally similar term Isi, and a plurality of Isi may exist between P and W. If a sub-string qi=pjpj+ i...pk exists in P, and is completely same as a sub-string /...wt in W, Is, conforms to the principle of the locally same sequence, and is set to one locally similar term. 0 (2) A principle of a totally same sequence. N is formed by Is, with the same sequence between
P and W. For example, P=“Aalders Lang Brook” and JF=“Lang Aalders Brook”. “Aalders”, “Lang” and “Brook” are locally similar terms Is/, ls2 and ls3 respectively according to the principle of the locally same sequence. The sequence in P is lsils2lss, and the sequence in W is Is2ls/ls3. If the sequence in the queried geographical name P is taken as a reference, Is/lss 25 conforms to the principle of the totally same sequence, and therefore, N=ls/ls3. The similarity of the geographical names P and W may be calculated based on the formula (5).
len(N) ( 5 sim(P IV Ί =-----------------max(-len(-py ien^W^ ) [0081] In the above formula, sim (P, W) refers to a sequence similarity value between P and W, and len (N), len (P) and len (W) refer to character string length values of N, P and W respectively. That is, the similarity of “Aalders Lang Brook” and “Lang Aalders Brook” is 30 12/16-0.75.
[0082] At block 32, the sequence similarities are sorted. The geographical names Cq in the candidate geographical name set C are sorted in a descending sequence of similarities based on the similarity calculation result of block 31, and the geographical names Cq ranked at the first Ο n positions are taken as query results.
+_> [0083] Experimental analysis
O
O [0084] In this example, to verify technical effects of the present disclosure, an English 5 geographical name dictionary is constructed with 115,000 pieces of English geographical name data as an example, and 5409 geographical names are extracted as standard geographical names
IT) from the dictionary. A test set is constructed by adding artificial errors in the standard geographical names. The error types include a plurality of inaccurate description manners (such ru
O as redundant letters, missing letter, letter error and letter sequence replacement), and are divided OO 0 into five levels (as shown in the table) according to an accuracy of the standard geographical O names with added errors compared with the original standard geographical names. A definition of the accuracy is as shown in the formula (6).
accu(P, C) = — (6) k 7 N [0085] In the above formula, A refers to the number of correct characters of the queried geographical name P compared with the target geographical name C, N refers to the number of 5 characters of the queried geographical name P, and accu (P, C) refers to the accuracy of P.
| Level | Accuracy of a single geographical name in the test set | Number of geographical names in the test set | Example of the test set |
| Test set 1 | [90%, 100%) | 1080 | Abbott Afine (Abbott Mine); Abbotts Creek Centetery (Abbotts Creek Cemetery) |
| Test set 2 | [80%, 90%) | 1002 | Black Oak Baptist Cl (Black Oak Baptist Church) ; Black WalntttChttrch (Black Walnut Church) |
| Test set 3 | [70%, 80%) | 1091 | Aldi ne Clatlrch (Aldine Church) ; }\lexlslt (Alex Island ) |
| Test set 4 | [60%, 70%) | 1060 | Allyns (Allyns Lake) ; AntpersandNlc (Ampersand Mountain) |
| Test set 5 | [50, 60%) | 1049 | BlxtedloveRc (BreedloveReservoir) ; BozeCc(Boze Cemetery) |
Table 2 Division detail of the test set in an example
Note: contents in parentheses are target geographical names corresponding to the tested geographical names, that is, standard geographical names.
[0086] In addition, in the experiment, the query results of the queried geographical names with different accuracies are as shown in the following Table 3 according to the present disclosure.
2018102145 12 Oct 2018
| Test set | Number of geographical names in the test set (pc) | Number of accurately queried geographical names (pc) | Accuracy (%) | Average efficiency (ms) |
| 1 | 1080 | 1008 | 93.33 | 14 |
| 2 | 1002 | 934 | 93.21 | 11 |
| 3 | 1091 | 847 | 77.64 | 14 |
| 4 | 1060 | 720 | 67.92 | 13 |
| 5 | 1049 | 647 | 61.68 | 12 |
Table 3 Statistics of evaluation indicators of experiment results [0087] The experimental results show that the method of querying an English geographical name dictionary based on inverted index of feature statistics according to the examples of the present disclosure not only maintains a high operating efficiency in a large-scale data environment, but also obtains a target geographical name accurately in the case of an inaccurate description of a queried geographical name.
[0088] The above examples are only illustrative of the principles and effects of the present 0 disclosure, rather than intended to limit the present disclosure. Any person skilled in the art may make modifications or changes to the above examples without departing from the spirit and scope of the present disclosure. Therefore, all equivalent modifications or changes completed by those of ordinary skill in the art without departing from the spirit and technical idea of the present disclosure shall still be encompassed in the claims of the present disclosure.
Claims (8)
1. A method of establishing an English geographical name index, the method being applied to a user device, and comprising:
counting a plurality of feature values of all English geographical name phrases stored in 5 text of English geographical name dictionary, wherein the feature value comprises the total number of letters, the number of letter radicals, the total number of words and word initial codes;
generating a group of corresponding multi-dimensional feature statistical vectors according to the different feature values of the English geographical name phrases; and establishing an inverted index file by using the multi-dimensional feature statistical vectors 0 of different English geographical name phrases and their position mapping information in an inverted list as index terms, wherein each of the index terms corresponds to one inverted chain respectively.
2. The method of establishing an English geographical name index according to claim 1, wherein the multi-dimensional feature statistical vector is:
5 di — [fcn>far>fwn>fiw\>
wherein di refers to the multi-dimensional feature statistical vector of the English geographical name, fcn refers to the total number of letters, far refers to the number of letter radicals, l'wn refers to the total number of words, fiw refers to initial codes, the far comprises number information of six radicals, and the flw comprises initial code information of the first 0 four words in the English geographical name phrase.
3. The method of establishing an English geographical name index according to claim 2, further comprising:
when an English geographical name is queried according to the index term, comparing a search keyword with the index term; when the index term satisfies the following condition, 25 taking the index term as a candidate term of query, wherein the condition comprises:
i tyfon [fcn X (1 — kcn)> fcn x (1 + ^cn)]
I tffar [far X (1 — ^ar)> far X (1 + | tffwn [fwn X (1 - ^wn)> fwn X (1 + ^wn)]
Tlfiw [fiw X (1 - kiw),fiw X (1 + k/w)] wherein qfcn refers to the total number of letters in the search keyword, qfar refers to the number of letter radicals in the search keyword, qfwn refers to the total number of words in the search keyword, qfiw refers to the initial codes in the search keyword, kcn refers to a dimension threshold of the total number of letters, kar refers to a dimension threshold of the number of letter radicals, kwn refers to a dimension threshold of the total number of words, and kiw refers to a dimension threshold of the initial codes.
2018102145 12 Oct 2018
5 4. A method of querying an English geographical name, the method being applied to a user device, and comprising:
obtaining a search keyword input by a user on the user device;
querying a candidate geographical name set relating to the search keyword according to an index file pre-established in an English geographical name database, wherein the index file 0 stored on the user device is constructed according to the method of establishing an English geographical name index of claim 1 or 2; and returning the candidate geographical name set to the user device for display.
5. The method of querying an English geographical name according to claim 4, wherein after the candidate geographical name set is obtained, the method further comprises:
5 calculating a similarity value of each English geographical name in the candidate geographical name set and the search keyword; and sorting the English geographical names in the candidate geographical name set in a descending sequence of the similarity values, and returning a sorting result to the user device for display.
0
6. The method of querying an English geographical name according to claim 4 or 5, wherein the similarity value of each English geographical name in the candidate geographical name set and the search keyword is calculated based on the following formula:
sim(P, VP) len(N) max(len(P),len(lV))’ wherein P refers to a character string of the search keyword, W refers to a character string 25 of the English geographical name, sim (P, W) refers to a sequence similarity value between P and W, len (N), len (P) and len (W) refer to character string length values of N, P and W respectively, and N refers to characters with a same sequence between P and W.
7. An apparatus for querying an English geographical name, the apparatus being applied to a user device, and comprising:
30 a receiving module, configured to obtain a search keyword input by a user on the user device;
a querying module, configured to query a candidate geographical name set relating to the search keyword according to an index file pre-established in an English geographical name database, wherein the index file stored on the user device is constructed according to the method of establishing an English geographical name index of claim 1 or 2; and
5 a displaying module, configured to display the candidate geographical name set returned to the user device.
2018102145 12 Oct 2018
8. The apparatus for querying an English geographical name according to claim 7, further comprising:
a similarity calculating module, configured to calculate a similarity value of each English 0 geographical name in the candidate geographical name set and the search keyword after the candidate geographical name set is obtained; and a sorting module, configured to sort, by the user, the English geographical names in the candidate geographical name set in a descending sequence of the similarity values, and return a sorting result to the user device,
5 wherein the displaying module displays the sorting result.
9. The apparatus for querying an English geographical name according to claim 7 or 8, wherein a formula for calculating the similarity value of each English geographical name in the candidate geographical name set and the search keyword in the similarity calculating module comprises:
sim(P, W) len(N) max(len(P),len(lV))’ wherein P refers to a character string of the search keyword, W refers to a character string of the English geographical name, sim (P, W) refers to a sequence similarity value between P and W, len (N), len (P) and len (W) refer to character string length values of N, P and W respectively, and N refers to characters with a same sequence between P and W.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810945986.8A CN109165331A (en) | 2018-08-20 | 2018-08-20 | A kind of index establishing method and its querying method and device of English place name |
| CN201810945986.8 | 2018-08-20 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| AU2018102145A4 true AU2018102145A4 (en) | 2019-11-21 |
Family
ID=64896023
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| AU2018102145A Ceased AU2018102145A4 (en) | 2018-08-20 | 2018-10-12 | Method of establishing English geographical name index and querying method and apparatus thereof |
Country Status (3)
| Country | Link |
|---|---|
| CN (1) | CN109165331A (en) |
| AU (1) | AU2018102145A4 (en) |
| WO (1) | WO2020037794A1 (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110309151A (en) * | 2019-06-18 | 2019-10-08 | 精硕科技(北京)股份有限公司 | A kind of index establishing method, device and computer readable storage medium |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110275970B (en) * | 2019-06-21 | 2022-05-06 | 北京达佳互联信息技术有限公司 | Image retrieval method, device, server and storage medium |
| CN113011174B (en) * | 2020-12-07 | 2023-08-11 | 红塔烟草(集团)有限责任公司 | Method for identifying purse string based on text analysis |
| CN113268972B (en) * | 2021-05-14 | 2022-01-11 | 东莞理工学院城市学院 | Intelligent calculation method, system, equipment and medium for appearance similarity of two English words |
| CN114580560B (en) * | 2022-03-15 | 2025-02-21 | 浪潮软件科技有限公司 | A model construction method and device for commodity name classification |
| CN115759055A (en) * | 2022-11-16 | 2023-03-07 | 扬州大学 | English place name proofreading method considering multi-dimensional character characteristics |
Family Cites Families (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2001043236A (en) * | 1999-07-30 | 2001-02-16 | Matsushita Electric Ind Co Ltd | Similar word extraction method, document search method, and apparatus used therefor |
| CN100452048C (en) * | 2006-06-02 | 2009-01-14 | 凌阳科技股份有限公司 | Method and system for inquiring words of electronic dictionary by using letter index table |
| CN101840406B (en) * | 2009-03-20 | 2015-10-14 | 富士通株式会社 | Place name searching device and system |
| CN101930435B (en) * | 2009-10-27 | 2013-03-20 | 深圳市北科瑞声科技有限公司 | Method and system for retrieving organization names |
| CN101794307A (en) * | 2010-03-02 | 2010-08-04 | 光庭导航数据(武汉)有限公司 | Vehicle navigation POI (Point of Interest) search engine based on internetwork word segmentation idea |
| US10497042B2 (en) * | 2016-08-29 | 2019-12-03 | BloomReach, Inc. | Search ranking |
| CN108205578A (en) * | 2016-12-20 | 2018-06-26 | 北大方正集团有限公司 | Index generation method and device |
| CN107133311A (en) * | 2017-04-28 | 2017-09-05 | 安徽博约信息科技股份有限公司 | Network information ownership place index marker method based on regional code |
-
2018
- 2018-08-20 CN CN201810945986.8A patent/CN109165331A/en active Pending
- 2018-10-12 AU AU2018102145A patent/AU2018102145A4/en not_active Ceased
- 2018-10-12 WO PCT/CN2018/109938 patent/WO2020037794A1/en not_active Ceased
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110309151A (en) * | 2019-06-18 | 2019-10-08 | 精硕科技(北京)股份有限公司 | A kind of index establishing method, device and computer readable storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2020037794A1 (en) | 2020-02-27 |
| CN109165331A (en) | 2019-01-08 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| AU2018102145A4 (en) | Method of establishing English geographical name index and querying method and apparatus thereof | |
| US8745077B2 (en) | Searching and matching of data | |
| US10515090B2 (en) | Data extraction and transformation method and system | |
| US8855998B2 (en) | Parsing culturally diverse names | |
| CN101464896B (en) | Voice fuzzy retrieval method and apparatus | |
| US9110980B2 (en) | Searching and matching of data | |
| TW202422362A (en) | Sensitive data identification method, device, equipment and computer storage medium | |
| CN101976253B (en) | Chinese variation text matching recognition method | |
| CN105528411B (en) | Device and method for full-text retrieval of ship equipment interactive electronic technical manual | |
| CN107784110B (en) | A kind of index establishment method and apparatus | |
| US10552398B2 (en) | Database records associated with a tire | |
| CN100524293C (en) | Method and system for obtaining word pair translation from bilingual sentence | |
| CN109933787A (en) | Method, device and medium for extracting text key information | |
| CN105760359B (en) | Question processing system and method thereof | |
| CN109885641B (en) | Method and system for searching Chinese full text in database | |
| CN104239294B (en) | Hide the how tactful Tibetan language long sentence cutting method of Chinese translation system | |
| CN114881053B (en) | Sentence Granularity Transformation Testing Method for Neural Machine Translation System | |
| CN112784227A (en) | Dictionary generating system and method based on password semantic structure | |
| CN106776590A (en) | A kind of method and system for obtaining entry translation | |
| CN113065002A (en) | Chinese semantic disambiguation method based on knowledge graph and context | |
| Liang | Spell checkers and correctors: A unified treatment | |
| Goslin et al. | English Language Spelling Correction as an Information Retrieval Task Using Wikipedia Search Statistics | |
| CN108595584B (en) | Chinese character output method and system based on digital marks | |
| QasemiZadeh et al. | Adaptive language independent spell checking using intelligent traverse on a tree | |
| CN114756647A (en) | Standard term instant detection method based on pre-training language model |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| FGI | Letters patent sealed or granted (innovation patent) | ||
| MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |