[go: up one dir, main page]

AU2018102145A4 - Method of establishing English geographical name index and querying method and apparatus thereof - Google Patents

Method of establishing English geographical name index and querying method and apparatus thereof Download PDF

Info

Publication number
AU2018102145A4
AU2018102145A4 AU2018102145A AU2018102145A AU2018102145A4 AU 2018102145 A4 AU2018102145 A4 AU 2018102145A4 AU 2018102145 A AU2018102145 A AU 2018102145A AU 2018102145 A AU2018102145 A AU 2018102145A AU 2018102145 A4 AU2018102145 A4 AU 2018102145A4
Authority
AU
Australia
Prior art keywords
geographical name
english
refers
geographical
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2018102145A
Inventor
Mi DU
Peng Ye
Xueying ZHANG
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Normal University
Original Assignee
Nanjing Normal University
Nanjing Tech University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Normal University, Nanjing Tech University filed Critical Nanjing Normal University
Application granted granted Critical
Publication of AU2018102145A4 publication Critical patent/AU2018102145A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/907Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/909Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed is a method of querying an English geographical name dictionary, which belongs to the field of natural language processing. The method of querying an English geographical name dictionary based on an inverted index of feature statistics is obtained by performing 5 geographical name query with text features such as the total number of letters, the number of letter radicals, the total number of words and word initial codes in a geographical name according to a main line of "statistics of multi-dimensional features - generation of an inverted index - query of candidate geographical names - sorting of similarities". This method not only maintains a high operating efficiency in a large-scale data environment, but also obtains a target 0 geographical name accurately in the case of an inaccurate description of a queried geographical name. In this way, the users will have better experiences.

Description

) METHOD OF ESTABLISHING ENGLISH GEOGRAPHICAL NAME INDEX AND > QUERYING METHOD AND APPARATUS THEREOF
I •
) TECHNICAL FIELD
I [0001] The present disclosure relates to the field of natural language processing, and in particular to a method of querying an English geographical name dictionary oriented to largeI scale geographical name data.
I BACKGROUND ; [0002] Query of a geographical name dictionary is a basic operation of applications such as > geographical name spelling checking, fuzzy matching and optical recognition, and provides ' 0 knowledge support for geographical names. With the accelerated globalization process, geographical names are transmitted internationally in a increasingly high speed and used in a increasingly high frequency. English, as one of widely used languages in the world, is usually used as a standard for translation, storage and management of geographical names between different languages or characters. Meanwhile, with explosive growth of data and rapid development of information storage technology, a large-scale geographical name data set becomes increasingly popular. Therefore, how to efficiently query an English geographical name dictionary in a large-scale data environment becomes an important technical challenge to improve numerous geographical name services and applications.
[0003] In a conventional method of querying a dictionary, a query record is generally obtained by a sequential traversal or binary search method. However, it is very difficult for the method to satisfy an actual need when applied in massive data since its operation efficiency has a linear relationship with a data size. An inverted file, as a simple and efficient document data indexing manner, is a basic technology to realize a modern search engine retrieval system, and therefore is gradually introduced into a dictionary query mechanism. Word-level indexing is a general 25 organization manner for the inverted file to realize a phrase or proximity query, where an Ngram index is the commonest word-level indexing structure. Although the N-gram structure improves a recall rate of the query to a certain extent, lemmas generated by the N-gram usually greatly increases spatial resource occupation, resulting in reduced speeds of construction processing and query processing. In addition, similarity calculation is required when the fuzzy 30 query is performed with an index term formed in a morpheme form, and similarity comparison is performed for each index term and a query condition. Such query mode greatly increases complexity of an operation mechanism, and application requirements of the large-scale data environment are very difficult to satisfy.
) [0004] Therefore, to satisfy the actual application requirements of different scenarios, how to > efficiently return a completely accurate or the closest query result in a case of an inaccurate and
I > incomplete input during query of an English geographical name is to be studied and solved by ) those skilled in the art at present.
I * SUMMARY [0005] Technical problems )
[0006] The technical problems to be solved by the present disclosure include: how to I efficiently return a completely accurate or the closest query result in a case of an inaccurate and
J incomplete input during query of an English geographical name, and other technical problems * relevant to the technical problem.
' 0 [0007] Summary [0008] The summary of technical contribution contents is as follows: the key of improving query performance is to extract text features contained in the English geographical name and combine the text features with the dictionary query mechanism. According to the present disclosure, a method of querying an English geographical name dictionary based on an inverted 5 index of feature statistics is disclosed by performing geographical name query with the text features such as the total number of letters, the number of letter radicals, the total number of words and word initial codes of the geographical name according to the main line of “statistics of multi-dimensional features - generation of an inverted index - query of candidate geographical names - sorting of similarities”.
[0009] Technical solutions [0010] According to a first aspect, the present disclosure provides a method of establishing an English geographical name index. The method may be applied to a user device, including: amounting a plurality of feature values of all English geographical name phrases stored in text of English geographical name dictionary, wherein the feature value comprises the total number 25 of letters, the number of letter radicals, the total number of words and word initial codes;
b) generating a group of corresponding multi-dimensional feature statistical vectors according to the different feature values of the English geographical name phrases; and
c) establishing an inverted index file by using the multi-dimensional feature statistical vectors of different English geographical name phrases and their position mapping information in an inverted list as index terms, wherein each of the index terms corresponds to one inverted chain respectively.
[0011] The process and principle of the above method of establishing an English geographical name index are detailed below.
2018102145 12 Oct 2018 [0012] Firstly, for statistics of feature values, the feature values of all geographical name phrases stored in the English geographical name dictionary are sequentially counted. The statistics of the feature values include: the total number of letters, the number of letter radicals, the total number of words and word initial codes. (1) The total number of letters refers to a sum of all letters included in a geographical name phrase. (2) The letter radicals refer to that each
English letter is assumed to be comprised of partial radicals of six radicals, i.e., “|” , “— ” , “I” , , “ (” and “)” according to a hieroglyphic thought of Chinese characters, where radical expressions of different letters are shown in the following Table 1. Obviously, two character strings are more similar if the two characters have more same 0 characters. However, if there are more English letters, an appearance frequency of each letter recorded in the index term will occupy excessive storage space and is not conducive to comparison of character strings. Therefore, when each letter is expressed by fixed radicals, comparison complexity of query may be simplified under the precondition of implicitly recording features of the appearance frequency of the letter. (3) The total number of words 5 refers to a sum of all words included in the geographical name phrase. (4) The word initial codes refer to that the initial letters of words in the geographical name phrase are converted into digital codes. A conversion rule is as follows: mapping the codes of “01” to “26” in a sequence of A to Z respectively, that is, the code of A is “01”, the code of B is “02”, and so on. In a code conversion process, the initial letters are converted into capital letters in a unified way.
letter radical letter radical letter radical letter radical
A 423 B 166 C 5 D 16
E 1222 F 122 G 51 H 121
I 212 J 213 K 143 L 12
M 1313 N 131 0 56 P 116
Q 563 R 1163 s 526 T 21
u 151 V 34 w 1414 X 643
Y 341 Z 242 a 55 b 16
c 5 d 51 e 255 f 5162
g 516 h 126 i 2212 j 2216
k 243 1 125 m 1313 n 131
0 56 P 116 q 514 r 126
s 526 t 215 u 1515 v 34
w 1414 x 643 y 3415 z 242
Table 1 [0013] The radical ‘ j” is indicated by the number 1, the radical “—” is indicated by the number
2, the radical is indicated by the number 3, the radical “\” is indicated by the number 4, the radical “(” is indicated by the number 5, and the radical “)” is indicated by the number 6. [0014] Further, for the composition of the index term, the total number of letters, the number of letter radicals, the total number of words, the world initial codes and inverted list position information are sequentially recorded in each index term in an index dictionary respectively. In the formula, fcn refers to the total number of letters for recording a 1-dimensional vector; far refers to the number of letter radicals for recording a 6-dimensional vector, where there is number information of a total of six radicals;/;,,, refers to the total number of words for recording 5 the 1-dimensional vector; f™refers to the word initial codes for recording a 4-dimensional vector, where information of the word initial codes of the first four words of the geographical name phrase is recorded in this method, and the vacancy resulting from less than four words is complemented with the codes “00”. These vectors are combined according to the formulas (1), (2) and (3) to establish a 12-dimensional vector di. The vector di, as an index term, fully 0 represents the text features of the character strings of the English geographical name, and this is taken as an entrance of querying the English geographical name.
di ~ \.fcn>far>fwn>ftw\ (1)
ar \farl’ far2> (2)
iw fiw2> > fiw4·] (3)
[0015] Further, for construction of an inverted chain file, each index term appearing in the dictionary corresponds to one inverted chain, and the inverted chain records hit information of the index term in the geographical name dictionary by using a data structure (tf, <pl, p2, ..., pf>) of one document hit record. In the structure, tf refers to the number of appearances of the index term in the geographical name dictionary, and pi refers to position offset information of each appearance in the geographical name dictionary. All pieces of hit information are orderly arranged to form their corresponding inverted chains.
[0016] According to a second aspect, the present disclosure further provides a method of querying an English geographical name. The method is applied to a user device, including: obtaining a search keyword input by a user on the user device; querying a candidate geographical name set relating to the search keyword according to an index file pre-established in an English geographical name database, wherein the index file stored on the user device is constructed according to the method of establishing an English geographical name index of first 25 aspect; and returning the candidate geographical name set to the user device for display.
[0017] Process and principle of the above method of querying an English geographical name are detailed below.
[0018] A process of selecting a candidate geographical name set is described below.
[0019] Firstly, standardization processing is performed for a submitted queried geographical 30 name, that is, words in the geographical name phrase are converted into a form with initial letters being capitalized.
[0020] Secondly, according to a feature statistical rule during index construction, different feature values of the queried geographical name are counted, and organized into a vector form, ) which is denoted as Q=[qfCn, qfar,qfi™, qfiw].
> [0021] Thirdly, Q and the index term in the index dictionary are compared. The index term di
I > is a candidate term when the formula (4) is satisfied.
• ) ί II fen [fen * (1 kcn),fcn X (1 + kcn)] | I qfar [far X (1 kar),far X (1 + kar)] | ifwn [fwn X (1 kwn)> fwn X (1 + ^wn)]
Qfiw [fiw X (1 - kiw),fiw X (1 + /tiw)] ! [0022] In the above formula, fcn refers to the total number of letters, far refers to the number of j 5 letter radicals, fwn refers to the total number of words, and fiw refers to the initial codes; kcn refers ( to a dimensional threshold of the total number of letters, kar refers to a dimensional threshold of
J the number of letter radicals, kwn refers to a dimensional threshold of the total number of words,
I and kiw refers to a dimensional threshold of the initial codes.
[0023] Fourthly, index information in di is reversely analyzed, and geographical name data at 0 a relevant storage position in the geographical name dictionary is obtained according to the corresponding position offset information <pi, p2, ..., pf> in the inverted chain. Result combination is performed for all of the obtained geographical name data to form a candidate geographical name set.
[0024] In some preferred solutions, the above method of querying an English geographical name may further include the following blocks: after the candidate geographical name set is obtained, calculating a similarity value of each English geographical name in the candidate geographical name set and the search keyword; and sorting the English geographical names in the candidate geographical name set in a descending sequence of the similarity values, and returning a sorting result to the user device for display.
[0025] A process of sorting sequence similarities of the candidate geographical name sets is described below.
[0026] Firstly, the sequence similarities are calculated, that is, the sequence similarities of all geographical name phrases in the candidate geographical name set and the queried geographical name are calculated. It is assumed that there are two geographical name character strings, i.e., 25 P=pip2.. .pn and W=wiw2.. .wm, N refers to characters with a same sequence between P and W.
The same sequence of N is determined based on two principles. (1) A principle of a locally same sequence. N is formed by a locally similar term Isi, and a plurality of Isi may exist between P and W. If a sub-string qi=PjPj+i...pk exists in P, and is completely same as a sub-string WsWs+i...wt in W, Isi conforms to the principle of the locally same sequence, and is set to one 30 locally similar term. (2) A principle of a totally same sequence. N is formed by Isi with the same sequence between P and W. The geographical name similarity of P and W is calculated based on the following formula (5).
(5) j len(lV) ; Sim(P-M/) = max(len(P),lenm) 1 [0027] In the above formula, sim (P, W) refers to a sequence similarity value between P and
J W, and len (N), len (P) and len (W) refer to character string length values of N, P and W
I respectively.
[0028] Secondly, the sequence similarities are sorted. The candidate geographical names are ( 5 sorted based on the sequence similarities, and a sorting result is returned as a final query result ( to the user.
| [0029] Technical effects [0030] In the present disclosure, the geographical name is queried according to a main line of > “ statistics of multi-dimensional features - generation of an inverted index - query of candidate ' 0 geographical names - sorting of similarities” by summarizing the multi-dimensional text statistical features such as the number of words and the number of letters in the geographical name. In the process of index generation, the features including the total number of letters, the number of letter radicals, the total number of words and word initial codes in each geographical name record are extracted, and a corresponding inverted index structure is constructed with a 5 vector formed by multi-dimensional features as the index term. In the process of searching candidate geographical names and sorting sequence similarities, standardization processing and multi-dimensional feature extraction are performed for a query request, the candidate geographical name sets are obtained by querying the inverted index according to the generated feature vector, and the candidate sets are returned to the user in a descending sequence of the 0 similarities. It is proved by experiments that the method of querying an English geographical name dictionary based on inverted index of feature statistics according to the examples of the present disclosure not only maintains a high operating efficiency in a large-scale data environment, but also obtains a target geographical name accurately in the case of an inaccurate description of a queried geographical name. In this way, the users will have better experiences.
BRIEF DESCRIPTION OF THE DRAWINGS [0031] FIG. 1 is a flowchart illustrating a method of establishing an English geographical name index according to an example of the present disclosure.
[0032] FIG. 2 is a flowchart illustrating a method of querying an English geographical name according to an example of the present disclosure.
[0033] FIG. 3 is a flowchart illustrating a method of querying an English geographical name according to a preferred example of the present disclosure.
[0034] FIG. 4 is a principle diagram illustrating an apparatus for querying an English > geographical name according to an example of the present disclosure.
> [0035] FIG. 5 is a principle diagram illustrating an apparatus for querying an English
I > geographical name according to a preferred example of the present disclosure.
• ) [0036] FIG. 6 is a graphical flowchart illustrating a method of querying an English
I 5 geographical name dictionary according to an example of the present disclosure.
[0037] FIG. 7 is a schematic diagram illustrating an inverted index structure in a method of » establishing an English geographical name index according to an example of the present ( disclosure.
I [ DETAILED DESCRIPTION OF THE EMBODIMENTS > 0 [0038] Implementations of the present disclosure will be described by specific examples.
' Those skilled in the art may easily understand other advantages and effects of the present disclosure based on contents disclosed in the specification.
[0039] Explanations of technical names [0040] Letter radicals refer to that the composition of capital or small letters is described by using six characters (i.e., radicals) of “—”, “\”, “(” and “)”, that is, any capital or small letter is comprised of partial characters of the six characters. If the six characters of “—”, “\”, “(” and “)” are numbered as 1-6 in sequence respectively, the radicals of any letter may be indicated by a string of numbers. For example, “L” may be comprised of and “—”, so that the radicals of the “L” are indicated by the number “12”.
[0041] The number of letter radicals refers to that the number of all radicals are counted after the radicals of all letters in the English geographical name are indicated by numbers corresponding to the radicals (it is required that the initial letter of each word in the English geographical name is a capital letter).
[0042] The letter codes refer to that the initial letters of words in the geographical name phrase 25 are converted into digital codes. The conversion rule is to map codes from “01” to “26” in a sequence of A to Z, that is, the code of A is “01”, the code of B is “02”, and so on.
[0043] Example 1 [0044] As shown in FIG. 1, this example provides a method of establishing an English geographical name index. The method is applied to a user device, and the method includes the 30 following blocks:
[0045] SH, counting a plurality of feature values of all English geographical name phrases stored in text of English geographical name dictionary, wherein the feature value comprises the total number of letters, the number of letter radicals, the total number of words and word initial codes;
2018102145 12 Oct 2018 [0046] S12, generating a group of corresponding multi-dimensional feature statistical vectors according to the different feature values of the English geographical name phrases; and [0047] SI3, establishing an inverted index file by using the multi-dimensional feature statistical vectors of different English geographical name phrases and their position mapping 5 information in an inverted list as index terms, wherein each of the index terms corresponds to one inverted chain respectively.
[0048] Specifically, the multi-dimensional feature statistical vector is:
d-i = [fen> far> fwn> fiwl/herein di refers to the multi-dimensional feature statistical vector of the English geographical name, fcn refers to the total number of letters, far refers to the number 0 of letter radicals, fwn refers to the total number of words, fiw refers to initial codes, the far comprises number information of six radicals, and the fiw comprises initial code information of the first four words in the English geographical name phrase.
[0049] Specifically, the method of establishing an English geographical name index may further include: when an English geographical name is queried according to the index term, 5 comparing a search keyword with the index term; when the index term satisfies the following condition, taking the index term as a candidate term of query, [0050] wherein the condition comprises:
hfen [fcn X (1 ^cn)> fcn X (1 + ^cn)]
I far θ [far X (1 kar)>far X (1 3 kar)] fwn [fwn X (1 - ^wn)> fwn X (1 + ^wn)] qfiw [fiw X (1 - kiw)>fiw X (1 + kjw)] [0052] wherein qfcn refers to the total number of letters in the search keyword, qfar refers to 20 the number of letter radicals in the search keyword, qfwn refers to the total number of words in the search keyword, qfiw refers to the initial codes in the search keyword, kcn refers to a dimension threshold of the total number of letters, kar refers to a dimension threshold of the number of letter radicals, kwn refers to a dimension threshold of the total number of words, and kiw refers to a dimension threshold of the initial codes.
[0053] Example 2 [0054] As shown in FIG. 2, this example provides a method of querying an English geographical name. The method is applied to a user device, and the method includes the following blocks:
[0055] S21, obtaining a search keyword input by a user on the user device;
[0051] ) [0056] S22, querying a candidate geographical name set relating to the search keyword > according to an index file pre-established in an English geographical name database, wherein
I > the index file stored on the user device is constructed according to the method of establishing •
) an English geographical name index of example 1; and
I 5 [0057] S23, returning the candidate geographical name set to the user device for display.
[0058] As a preferred example, as shown in FIG. 3, after the candidate geographical name set ! is obtained, the method of querying an English geographical name may further include:
I [0059] S31, calculating a similarity value of each English geographical name in the candidate J geographical name set and the search keyword; and ) 0 [0060] S32, sorting the English geographical names in the candidate geographical name set
I in a descending sequence of the similarity values, and returning a sorting result to the user device for display.
[0061] Specifically, the similarity value of each English geographical name in the candidate geographical name set and the search keyword is calculated based on the following formula:
[0062] sim(P, VF) =--- 'θ'.
L J v 7 max(len(P),len(lV)) [0063] wherein P refers to a character string of the search keyword, W refers to a character string of the English geographical name, sim (P, W) refers to a sequence similarity value between P and W, len (N), len (P) and len (W) refer to character string length values of N, P and W respectively, and N refers to characters with a same sequence between P and W.
[0064] Example 3 [0065] As shown in FIG. 4, this example provides an apparatus 300 for querying an English geographical name. The apparatus 300 is applied to a user device, and the apparatus 300 specifically includes a receiving module 310, a querying module 320 and a displaying module 330. The receiving module 310 is configured to obtain a search keyword input by a user on the 25 user device. The querying module 320 is configured to search a candidate geographical name set relating to the search keyword according to an index file pre-established in an English geographical name database, where the index file stored on the user device is constructed according to the method of establishing an English geographical name index of claim 1 or 2.
The displaying module 330 is configured to display the candidate geographical name set 30 returned to the user device.
[0066] In a preferred solution, as shown in FIG. 5, the apparatus for querying an English geographical name further includes a similarity calculating module 410 and a sorting module 420. The similarity calculating module 410 is configured to calculate a similarity value of each
English geographical name in the candidate geographical name set and the search keyword after obtaining the candidate geographical name set. The sorting module 420 is configured to sort
English geographical names in the candidate geographical name set in a descending sequence of the similarity values, and return a sorting result to the user device. The displaying module displays the sorting result.
[0067] Specifically, a formula for calculating the similarity value of each English geographical name in the candidate geographical name set and the search keyword in the similarity calculating module comprises:
2018102145 12 Oct 2018 sim(P, VF) len(N) max(len(P),len(lV))’ wherein P refers to a character string of the search keyword, W refers to a character string of the English geographical name, sim (P, W) refers to a sequence similarity value between P and W, len (N), len (P) and len (W) refer to character string length values of N, P and W respectively, and N refers to characters with a same sequence between P and W.
[0068] To enable those skilled in the art to understand the present disclosure more clearly, contents of the above examples will be described in detail with a geographical name “Aalders
Lang Brook” in combination with FIG. 1. For convenience of description and understanding, descriptions will be made in a logical sequence of a process of generating an index — a process of searching candidate geographical names - a process of sorting sequence similarities. [0069] (I ) The process of generating an index includes the following blocks.
[0070] At block 11, feature values including the total number of letters, the number of letter radicals, the total number of words and word initial codes of all geographical name phrases stored in the English geographical name dictionary are sequentially counted. The geographical name “Aalders Lang Brook” is taken as an example, and the total number of letters is 16. For the number of letter radicals, the numbers of appearances of six radicals, i.e., “—”, “\”, “(” and “)” are 9, 8, 3, 2, 12 and 9 respectively. The total number of words is 3. The word initial codes are 1, 12, 2 and 0 respectively.
[0071] At block 12, an index dictionary file is constructed. In the index dictionary, the total number of letters, the number of letter radicals, the total number of words, the word initial codes and inverted list position information are sequentially recorded in each index term respectively.
The geographical name “Aalders Lang Brook” is taken as an example. The total number of letters is 16, the numbers of letter radicals are 9, 8, 3, 2, 12 and 9, the total number of words is 3, and the word initial codes are 1, 12, 2 and 0. Therefore, the multi-dimensional feature vector is expressed as [16, [9, 8, 3, 2, 12, 9], 3, [1, 12, 2, 0]]. Further, because of the inverted list position mapping information <1001>, the index term structure in the index dictionary file is ([16, [9, 8, 3, 2, 12, 9], 3, [1, 12, 2, 0]], <1001>).
[0072] At block 13, an inverted chain file is constructed. Each index term appearing in the dictionary corresponds to one inverted chain, and the inverted chain records hit information of the index term in the geographical name dictionary by using a data structure (tf, <pi,p2, ·,Pf>) of one document hit record. The multi-dimensional feature vector [16, [9, 8, 3, 2, 12, 9], 3, [1, 12, 2, 0]] is taken as an example, and its corresponding inverted list position mapping information is <1001>, that is, storage position information of phrases with all multidimensional feature vectors in the English geographical name dictionary is stored at the position 1001 of the inverted chain file. For example, if the information recorded at the position 1001 of the inverted chain file is (<5>, <7>, ..., < 125>, ...), it indicates that the storage positions of the relevant geographical name phrases in the English geographical name dictionary are 5, 7, ..., 125,... respectively.
[0073] (II) The process of searching candidate geographical names includes the following blocks.
[0074] At block 21, standardization processing is performed for a submitted queried geographical name, that is, words in the geographical name phrase are converted into a form with initial letters being capitalized. The queried geographical name “Alders langbrook” is taken as an example, the phrase is required to be converted into “Alders Lang Brook”.
[0075] At block 22, according to a feature statistical rule during index construction, different feature values of the queried geographical name are counted, and organized into a vector form, which is denoted as Q=[qfcn, qfar, qfwn, qfiw]. The queried geographical name “Alders langbrook” is taken as an example, the total number of letters is 15, the numbers of letter radicals are 9, 8, 3, 2, 10 and 9, the total number of words is 3, the word initial codes are 1, 12, 2 and 0, and the multi-dimensional statistical vector is [15, [9, 8, 3, 2, 10, 9], 3, [1, 12, 2, 0]].
[0076] At block 23, Q and the index term in the index dictionary are compared. The index term di is a candidate term when the formula (4) is satisfied.
f Qfcn [fen X (1 kcn)> fen X (1 + ^cn)]
I Qfar [far X (1 karffar X (1 + kar)] | Qfwn [fwn X (1 - kwn)> fwn X (1 + ^wn)]
Qfiw [fiw X (1 - X (1 + kfiv)] [0077] In the above formula, fcn refers to the total number of letters, far refers to the number of letter radicals, fwn refers to the total number of words, and fjw refers to the initial codes; kcn refers to a dimensional threshold of the total number of letters, kar refers to a dimensional threshold of the number of letter radicals, kwn refers to a dimensional threshold of the total number of words, and kiw refers to a dimensional threshold of the initial codes.
[0078] At block 24, index information in di is reversely analyzed, and geographical name data
2018102145 12 Oct 2018 at a relevant storage position in the geographical name dictionary is obtained according to the corresponding position offset information <pi, p2, ..., pf> in the inverted chain. Result combination is performed for all of the obtained geographical name data to form a candidate geographical name set. The queried geographical name “Alders langbrook” is taken as an example. The index term ([16, [9, 8, 3, 2, 12, 9], 3, [1, 12, 2, 0]], <1001>) obtained at block 23 is a candidate term qdi, all inverted chain mapping position information in qdi is analyzed, and the relevant record <1001> is queried in the inverted chain. Then, the relevant geographical name phrases are queried in the English geographical name dictionary file by use of the dictionary storage position information (<5>, <7>,..., < 125>,...) included in the record <1001 >, 0 and all geographical names form a candidate geographical name set C.
[0079] (III) The process of sorting sequence similarities includes the following blocks.
[0080] At block 31, similarities between geographical names are determined by counting a quantity proportion of characters with the same sequence between two character strings. It is assumed that there are two geographical name character strings, i.e., P=pip2...pn and 5 W=wiw2...wm , N refers to characters with a same sequence between P and W. The same sequence of N is determined based on two principles. (1) A principle of a locally same sequence. N is formed by a locally similar term Isi, and a plurality of Isi may exist between P and W. If a sub-string qi=pjpj+ i...pk exists in P, and is completely same as a sub-string /...wt in W, Is, conforms to the principle of the locally same sequence, and is set to one locally similar term. 0 (2) A principle of a totally same sequence. N is formed by Is, with the same sequence between
P and W. For example, P=“Aalders Lang Brook” and JF=“Lang Aalders Brook”. “Aalders”, “Lang” and “Brook” are locally similar terms Is/, ls2 and ls3 respectively according to the principle of the locally same sequence. The sequence in P is lsils2lss, and the sequence in W is Is2ls/ls3. If the sequence in the queried geographical name P is taken as a reference, Is/lss 25 conforms to the principle of the totally same sequence, and therefore, N=ls/ls3. The similarity of the geographical names P and W may be calculated based on the formula (5).
len(N) ( 5 sim(P IV Ί =-----------------max(-len(-py ien^W^ ) [0081] In the above formula, sim (P, W) refers to a sequence similarity value between P and W, and len (N), len (P) and len (W) refer to character string length values of N, P and W respectively. That is, the similarity of “Aalders Lang Brook” and “Lang Aalders Brook” is 30 12/16-0.75.
[0082] At block 32, the sequence similarities are sorted. The geographical names Cq in the candidate geographical name set C are sorted in a descending sequence of similarities based on the similarity calculation result of block 31, and the geographical names Cq ranked at the first Ο n positions are taken as query results.
+_> [0083] Experimental analysis
O
O [0084] In this example, to verify technical effects of the present disclosure, an English 5 geographical name dictionary is constructed with 115,000 pieces of English geographical name data as an example, and 5409 geographical names are extracted as standard geographical names
IT) from the dictionary. A test set is constructed by adding artificial errors in the standard geographical names. The error types include a plurality of inaccurate description manners (such ru
O as redundant letters, missing letter, letter error and letter sequence replacement), and are divided OO 0 into five levels (as shown in the table) according to an accuracy of the standard geographical O names with added errors compared with the original standard geographical names. A definition of the accuracy is as shown in the formula (6).
accu(P, C) = — (6) k 7 N [0085] In the above formula, A refers to the number of correct characters of the queried geographical name P compared with the target geographical name C, N refers to the number of 5 characters of the queried geographical name P, and accu (P, C) refers to the accuracy of P.
Level Accuracy of a single geographical name in the test set Number of geographical names in the test set Example of the test set
Test set 1 [90%, 100%) 1080 Abbott Afine (Abbott Mine); Abbotts Creek Centetery (Abbotts Creek Cemetery)
Test set 2 [80%, 90%) 1002 Black Oak Baptist Cl (Black Oak Baptist Church) ; Black WalntttChttrch (Black Walnut Church)
Test set 3 [70%, 80%) 1091 Aldi ne Clatlrch (Aldine Church) ; }\lexlslt (Alex Island )
Test set 4 [60%, 70%) 1060 Allyns (Allyns Lake) ; AntpersandNlc (Ampersand Mountain)
Test set 5 [50, 60%) 1049 BlxtedloveRc (BreedloveReservoir) ; BozeCc(Boze Cemetery)
Table 2 Division detail of the test set in an example
Note: contents in parentheses are target geographical names corresponding to the tested geographical names, that is, standard geographical names.
[0086] In addition, in the experiment, the query results of the queried geographical names with different accuracies are as shown in the following Table 3 according to the present disclosure.
2018102145 12 Oct 2018
Test set Number of geographical names in the test set (pc) Number of accurately queried geographical names (pc) Accuracy (%) Average efficiency (ms)
1 1080 1008 93.33 14
2 1002 934 93.21 11
3 1091 847 77.64 14
4 1060 720 67.92 13
5 1049 647 61.68 12
Table 3 Statistics of evaluation indicators of experiment results [0087] The experimental results show that the method of querying an English geographical name dictionary based on inverted index of feature statistics according to the examples of the present disclosure not only maintains a high operating efficiency in a large-scale data environment, but also obtains a target geographical name accurately in the case of an inaccurate description of a queried geographical name.
[0088] The above examples are only illustrative of the principles and effects of the present 0 disclosure, rather than intended to limit the present disclosure. Any person skilled in the art may make modifications or changes to the above examples without departing from the spirit and scope of the present disclosure. Therefore, all equivalent modifications or changes completed by those of ordinary skill in the art without departing from the spirit and technical idea of the present disclosure shall still be encompassed in the claims of the present disclosure.

Claims (8)

1. A method of establishing an English geographical name index, the method being applied to a user device, and comprising:
counting a plurality of feature values of all English geographical name phrases stored in 5 text of English geographical name dictionary, wherein the feature value comprises the total number of letters, the number of letter radicals, the total number of words and word initial codes;
generating a group of corresponding multi-dimensional feature statistical vectors according to the different feature values of the English geographical name phrases; and establishing an inverted index file by using the multi-dimensional feature statistical vectors 0 of different English geographical name phrases and their position mapping information in an inverted list as index terms, wherein each of the index terms corresponds to one inverted chain respectively.
2. The method of establishing an English geographical name index according to claim 1, wherein the multi-dimensional feature statistical vector is:
5 di — [fcn>far>fwn>fiw\>
wherein di refers to the multi-dimensional feature statistical vector of the English geographical name, fcn refers to the total number of letters, far refers to the number of letter radicals, l'wn refers to the total number of words, fiw refers to initial codes, the far comprises number information of six radicals, and the flw comprises initial code information of the first 0 four words in the English geographical name phrase.
3. The method of establishing an English geographical name index according to claim 2, further comprising:
when an English geographical name is queried according to the index term, comparing a search keyword with the index term; when the index term satisfies the following condition, 25 taking the index term as a candidate term of query, wherein the condition comprises:
i tyfon [fcn X (1 kcn)> fcn x (1 + ^cn)]
I tffar [far X (1 ^ar)> far X (1 + | tffwn [fwn X (1 - ^wn)> fwn X (1 + ^wn)]
Tlfiw [fiw X (1 - kiw),fiw X (1 + k/w)] wherein qfcn refers to the total number of letters in the search keyword, qfar refers to the number of letter radicals in the search keyword, qfwn refers to the total number of words in the search keyword, qfiw refers to the initial codes in the search keyword, kcn refers to a dimension threshold of the total number of letters, kar refers to a dimension threshold of the number of letter radicals, kwn refers to a dimension threshold of the total number of words, and kiw refers to a dimension threshold of the initial codes.
2018102145 12 Oct 2018
5 4. A method of querying an English geographical name, the method being applied to a user device, and comprising:
obtaining a search keyword input by a user on the user device;
querying a candidate geographical name set relating to the search keyword according to an index file pre-established in an English geographical name database, wherein the index file 0 stored on the user device is constructed according to the method of establishing an English geographical name index of claim 1 or 2; and returning the candidate geographical name set to the user device for display.
5. The method of querying an English geographical name according to claim 4, wherein after the candidate geographical name set is obtained, the method further comprises:
5 calculating a similarity value of each English geographical name in the candidate geographical name set and the search keyword; and sorting the English geographical names in the candidate geographical name set in a descending sequence of the similarity values, and returning a sorting result to the user device for display.
0
6. The method of querying an English geographical name according to claim 4 or 5, wherein the similarity value of each English geographical name in the candidate geographical name set and the search keyword is calculated based on the following formula:
sim(P, VP) len(N) max(len(P),len(lV))’ wherein P refers to a character string of the search keyword, W refers to a character string 25 of the English geographical name, sim (P, W) refers to a sequence similarity value between P and W, len (N), len (P) and len (W) refer to character string length values of N, P and W respectively, and N refers to characters with a same sequence between P and W.
7. An apparatus for querying an English geographical name, the apparatus being applied to a user device, and comprising:
30 a receiving module, configured to obtain a search keyword input by a user on the user device;
a querying module, configured to query a candidate geographical name set relating to the search keyword according to an index file pre-established in an English geographical name database, wherein the index file stored on the user device is constructed according to the method of establishing an English geographical name index of claim 1 or 2; and
5 a displaying module, configured to display the candidate geographical name set returned to the user device.
2018102145 12 Oct 2018
8. The apparatus for querying an English geographical name according to claim 7, further comprising:
a similarity calculating module, configured to calculate a similarity value of each English 0 geographical name in the candidate geographical name set and the search keyword after the candidate geographical name set is obtained; and a sorting module, configured to sort, by the user, the English geographical names in the candidate geographical name set in a descending sequence of the similarity values, and return a sorting result to the user device,
5 wherein the displaying module displays the sorting result.
9. The apparatus for querying an English geographical name according to claim 7 or 8, wherein a formula for calculating the similarity value of each English geographical name in the candidate geographical name set and the search keyword in the similarity calculating module comprises:
sim(P, W) len(N) max(len(P),len(lV))’ wherein P refers to a character string of the search keyword, W refers to a character string of the English geographical name, sim (P, W) refers to a sequence similarity value between P and W, len (N), len (P) and len (W) refer to character string length values of N, P and W respectively, and N refers to characters with a same sequence between P and W.
AU2018102145A 2018-08-20 2018-10-12 Method of establishing English geographical name index and querying method and apparatus thereof Ceased AU2018102145A4 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810945986.8A CN109165331A (en) 2018-08-20 2018-08-20 A kind of index establishing method and its querying method and device of English place name
CN201810945986.8 2018-08-20

Publications (1)

Publication Number Publication Date
AU2018102145A4 true AU2018102145A4 (en) 2019-11-21

Family

ID=64896023

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2018102145A Ceased AU2018102145A4 (en) 2018-08-20 2018-10-12 Method of establishing English geographical name index and querying method and apparatus thereof

Country Status (3)

Country Link
CN (1) CN109165331A (en)
AU (1) AU2018102145A4 (en)
WO (1) WO2020037794A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309151A (en) * 2019-06-18 2019-10-08 精硕科技(北京)股份有限公司 A kind of index establishing method, device and computer readable storage medium

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110275970B (en) * 2019-06-21 2022-05-06 北京达佳互联信息技术有限公司 Image retrieval method, device, server and storage medium
CN113011174B (en) * 2020-12-07 2023-08-11 红塔烟草(集团)有限责任公司 Method for identifying purse string based on text analysis
CN113268972B (en) * 2021-05-14 2022-01-11 东莞理工学院城市学院 Intelligent calculation method, system, equipment and medium for appearance similarity of two English words
CN114580560B (en) * 2022-03-15 2025-02-21 浪潮软件科技有限公司 A model construction method and device for commodity name classification
CN115759055A (en) * 2022-11-16 2023-03-07 扬州大学 English place name proofreading method considering multi-dimensional character characteristics

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001043236A (en) * 1999-07-30 2001-02-16 Matsushita Electric Ind Co Ltd Similar word extraction method, document search method, and apparatus used therefor
CN100452048C (en) * 2006-06-02 2009-01-14 凌阳科技股份有限公司 Method and system for inquiring words of electronic dictionary by using letter index table
CN101840406B (en) * 2009-03-20 2015-10-14 富士通株式会社 Place name searching device and system
CN101930435B (en) * 2009-10-27 2013-03-20 深圳市北科瑞声科技有限公司 Method and system for retrieving organization names
CN101794307A (en) * 2010-03-02 2010-08-04 光庭导航数据(武汉)有限公司 Vehicle navigation POI (Point of Interest) search engine based on internetwork word segmentation idea
US10497042B2 (en) * 2016-08-29 2019-12-03 BloomReach, Inc. Search ranking
CN108205578A (en) * 2016-12-20 2018-06-26 北大方正集团有限公司 Index generation method and device
CN107133311A (en) * 2017-04-28 2017-09-05 安徽博约信息科技股份有限公司 Network information ownership place index marker method based on regional code

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309151A (en) * 2019-06-18 2019-10-08 精硕科技(北京)股份有限公司 A kind of index establishing method, device and computer readable storage medium

Also Published As

Publication number Publication date
WO2020037794A1 (en) 2020-02-27
CN109165331A (en) 2019-01-08

Similar Documents

Publication Publication Date Title
AU2018102145A4 (en) Method of establishing English geographical name index and querying method and apparatus thereof
US8745077B2 (en) Searching and matching of data
US10515090B2 (en) Data extraction and transformation method and system
US8855998B2 (en) Parsing culturally diverse names
CN101464896B (en) Voice fuzzy retrieval method and apparatus
US9110980B2 (en) Searching and matching of data
TW202422362A (en) Sensitive data identification method, device, equipment and computer storage medium
CN101976253B (en) Chinese variation text matching recognition method
CN105528411B (en) Device and method for full-text retrieval of ship equipment interactive electronic technical manual
CN107784110B (en) A kind of index establishment method and apparatus
US10552398B2 (en) Database records associated with a tire
CN100524293C (en) Method and system for obtaining word pair translation from bilingual sentence
CN109933787A (en) Method, device and medium for extracting text key information
CN105760359B (en) Question processing system and method thereof
CN109885641B (en) Method and system for searching Chinese full text in database
CN104239294B (en) Hide the how tactful Tibetan language long sentence cutting method of Chinese translation system
CN114881053B (en) Sentence Granularity Transformation Testing Method for Neural Machine Translation System
CN112784227A (en) Dictionary generating system and method based on password semantic structure
CN106776590A (en) A kind of method and system for obtaining entry translation
CN113065002A (en) Chinese semantic disambiguation method based on knowledge graph and context
Liang Spell checkers and correctors: A unified treatment
Goslin et al. English Language Spelling Correction as an Information Retrieval Task Using Wikipedia Search Statistics
CN108595584B (en) Chinese character output method and system based on digital marks
QasemiZadeh et al. Adaptive language independent spell checking using intelligent traverse on a tree
CN114756647A (en) Standard term instant detection method based on pre-training language model

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry