
WO2009005492A1 - Systems and methods for validating an address - Google Patents

Systems and methods for validating an address

Info

Publication number
WO2009005492A1
Authority
WO
WIPO (PCT)
Prior art keywords
street name
input
character string
address
search table
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2007/015123
Other languages
English (en)
Inventor
James Daniel Self
Robert F. Snapp
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
US Postal Service (USPS)
Original Assignee
US Postal Service (USPS)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by US Postal Service (USPS)
Priority to PCT/US2007/015123
Publication of WO2009005492A1
Anticipated expiration
Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/08 Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation

Definitions

  • This invention generally relates to character string validation and error correction and, more specifically, to selecting the best matching word for an incorrectly spelled word, such as a misspelled street name in an address.
  • text is typically processed using a standard encoding scheme (e.g., ASCII or Unicode) to represent each of the individual characters (e.g., a letter or a number) in a word or a number.
  • An entire word or number, or group of words or numbers, is typically represented by a set or string of characters in a standard encoding scheme.
  • character strings are employed to represent information related to items that need to be delivered, such as a piece of mail or a package.
  • a delivery address indicating the location to which an item is to be delivered may be represented by a character string, or set of character strings.
  • the delivery address may come from various sources: it may be read from the surface of a delivery item by an OCR system; it may come from an electronic mailing list; it may be scanned in from a paper mailing list; etc.
  • a word or number may have an error in it. Errors may be in the form of misspellings, typographical errors, incorrect information, incorrect words, transposed numbers, misread characters, etc. Such errors are often introduced when a word or number is entered into a computer file by a human typist, optical character recognition system, scantron reader, speech recognition system, etc.
  • Address information may be used for other purposes that require low error rates in address validation and correction processes, in addition to directing items for delivery.
  • the USPS® uses address information to determine whether a customer has filed a change-of-address ("COA") order with the USPS® and to automatically forward a delivery item to a customer's new address when appropriate.
  • Other delivery services may have similar systems and abilities.
  • One example of a source of addresses that require validation and correction is a mailing list.
  • Organizations typically use mailing lists containing the names and addresses of individuals interested in the organizations' products or services to send material to multiple recipients. Such mailing lists are typically kept in a computer-readable form, such as a text file or a database file.
  • An organization may provide a mailing list to a delivery service, such as the U.S. Postal Service, for use in sending, for example, newsletters, periodicals, or advertising to the individuals on the mailing list.
  • Organizations wish to avoid wasting materials and money by sending material to invalid or incorrect addresses contained in their mailing list.
  • mailing lists are valuable in their own right. For some organizations, such as specialized niche publications or charitable groups, their mailing lists may be revenue-generating assets. There are even mailing list brokers that help organizations maximize the value of their mailing lists by renting or selling them. The value of a mailing list is enhanced when the addresses on it are valid and error-free.
  • Embodiments consistent with the present invention include systems, methods, and software for validating an address comprising operations and/or apparatus for identifying a set of street name character strings corresponding to the streets in a defined geographic focal locale; organizing the set of street name character strings into a fast search table; receiving an input address string containing an input street name character string field and an input building number string field, wherein the input address string represents a location within the defined geographic focal locale; searching the fast search table for a matching street name character string that exactly matches the input street name character string field; if an exactly matching string is not found, determining the matching street name character string from the fast search table to be a street name character string that most closely matches the input street name character string field; accessing, according to the matching street name character string, a single address record from a plurality of address records in a comprehensive address data set, wherein the single address record includes a number range; calculating whether the input building number string field represents a number that is within the number range; and if the input building number string field represents a number
  • Embodiments consistent with the present invention also include systems, methods, and software for validating an address using operations and apparatus for receiving an input address having an input street name field and an input building number field, wherein the input address represents a location within a defined geographic area; searching a fast search table corresponding to the defined geographic area for a matching street name that exactly matches the input street name field, wherein the fast search table comprises representations of streets in the defined geographic area; if an exactly matching street name is not found, assigning the matching street name to be a street name from the fast search table that exceeds a predetermined threshold of similarity to the input street name field; accessing a number range from an address data record corresponding to the matching street name, wherein the address data record is one among a plurality of address data records; calculating whether the input building number field represents a number that is encompassed by the number range; and if the input building number field represents a number that is encompassed by the number range, outputting an indication that the input address is valid.
  • Figure 1A is a representation of an exemplary address information data set including phonetic code representations.
  • Figure 1B is a representation of an exemplary address information data set consistent with an embodiment of the invention.
  • Figures 2A and 2B are a flow chart of an exemplary process for recognizing and correcting errors in a digital representation of an address consistent with an embodiment of the invention.
  • Figure 3A is a representation of an exemplary search table of character strings consistent with an embodiment of the invention.
  • Figure 3B is a diagram of an exemplary location description character string divided into fields of character strings consistent with an implementation of the invention.
  • Figure 3C is a representation of an exemplary ranked list of error-corrected character strings consistent with an embodiment of the invention.
  • Figure 4 illustrates an exemplary computing system that may be used to implement embodiments of the invention.
  • the USPS® has developed systems and techniques to recognize and correct errors in the computer or digital representations of words and numbers, including the words or numbers in an address used by the USPS® for directing the delivery of items. Other delivery services may have similar systems.
  • One USPS® system in this area is called the address matching engine or ZIP+4® engine, which is a computer application that uses an address data set listing all cities and streets organized by delivery area and including the ranges of street numbers for the buildings that the USPS® delivers to along those streets. Certain embodiments of the ZIP+4® engine are described in U.S. Patent No. 7,031,959, which is hereby incorporated herein by reference.
  • the ZIP+4® engine accepts an input address, such as "123 Main, Greatbend, KS," and first produces a corresponding digital representation, such as a character string, in a standardized format, such as "123 MAIN ST, GREAT BEND, KS 67532-1439."
  • the ZIP+4® engine evaluates the standardized street name field (e.g., "MAIN ST") and the building number field (e.g., "123") to verify that there is an actual street name in the delivery area specified by the address (e.g., an area that encompasses ZIP Code™ "67532") that matches the input street name, and if so, that the street number is within the range of valid building numbers for that matching street.
  • the ZIP+4® engine typically evaluates an area larger than the ZIP Code™ area identified in the input address, including, as explained below, a USPS® finance number area. As mentioned, to perform this evaluation, the ZIP+4® engine uses a predetermined address information data set maintained by the USPS® for each delivery area.
  • Figure 1A is a representation of an exemplary address information data set used by the ZIP+4® engine. As shown, this data set groups all streets in a ZIP Code™ delivery code area (column 105) and relates to each standardized street name (column 115) a range of building numbers (column 120) that the USPS® delivers to along that street. For example, the set of rows labeled 130 indicates the streets (Oak, Elm, . . .) that are in a geographic area including ZIP Code™ 67530. In this example, the number range "100-500" (labeled 140 in Fig. 1A) is the range of numbers that includes all the building numbers on Elm St. in ZIP Code™ 67530.
  • the data set also includes the Soundex code representation (column 110) (explained below) of each standardized street name (column 115), and may contain other information (represented by 125) that is not important to this explanation. There may be additional rows or records 127 included in the data set.
  • the data set may be stored on a computer-readable medium for access by a computer application, such as the ZIP+4® engine.
  • the ZIP+4® engine searches for street names that are a phonetic match for the street name portion ("MAIN") of the address, and then evaluates the associated number range of each phonetically matching street name to determine whether the range encompasses the building number portion ("123") from the input address.
  • a phonetically matching street name is a street name that is spelled differently from the input street name, but that sounds similar when pronounced.
  • In phonetic matching, the basic aim is for words with the same pronunciation to be encoded to the same output representation so that matching can occur despite minor differences in spelling. Of the various phonetic algorithms, Soundex is perhaps the most widely known.
  • the Soundex codes representing corresponding street names are shown in column 110.
  • the ZIP+4® engine searches column 110 for Soundex codes that are the same as the Soundex code for the input street name.
  • the data table includes four streets (label 135), "Mane," "Maine," "Mine," and "Main," that have a Soundex code representation of "M200000."
  • the ZIP+4® engine would perform further processing on the data for each of those four Soundex-matching streets 135.
  • the phonetic algorithm used in the ZIP+4® engine executes the following steps: (1) preserve the first character of the street name (e.g., the "M" from "Main"); (2) condense the street name by eliminating embedded spaces and repeated consonants (e.g., "East Main" becomes "EastMain"); and (3) assign each remaining consonant in the condensed word a numeric code according to the phonetic rules of the algorithm, until the end of the word is reached or until six codes have been assigned.
  • the ZIP+4® engine uses the following Soundex-based phonetic rules in its algorithm: a. Assign a 0 to each "S” and “Z”; b. Assign a 1 to each "B” and “P”; c.
  • This set of rules yields representations for examples of street names as shown in column 110 of the data set shown in Fig. 1 A.
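The three steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the ZIP+4® engine's implementation: the rule list in the text is truncated, so the table below contains only the two stated rules (S/Z to 0, B/P to 1) plus an M/N to 2 rule that is inferred from the example codes "M200000" and "E220000". The function name `phonetic_code`, the choice to skip consonants that have no rule, and the zero padding to six codes are likewise assumptions.

```python
# Partial rule table. Only S/Z -> 0 and B/P -> 1 are stated in the text;
# M/N -> 2 is an inference from the example codes M200000 and E220000.
PARTIAL_RULES = {"S": "0", "Z": "0", "B": "1", "P": "1", "M": "2", "N": "2"}
VOWELS = set("AEIOU")


def phonetic_code(street_name, rules=PARTIAL_RULES, max_codes=6):
    """Hypothetical left-weighted phonetic code: first letter plus six digits."""
    # Step 2, first half: keep letters only, which removes embedded spaces.
    letters = [c for c in street_name.upper() if c.isalpha()]
    if not letters:
        return ""
    # Step 2, second half: drop a consonant that repeats the previous letter.
    condensed = [letters[0]]
    for c in letters[1:]:
        if c == condensed[-1] and c not in VOWELS:
            continue
        condensed.append(c)
    # Step 1: preserve the first character unchanged.
    first, rest = condensed[0], condensed[1:]
    # Step 3: code the remaining consonants until six codes are assigned.
    codes = []
    for c in rest:
        if len(codes) == max_codes:
            break
        if c in VOWELS:
            continue
        code = rules.get(c)
        if code is None:
            continue  # assumption: consonants without a known rule are skipped
        codes.append(code)
    # Assumption: pad with zeros, as the example codes in the text suggest.
    return first + "".join(codes).ljust(max_codes, "0")


for name in ("Mane", "Maine", "Mine", "Main", "EMAIN"):
    print(name, phonetic_code(name))
```

With this partial table the four street names collapse to one code, while a leading-character error ("EMAIN") changes the code, which is the left-weighted behavior the text describes.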
  • the Soundex phonetic algorithm produces the same code "M200000" to represent the character strings for the street names "Mane," "Maine," "Mine," and "Main," and it is frequently the case that the address data set will have several street names that are phonetic matches for an input street name. Consequently, the ZIP+4® engine often spends a large amount of time performing multiple accesses to the address data set to get information needed for building number range processing and performing the number range processing algorithms multiple times.
  • the type of phonetic algorithm used in the ZIP+4® engine is said to be "left-weighted," which means the matching logic assumes that the first characters of the input word are spelled correctly.
  • this type of algorithm produces the same phonetic code for the words "MAIN" and "MAINE," but very different phonetic codes for "MAIN" (M200000) and "EMAIN" (E220000).
  • When the error falls in the first characters of a word, a left-weighted algorithm produces a phonetic code that differs greatly from the code for the correctly spelled word, and so it does not consider the two words to be a fuzzy match.
  • the ZIP+4® engine performs address range check processing using number range data 120 from the data set. For the input address "123 MAIN ST, GREAT BEND, KS 67532-1439" example, the ZIP+4® engine would access the data set and evaluate whether the building number "123" is within the street number range 147 for Main Street (and determine that it is not, because the range 147 is from "400-499"), and then perform similar accesses and evaluations for "Mine," "Maine," and "Mane." Multiple iterations of the address range check processing are time consuming and inefficient. Embodiments consistent with the principles of the invention solve many of the shortcomings of the ZIP+4® engine.
  • Figures 2A and 2B are a flow chart of an exemplary process for recognizing and correcting errors in a digital representation of an address consistent with an embodiment of the invention.
  • the process begins by extracting street name information for a given focal locale from a comprehensive address data set (stage 205).
  • the comprehensive address data set may be a legacy data set such as the data set used by the ZIP+4® engine, which is represented in Fig. 1A.
  • the comprehensive address data set contains additional information, such as geographic locale information, in addition to other address information such as street name, delivery point building numbers and/or building number range, and ZIP Code™ delivery codes.
  • Figure 1B is a representation of an exemplary comprehensive address information data set 150 consistent with such an embodiment of the invention.
  • the address information data set 150 may contain street name information 165, building number range information 170, and other information 175 related to addresses, all conceptually organized in rows or records for each street name 165.
  • the address information for the entire data set is grouped by focal locale 155 such that all the streets in the geographic area represented by the focal locale identifier "02" are in the same data table.
  • address information for several focal locales may be contained in the same data set and indexed by the focal locale attribute 155.
  • a focal locale attribute 155 may be added to the address information in each row of a legacy data set, such as the data set shown in Fig. 1A, allowing all the address information for streets in the same focal locale to be accessed, searched, and grouped together.
  • the focal locale may be any defined geographic area.
  • the focal locale is larger than a single ZIP Code™ area. Although it may not be wise to make the focal locale too large (which may result in many duplicate or similarly spelled street names within nearby towns and cities), it increases efficiency to make the focal locale large enough to capture cases where the address the sender intended can be matched to an address in the surrounding geographic area, even if it is not in the exact town or city specified in the input address.
  • the focal locale may be determined by any criteria.
  • the USPS® assigns a "finance number" to groups of delivery areas across the country, where each delivery area in the group corresponding to a given finance number may include several cities and several ZIP Code™ areas, and may span more than one state.
  • the finance number associated with the city, state, and/or ZIP Code™ delivery code of an address is considered the focal locale for that address.
  • the focal locale may be an area encompassed by a group of contiguous ZIP Code™ areas, city, county, state, or other political subdivision.
  • an address information data set may contain street name aliases, along with the standardized street names, which may be useful in determining where a sender intended an item to be delivered.
  • a street name alias may include, for example, the former name of a street whose name was changed.
  • an address data set 150 may not include the street name column 165 in implementations where the street names extracted into a search table (explained in the next stage) include links to their corresponding row of data in the address data set 150.
  • the process next constructs a search table out of the street names for all the streets in the focal locale (stage 210).
  • the search table is termed a fast search table to represent that it is preferably organized for employment of a rapid searching algorithm.
  • the fast search table is a data structure that includes the street name strings in an alphabetically ordered list so that a binary search algorithm can be employed on the list.
  • Figure 3A is a representation of an exemplary search table of character strings 300 consistent with one embodiment of the invention. As shown in the example of Fig. 3A, the street names 310 are arranged in alphabetical order to facilitate a rapid binary search of table 300, and all the streets in table 300 are from the same focal locale 305, which is represented by an arbitrary locale number "02" in this example to correspond with the example of a comprehensive address data set 150 shown in Fig. 1B.
  • the fast search table 300 may be any organization of the street names that facilitates a rapid search of the street name to determine whether or not a specified input street name is among the street names in the table.
  • the street names may not require any particular organization because the search algorithm does not benefit from any particular organization of the character strings.
  • another data structure such as a hash table, may be used to aid an algorithm searching for an input street name in a search table.
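A fast search table of this kind can be sketched as follows. This is a minimal Python illustration, assuming the table is a sorted list searched with the standard `bisect` module, with a `frozenset` shown as the hash-based alternative. The function names, the upper-case normalization, and the sample street names are assumptions made for the example.

```python
from bisect import bisect_left


def build_fast_search_table(street_names):
    """Return the street names as a sorted, de-duplicated list (upper-cased)."""
    return sorted({name.upper() for name in street_names})


def contains_exact(table, input_street_name):
    """Binary-search the sorted table for an exact match of the input name."""
    key = input_street_name.upper()
    i = bisect_left(table, key)
    return i < len(table) and table[i] == key


# Illustrative street names for one focal locale (hypothetical contents).
table_02 = build_fast_search_table(["Oak", "Elm", "Mane", "Maine", "Mine", "Main"])

# Hash-based alternative: membership tests in expected constant time.
hashed_02 = frozenset(table_02)

print(contains_exact(table_02, "Maine"))  # exact match present
print(contains_exact(table_02, "Marne"))  # no exact match
print("MARNE" in hashed_02)
```

The sorted list also keeps neighbors in alphabetical order, which a hash table does not, so the choice between the two depends on whether anything beyond exact membership is needed.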
  • the fast search table 300 contains street name aliases in addition to standardized street names for a focal locale.
  • the focal locale 305 may correspond to a finance number geographic area, or other geographic area.
  • fast search tables are constructed for all focal locales of interest, for instance, all the USPS® finance number geographic areas in the United States.
  • the process receives an input address and determines the focal locale corresponding to the address (stage 215), as shown in Fig. 2A.
  • the input address is an address that needs to be validated and corrected, if necessary.
  • the input address may come from any source, such as a mailing list, an OCR system that reads the address from an item, a mailing database, a customer database, an employee record, a government record, or some other source.
  • Figure 3B which is a diagram of an exemplary input address character string divided into fields of character strings, a U.S.
  • each of these fields contains information in a character string.
  • addresses from other countries having different formats, fields, or components than the one shown may be similarly processed after simple adaptation of the disclosed embodiments.
  • an input address may be digitally represented in a computer in other formats in addition to character strings, and such representations may be similarly processed after simple adaptation of the disclosed embodiments.
  • the process may determine the focal locale corresponding to the input address based on the city name field 330, the state name field 335, and/or the ZIP Code™ field 340, or any combination of these fields.
  • the USPS® maintains a database of ZIP Code™ delivery codes belonging to each finance number.
  • the process may determine the finance number focal locale by looking up the finance number corresponding to the ZIP Code™ delivery code in the ZIP Code™ field 340 of the input address. Using the input address shown in Fig.
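The lookup from a ZIP Code™ field to a focal locale can be sketched as a dictionary lookup. This is a minimal Python illustration: the mapping contents, the locale identifier "02" used as a stand-in for a finance number, and the function name `focal_locale_for` are all hypothetical, and a real system would query the USPS® database rather than an in-memory table.

```python
# Hypothetical mapping from 5-digit ZIP Code to focal locale identifier.
# "02" is the arbitrary locale number used in the figures, not a real
# finance number.
ZIP_TO_FOCAL_LOCALE = {
    "67530": "02",
    "67532": "02",
}


def focal_locale_for(zip_field):
    """Return the focal locale for a ZIP or ZIP+4 string, or None if unknown."""
    zip5 = zip_field.strip()[:5]  # ignore any "-1439" style ZIP+4 suffix
    return ZIP_TO_FOCAL_LOCALE.get(zip5)


print(focal_locale_for("67532-1439"))
print(focal_locale_for("00000"))
```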
  • the process performs a search of the appropriate fast search table for the focal locale to determine whether the fast search table contains a street name matching the street name field 325 of the input address (stage 220).
  • the process may perform a binary search of table 300 (Fig. 3A) for the street name "Marne.”
  • Other implementations may use a search algorithm other than a binary search algorithm, such as an interpolation search algorithm, Grover's search algorithm, or a hash table search algorithm, among others.
  • In stage 225, the process determines whether there is an exact match for the input street name in the fast search table. If so (stage 225, yes), then the process branches to stage 240. Otherwise (stage 225, no), the process branches to stage 230.
  • a rapid search in stage 220 may be advantageous in embodiments that process a large number of input addresses, such as might come from a mailing list, because time saved quickly finding exact matches (stage 225, yes) may offset time spent searching for non-exact matches (stage 225, no).
  • the process uses the matching input street name to access information about the street in the comprehensive address data set, such as the data set 150 shown in Fig. 1B. Because the fast search table is generated from the street names in the comprehensive address data set, there is a direct one-to-one correspondence between the street names in the fast search table and a data record or row in the comprehensive address data set. In an implementation where the comprehensive address data set is organized as a database, the matching input street name acts as an index, key, or link to the exact, single record corresponding to that street name in the database, which is used in further processing in subsequent stages.
  • the process compares the input street name to street names in the fast search table 300 for the focal locale determined from the input address. In one embodiment, the process compares the input street name to every street name in the fast search table 300.
  • stage 230 creates a ranked list of the street names from the fast search table organized in order of the degree to which each street name from the fast search table matches the input street name.
  • Figure 3C is a representation of an exemplary ranked list 360 of street names consistent with the invention. Because none of the character strings in the ranked list 360 exactly matches the input street name, they may be referred to as "fuzzy" matches to the input string.
  • stage 230 uses a non-phonetic matching algorithm to determine fuzzy matches by measuring the similarity between two words.
  • a distance algorithm is an example of a non-phonetic algorithm
  • the Levenshtein Distance algorithm is a well known example of a distance algorithm.
  • Other types of non-phonetic algorithms such as those that measure string metrics or edit distances, (e.g., the Hamming distance algorithm), may also be used to measure the similarity between two words or numbers.
  • Some implementations of distance algorithms output a similarity percentage figure (0 - 100%) after comparing two character strings, which can be used to create a ranked list of fuzzy matches.
  • the Levenshtein Distance algorithm may indicate an 80% similarity between "Marne" (the input street name) and "Maine" (a street name from fast search table 300), a 60% similarity between "Marne" (the input street name) and "Main" (another street name from fast search table 300), etc.
  • stage 230 produces a ranked list 360 of street names 350, as shown in Fig. 3C, which reflects how closely each street name from the focal locale matches the input street name.
  • the most similar street name from the focal locale is "Maine" 315, which is ranked first 355 in the ranked list 360, followed by the next most similar street name "Main," followed by "Mane," etc.
  • a phonetic algorithm such as the Soundex algorithm described previously, is not suitable for use in stage 230 because it cannot produce an indication of the degree of similarity between two character strings, and thus cannot be used to create a ranked list or determine which character string is most similar to an input character string.
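The ranking in stage 230 can be sketched with a plain Levenshtein distance. This is a minimal Python illustration, not the patented implementation. The normalization `1 - distance / length of the longer string` is an assumption chosen because it reproduces the 80% ("Marne" vs. "Maine") and 60% ("Marne" vs. "Main") figures quoted above; other street names may score, and therefore rank, differently under this normalization than in Fig. 3C. The function names and the sample table are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def similarity(a, b):
    """Similarity in [0, 1]; assumed normalization, see the note above."""
    a, b = a.upper(), b.upper()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def ranked_list(input_name, table):
    """Score every table entry against the input and sort best-first."""
    scored = [(name, similarity(input_name, name)) for name in table]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for name, score in ranked_list("Marne", ["Elm", "Main", "Maine", "Oak"]):
    print(f"{name}: {score:.0%}")
```

Because every candidate receives a numeric score, the same function supports both ranking and the threshold tests discussed later, which a code-equality phonetic match cannot provide.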
  • In stage 235, the process selects the top-ranked street name as the matching street name that most closely corresponds to the input street name. In effect, this corrects errors, such as a spelling error in the input street name (e.g., "Marne"), by replacing the input street name string with a matching, error-free street name string (e.g., "Maine") from the focal locale encompassing the input address.
  • In stage 240, the process uses the selected matching street name from stage 235 to access a comprehensive address data set 150.
  • In stage 245, the process uses information from the comprehensive address data set 150 to determine whether the building number from the input address is within the number range for the matching street name. If the building number is within the number range for the street (stage 245, yes), then the process branches to stage 250. Otherwise (stage 245, no), the process branches to stage 255. In stage 255, the process outputs an indication that the input address was not found in the focal locale, and ends. In stage 250, the process outputs an indication that the input address is valid, and ends. In some embodiments, the output of stage 250 includes the correctly spelled matching street name or the entire corrected input address.
  • the building number in the number field 320 from the input address "99 Marne St, Great Bend, KS 67532" is "99."
  • the matching street name from stage 235 is "Maine."
  • the data set row or record 180 for "Maine" has an address range "10-199," which encompasses the building number "99" from the input address, and therefore the process branches to stage 250 in this example and outputs an indication that the input address "99 Maine St, Great Bend, KS 67532" is valid.
  • the process may output a corrected version of the input address (in this case, correcting "Marne" to "Maine") along with the validity indicator.
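Stages 240 through 255 of this example can be sketched as a single record lookup followed by a range check. This is a minimal Python illustration under assumptions: records are held in a dictionary keyed by upper-cased street name, number ranges are stored as "low-high" strings of plain integers, and the function names are hypothetical. The "10-199" range for Maine comes from the example above; the "400-499" range for Main is borrowed from the Fig. 1A discussion purely for illustration.

```python
def parse_range(number_range):
    """Split a "low-high" string into two integers (assumed storage format)."""
    low, high = number_range.split("-")
    return int(low), int(high)


def check_building_number(building_number, matching_street, records):
    """Return True if the building number falls within the street's range."""
    # Stage 240: the matching name keys exactly one record.
    record = records.get(matching_street.upper())
    if record is None:
        return False
    try:
        number = int(building_number)
    except ValueError:
        return False  # assumption: non-numeric building numbers fail the check
    # Stage 245: inclusive range test.
    low, high = parse_range(record["number_range"])
    return low <= number <= high


# Illustrative records for one focal locale.
records_02 = {
    "MAINE": {"number_range": "10-199"},
    "MAIN": {"number_range": "400-499"},
}

if check_building_number("99", "Maine", records_02):
    print("valid: 99 Maine St")               # stage 250
else:
    print("not found in the focal locale")    # stage 255
```

Only one record is read per input address here, in contrast to the repeated accesses described for the phonetic approach.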
  • a mailing list may be updated with corrected street names and/or addresses based on the output of stage 250 so that the mailing list contains only corrected addresses; a mailing list may be updated to delete invalid addresses based on the output of stage 255 so that the mailing list contains only valid addresses; a package may be returned to the sender based on the output of stage 255; the focal locale may be expanded and the process run again based on the output of stage 255; the input address may be provided to a human operator for further analysis based on the output of stage 255; or for embodiments that output the address with a corrected street name from stage 250, the corrected address may be verified by a separate system, such as the USPS®'s DPV™ system, which accepts an input address and confirms that at least one delivery has been previously made to that delivery point address.
  • stage 230 may be modified to determine whether any street names matched the input street name with a degree of similarity exceeding a specified minimum threshold and output a "not in the focal locale" indication if none of the street names are sufficiently similar to exceed the threshold.
  • the threshold may be implemented as a minimum degree of similarity (or maximum degree of difference) between the input string and a valid character string.
  • a delivery service application may require that only valid character strings that are ranked as having a 67% or higher degree of similarity may be considered a fuzzy match to an input string that is part of a delivery address. Applying a 67% similarity threshold to our example, "Maine” would exceed the threshold for the input character string “Marne,” but “Main” would not.
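The threshold test can be sketched as a filter on the top of the ranked list. This is a minimal Python illustration: the function name and the representation of the ranked list as `(name, score)` pairs sorted best-first are assumptions, while the 67% threshold and the 80% and 60% scores are taken from the example above.

```python
def best_match_above_threshold(ranked, threshold=0.67):
    """Return the top-ranked name if it meets the threshold, else None.

    `ranked` is assumed to be a list of (name, similarity) pairs sorted
    best-first; None corresponds to a "not in the focal locale" result.
    """
    if not ranked:
        return None
    name, score = ranked[0]
    return name if score >= threshold else None


print(best_match_above_threshold([("Maine", 0.80), ("Main", 0.60)]))
print(best_match_above_threshold([("Main", 0.60)]))
print(best_match_above_threshold([]))
```

The empty-list case is handled the same way as a below-threshold result, so a caller needs only one "no match" branch.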
  • stage 235 may be modified to choose one street name as being the highest ranked when the matching algorithm outputs two or more equally ranked choices.
  • the determination of a choice may be based on other information from the input address, such as the building number or the name of the person or business associated with the input address, analyzed in light of the information in the comprehensive address data set 150 or other related data sets.
  • Stage 235 may assign the highest ranking to one street name over another based on this additional analysis.
  • stage 235 may be modified to notify a human operator when the matching algorithm outputs two or more equally ranked choices, and the operator may assign one of the choices the highest ranking after investigating tie-breaking criteria.
  • stage 235 may be modified to output two or more equally ranked choices with an indication that they are tied, and stages 240, 245, and 250 may be modified to perform a building number range check on each of the tied, equally ranked choices, and if only one passes the range check, output the passing one as the valid street name within the focal locale.
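The range-check tie-breaker can be sketched as follows. This is a minimal Python illustration, assuming the ranked list is a best-first list of `(name, score)` pairs and that number ranges are available as `(low, high)` integer tuples. The function name, the tied scores, and the range for "MANE" are hypothetical; only the "10-199" range for Maine comes from the text.

```python
def resolve_ties(ranked, building_number, ranges):
    """Among top-scoring names, return the only one whose range fits, else None."""
    if not ranked:
        return None
    top_score = ranked[0][1]
    tied = [name for name, score in ranked if score == top_score]
    passing = []
    for name in tied:
        low, high = ranges[name]
        if low <= building_number <= high:
            passing.append(name)
    # Only an unambiguous result is accepted; zero or several passing
    # candidates are left for another tie-breaking method.
    return passing[0] if len(passing) == 1 else None


ranked = [("MAINE", 0.8), ("MANE", 0.8), ("MAIN", 0.6)]  # hypothetical tie
ranges = {"MAINE": (10, 199), "MANE": (300, 350), "MAIN": (400, 499)}

print(resolve_ties(ranked, 99, ranges))
```

Returning None when more than one tied candidate passes keeps the sketch conservative; such cases could be routed to the operator review described above.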
  • a stage may be added after stage 235 to invoke stage 255 in the case where the ranked list is empty, or where none of the fuzzy matches in the ranked list exceeds a minimum threshold of similarity to the input street name.
  • stages may be added such that when an exactly matching input street name fails the number range test (stage 245, no), the exactly matching street name is then treated as a non-exactly matching street name and provided as input to stage 230.
  • similar variations of the illustrated process could be applied to fields of an address other than the street name field 325, such as the city name field 330 or the ZIP Code™ field 340.
  • the process may attempt to match an input ZIP Code™ delivery code to a table of ZIP Codes™ encompassed by a focal locale determined by the city 330 and state 335 fields of the input address.
  • Although Figs. 2A and 2B are explained in the context of digital representations of words and numbers that are part of an input address from a delivery service source such as a mailing list file, the words or numbers being processed could come from other sources without departing from the scope of the invention.
  • an input character string of interest could have been read by an OCR system, typed in by a user, interpreted from "bubbles" filled in with a number two pencil on a Scantron™ sheet or other machine-readable form, user-entered with a stylus on a touch screen, such as is common on personal digital assistant devices, or obtained from any other source of machine-read character strings.
  • Figure 4 illustrates an exemplary computing system 400 that may be used to implement embodiments of the invention.
  • the components and arrangement, however, are not critical to the present invention.
  • Computing system 400 includes a number of components, such as a central processing unit (CPU) 410, a memory 420, an input/output (I/O) device(s) 430, and a database 460.
  • System 400 can be implemented in various ways.
  • For example, it may be implemented as an integrated platform such as a workstation, personal computer, or laptop.
  • components 410, 420, and 430 may connect through a local bus interface and access database 460 (implemented as a separate database platform).
  • the access connection may be implemented through a direct communication link, a local area network (LAN), a wide area network (WAN) and/or other suitable connections.
  • System 400 may be standalone or it may be part of a subsystem, which may, in turn, be part of a larger system, such as an OCR system, sorting system, mailing list maintenance system, inventory system, employee records system, financial records system or document processing system.
  • CPU 410 may be one or more known processing devices, such as a microprocessor from the Pentium™ family manufactured by Intel™.
  • Memory 420 may be one or more storage devices configured to store information accessed, read, and/or used by CPU 410 to perform certain functions and processes related to embodiments of the present invention.
  • Memory 420 may be a volatile or nonvolatile, magnetic, semiconductor, tape, optical, or other type of storage device or computer-readable medium.
  • memory 420 includes one or more application programs or subprograms 425 that, when executed by CPU 410, perform various methods or processes consistent with the present invention.
  • memory 420 may include a correction program 425 that validates or corrects a digital representation, such as a character string, of a word or number, such as the street name character string from an input address character string, or memory 420 may include a comparison program 425 implementing a process that searches for valid digital representations of a word that match an input word, or memory 420 may include an analysis application program 425 that analyzes information related to the information in a character string for use in determining the correctness of, and if necessary correcting, the character string.
  • Memory 420 may also include other programs that perform other functions and processes, such as programs that maintain electronic mailing lists and programs that perform delivery point verification of a standardized address character string. The programs in memory 420 may communicate with each other.
  • memory 420 may be configured with a program 425 that performs several functions when executed by CPU 410. That is, memory 420 may include a program 425 that performs database information extraction functions, search table construction functions, character recognition functions, digital representation (such as a character string) matching functions, character string substitution or correction functions, and machine control functions.
  • CPU 410 may execute one or more programs located remotely from system 400. For example, system 400 may access one or more remote programs that, when executed, perform functions related to embodiments of the present invention.
  • Memory 420 may also be configured with an operating system (not shown) that performs several functions well known in the art when executed by CPU 410.
  • the operating system may be Microsoft Windows™, Unix™, Linux™, an Apple Computers operating system, a Personal Digital Assistant operating system such as Microsoft CE™, or another operating system. The choice of operating system, and even the use of an operating system, is not critical to the invention.
  • I/O device(s) 430 may comprise one or more input/output devices that allow data to be received and/or transmitted by system 400.
  • I/O device 430 may include one or more input devices, such as a keyboard, touch screen, mouse, and the like, that enable data to be input from a user.
  • I/O device 430 may include one or more output devices, such as a display screen, CRT monitor, LCD monitor, plasma display, printer, speaker devices, and the like, that enable data to be output or presented to a user.
  • I/O device 430 may also include one or more digital and/or analog communication input/output devices that allow computing system 400 to communicate with other machines and devices, including control communications.
  • the configuration and number of input and/or output devices incorporated in I/O device 430 are not critical to the invention.
  • Database 460 may comprise one or more databases that store information and are accessed and/or managed through system 400.
  • database 460 may be an Oracle™ database, a Sybase™ database, or other relational database.
  • Database 460 may include, for example, tables or lists of valid digital representations, such as character strings, of address information, such as street name character strings, address information data sets, databases of address fields cross-referenced to other related address fields, geographic data, delivery point data, employee data, governmental data, etc.
  • Systems and methods of the present invention are not limited to separate databases or even to the use of a database, as data can come from practically any source, such as the internet and other organized collections of data.
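
The threshold and tie-breaking variations of stages 230 through 250 described above can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the specification: the similarity measure is taken to be one minus the Levenshtein edit distance divided by the longer string length, the names `edit_distance`, `similarity`, `rank_matches`, and `resolve_street` are hypothetical, and the table mapping each street name to a single building number range is a simplified stand-in for the search table. Only the 67% threshold and the "Marne" / "Maine" / "Main" example come from the text.

```python
from typing import Dict, List, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed metric: 1 - distance / longer length, case-insensitive."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a.upper(), b.upper()) / longest


def rank_matches(input_name: str, valid_names: List[str],
                 threshold: float = 0.67) -> List[Tuple[float, str]]:
    """Keep valid names at or above the threshold, best match first."""
    scored = [(similarity(input_name, name), name) for name in valid_names]
    kept = [pair for pair in scored if pair[0] >= threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


def resolve_street(input_name: str, building_number: int,
                   ranges: Dict[str, Tuple[int, int]],
                   threshold: float = 0.67) -> Optional[str]:
    """Return the single top-ranked name that passes the range check.

    `ranges` is a hypothetical table: street name -> (low, high) numbers.
    None stands for "not in the focal locale" or an unresolved tie.
    """
    ranked = rank_matches(input_name, list(ranges), threshold)
    if not ranked:
        return None
    top_score = ranked[0][0]
    tied = [name for score, name in ranked if score == top_score]
    passing = [name for name in tied
               if ranges[name][0] <= building_number <= ranges[name][1]]
    return passing[0] if len(passing) == 1 else None


if __name__ == "__main__":
    table = {"Maine": (100, 500), "Main": (1, 99), "Marie": (600, 900)}
    print(rank_matches("Marne", list(table)))
    print(resolve_street("Marne", 250, table))
```

In this sketch "Maine" and "Marie" tie for the input "Marne", and the building number range check is what separates them, mirroring the variation in which each tied choice is range-checked and the only passing one is output. Other tie-breaking criteria mentioned above, such as the addressee name or review by a human operator, are not modeled.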

Abstract

Systems, methods, and software for determining whether a field of an input digital representation of information, such as the street name field in an address, is correct by quickly comparing the field to a list of valid choices for that field. The list of valid choices is generated based on information from the input digital representation, such as a character string. If there is no exact match, a fuzzy matching comparison determines the closest matching valid choice. If no valid fuzzy match is found, the input information is invalid. Otherwise, another field of the input information, such as the building number field of an address, is tested for validity. If the second field passes the validity check, the fuzzy-matched (or exact-matched) field is valid. A fuzzy-matched field may replace the input field, thus correcting the input information.
PCT/US2007/015123 2007-06-29 2007-06-29 Systems and methods for validating an address Ceased WO2009005492A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2007/015123 WO2009005492A1 (fr) 2007-06-29 2007-06-29 Systems and methods for validating an address

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2007/015123 WO2009005492A1 (fr) 2007-06-29 2007-06-29 Systems and methods for validating an address

Publications (1)

Publication Number Publication Date
WO2009005492A1 (fr) 2009-01-08

Family

ID=40226356

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2007/015123 Ceased WO2009005492A1 (fr) Systems and methods for validating an address 2007-06-29 2007-06-29

Country Status (1)

Country Link
WO (1) WO2009005492A1 (fr)

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 07810042; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 07810042; Country of ref document: EP; Kind code of ref document: A1)