ES2302578B1

ES2302578B1 - PROCEDURE TO REDUCE THE NUMBER OF INDIVIDUAL COMPARISONS OF SECURITIES IN A PROCESS IN WHICH IT IS NECESSARY TO OBTAIN DISTANCES OF SIMILARITY TWO TO TWO BETWEEN SECURITIES OR RECORDS.

Info

Publication number: ES2302578B1
Application number: ES200501598A
Authority: ES
Inventors: Jordi Gomez Bao; Josep Lluis LARRIBA PEY; Jordi Galvez Escañuela; Josepa Ribes Puig
Original assignee: Universitat Politecnica de Catalunya UPC
Current assignee: Universitat Politecnica de Catalunya UPC
Priority date: 2005-06-23
Filing date: 2005-06-23
Publication date: 2009-05-20
Anticipated expiration: 2025-06-23
Also published as: ES2302578A1

Abstract

Procedimiento para reducir el número de comparaciones individuales de valores en un proceso en el que sea necesario obtener distancias de similitud dos a dos entre valores o registros.Procedure to reduce the number of individual comparisons of values in a process where it is necessary to obtain distances of two-by-two similarity between values or records.

Comprende las etapas de creación de diccionarios de valores para cada atributo de un conjunto de tuplas asignando a cada valor distinto un identificador único; la cuantificación de la frecuencia de aparición de cada valor; la sustitución de los valores de las tuplas por el identificador asignado; la ordenación de los valores descendentemente por su frecuencia de aparición; la traducción de los identificadores de las tuplas por los nuevos identificadores dependientes de la frecuencia; la inicialización del almacén de similitudes; y la comparación de parejas de tuplas mediante la comparación individual de sus valores consultando el almacén de similitudes para evitar repetir la comparación entre dos valores previamente comparados. La invención propone el uso de unas estructuras de datos llamadas almacenes de similitudes que pueden ser implementadas en software o hardware y utilizadas en computadores paralelos o con un solo procesador, en cualquier tipo de aplicación informática o de telecomunicación donde se utilicen computadores para su control. Tales almacenes de similitudes son definidos para un subconjunto de los atributos del conjunto de tuplas. Cada almacén de similitudes guarda los pesos de similitud asignados por una función de comparación a dos valores.
It includes the stages of creating value dictionaries for each attribute of a set of tuples, assigning each different value a unique identifier; the quantification of the frequency of appearance of each value; the substitution of the values of the tuples by the assigned identifier; the ordering of the values descending by their frequency of appearance; the translation of the identifiers of the tuples by the new frequency-dependent identifiers; the initialization of the similarity store; and the comparison of pairs of tuples by individually comparing their values by consulting the store of similarities to avoid repeating the comparison between two previously compared values. The invention proposes the use of data structures called stores of similarities that can be implemented in software or hardware and used in parallel computers or with a single processor, in any type of computer or telecommunication application where computers are used for their control. Such stores of similarities are defined for a subset of the attributes of the set of tuples. Each similarity store stores the similarity weights assigned by a comparison function to two values.

De este modo, tal función sólo es aplicada la primera vez que ambos valores deben ser comparados, aprovechando en posteriores comparaciones el valor guardado en el almacén de similitudes.In this way, such a function is only applied the first time that both values must be compared, taking advantage of subsequent comparisons the value stored in the storage similarities.

Description

La presente solicitud de Patente de Invención consiste conforme indica su enunciado en un "Procedimiento para reducir el numero de comparaciones individuales de valores en un proceso en el que sea necesario obtener distancias de similitud dos a dos entre valores o registros" cuyas nuevas características proporcionan numerosas ventajas tal como se detallará en la presente memoria.The present application for an Invention Patent It consists as indicated by its statement in a "Procedure for reduce the number of individual comparisons of values in a process in which it is necessary to obtain distances of similarity two to two between values or records "whose new characteristics provide numerous benefits as will be detailed in the present memory.

Más concretamente, la invención se refiere a un procedimiento para almacenar el peso de similitud asignado por una función de comparación para su posterior reaprovechamiento, evitando de este modo repetir el cálculo. El procedimiento requiere que las funciones de comparación asignen un peso de similitud a cada pareja de valores, como sería el caso en la comparación aproximada entre cadenas de caracteres. Esencialmente, el procedimiento se basa en recordar los pesos de similitud asignados por una función de comparación a parejas de valores.More specifically, the invention relates to a procedure for storing the similarity weight assigned by a comparison function for later reuse, thus avoiding repeating the calculation. The procedure requires have the comparison functions assign a similarity weight to each pair of values, as would be the case in the comparison approximate between character strings. Essentially, the procedure is based on remembering the assigned similarity weights by a comparison function to pairs of values.

Definitions

Se entiende por identificador aquel valor que permite dar a conocer un elemento de un conjunto unívocamente.An identifier is understood to be the value that allows to unambiguously reveal an element of a set.

Un ejemplo de identificador podría ser un número entero que permite identificar unívocamente a un valor concreto.An example identifier could be a number integer that uniquely identifies a value concrete.

Se entiende por tupla o registro (a partir de ahora, tupla) el conjunto de valores cuyo significado es descrito por un conjunto de atributos y que describen una determinada entidad como, por ejemplo, un determinado sujeto, objeto o evento.It is understood by tuple or record (from now, tuple) the set of values whose meaning is described by a set of attributes and that describe a certain entity such as a certain subject, object or event.

Se entiende por estructura de datos informática (a partir de ahora, estructura de datos), un espacio de memoria organizado de tal manera que las posiciones que almacena se pueden acceder de forma predeterminada a través de un programa informático.It is understood by computer data structure (from now on, data structure), a memory space organized in such a way that the positions it stores can be accessed by default through a program computer scientist.

Un ejemplo de estructura de datos podría ser una matriz bidimensional triangular inferior (o superior) de D por D posiciones, que no contiene la diagonal principal. Esta matriz almacena valores enteros, booleanos o reales. Cada dimensión de la matriz puede direccionarse a través de un índice, por ejemplo, i y j, notándose (i,j) en forma compacta. En esta estructura, los rangos del índice deben cumplir 1<=i<j<=D (1<=j<i<=D para una matriz triangular superior), entendiendo i como el índice vertical (o de fila) y j como el índice horizontal (o de columna).An example of a data structure could be a lower (or upper) triangular two-dimensional matrix of D by D positions, which does not contain the main diagonal. This matrix stores integer, boolean, or real values. Each dimension of the array can be addressed through an index, for example i and j, noting (i, j) in compact form. In this structure, index ranges must meet 1 <= i <j <= D (1 <= j <i <= D for an upper triangular matrix), understanding i as the vertical (or row) index and j as the horizontal (or column) index.

Otro ejemplo de estructura de datos podría ser un diccionario, donde cada valor diferente tiene asociada una definición o un conjunto de sinónimos.Another example of a data structure could be a dictionary, where each different value has an associated definition or a set of synonyms.

Se entiende por peso de similitud el resultado de la aplicación a través de un programa informático de una función de comparación sobre dos valores asignando una medida de distancia o similitud. La función de comparación puede dar un resultado booleano de igual o distinto, o una gradación en una escala de valores representada a través de números enteros o reales que permita estimar la distancia o grado de similitud existente entre los dos valores.By weight of similarity is understood the result of the application through a one-function computer program comparison on two values assigning a distance measure or similarity. The compare function can give a result Boolean of equal or different, or a gradation on a scale of values represented by integers or real numbers that allows estimating the distance or degree of similarity existing between the two values.

Process

Se explica el procedimiento a través del proceso en el que se comparan entre sí un conjunto de tuplas compuestas por un atributo, a pesar de que el procedimiento es generalizable a tuplas con cualquier número de atributos. Se realizan todas las comparaciones posibles entre tuplas, aunque el proceso podría limitarse a un subconjunto de estas.The procedure is explained through the process in which a set of composite tuples are compared to each other by an attribute, even though the procedure is generalizable to tuples with any number of attributes. All the possible comparisons between tuples, although the process could be limited to a subset of these.

El objetivo del procedimiento es evitar repetir la comparación entre dos valores almacenando su peso de similitud en una estructura de datos que llamaremos almacén de similitudes en adelante. Para el ejemplo usaremos una matriz triangular inferior (o superior) como almacén de similitudes. El procedimiento se dividirá en las siguientes etapas:The goal of the procedure is to avoid repeating the comparison between two values storing their similarity weight in a data structure that we will call a store of similarities in ahead. For the example we will use a lower triangular matrix (or higher) as a store of similarities. The procedure is divided into the following stages:

Creación de un diccionario. Al principio se hará una lectura de todas las tuplas, creando un diccionario con los distintos valores de las tuplas para el atributo. Los valores se insertarán en el diccionario por orden de aparición asignándoles un identificador único. Creation of a dictionary . At the beginning, a reading of all the tuples will be done, creating a dictionary with the different values of the tuples for the attribute. The values will be inserted into the dictionary in order of appearance, assigning them a unique identifier.

Cuantificación de la frecuencia de aparición de los valores. Se contará el número de apariciones de cada valor en el conjunto de tuplas para obtener su frecuencia. Quantification of the frequency of appearance of the values . The number of occurrences of each value in the set of tuples will be counted to obtain its frequency.

Reescritura de las tuplas con identificadores. Simultáneamente a la creación del diccionario, se rescribirán todas las tuplas de tal forma que cada valor del atributo será sustituido por su identificador. Así, las tuplas estarán compuestas por un identificador que permite recuperar del diccionario el valor original. Rewriting the tuples with identifiers . Simultaneously with the creation of the dictionary, all the tuples will be rewritten in such a way that each value of the attribute will be replaced by its identifier. Thus, the tuples will be composed of an identifier that allows to retrieve the original value from the dictionary.

Ordenación de los valores del diccionario por frecuencia de aparición. Una vez extraída la información del número de apariciones de cada valor en el conjunto de tuplas, se ordenarán los valores descendentemente por frecuencia de aparición. Ordering of dictionary values by frequency of occurrence . Once the information on the number of occurrences of each value in the set of tuples has been extracted, the values will be ordered descending by frequency of occurrence.

Reasignación de identificadores. Habiendo reordenado los valores del diccionario se reasignarán los identificadores siguiendo una función inversa de la frecuencia, de tal modo que a mayor frecuencia menor identificador (aunque la reasignación podría realizarse siguiendo otro tipo de función). Se realizará una traducción del conjunto de tuplas substituyendo el antiguo identificador por el nuevo identificador dependiente de la frecuencia de aparición del valor. Reassignment of identifiers . Having reordered the dictionary values, the identifiers will be reassigned following an inverse function of frequency, in such a way that the higher the frequency, the lower the identifier (although the reassignment could be carried out following another type of function). A translation of the set of tuples will be carried out, replacing the old identifier with the new identifier depending on the frequency of appearance of the value.

Inicialización del almacén de similitudes. Se comenzará inicializando la matriz triangular inferior mencionada anteriormente con un valor nulo para cada una de sus posiciones, indicando que el peso correspondiente a los valores cuyos identificadores determinan la posición no ha sido calculado. Siendo N el número de valores distintos del diccionario, la matriz triangular tendrá un número de filas y columnas igual a D<=N. En el caso de D<N existirán parejas de valores sin una posición asociada en el almacén de similitudes. Similarity store initialization . We will start by initializing the lower triangular matrix mentioned above with a null value for each of its positions, indicating that the weight corresponding to the values whose identifiers determine the position has not been calculated. Where N is the number of different values in the dictionary, the triangular matrix will have a number of rows and columns equal to D <= N. In the case of D <N there will be pairs of values without an associated position in the store of similarities.

Comparación de tuplas. El proceso de comparación de parejas de tuplas tiene como objetivo encontrar una medida de distancia entre las mismas, y se realizará mediante la comparación de cada uno de los atributos, dos a dos, para cada pareja. Tuples comparison . The objective of the process of comparing pairs of tuples is to find a measure of distance between them, and it will be carried out by comparing each of the attributes, two by two, for each pair.

Dados dos identificadores i y j para un atributo, se determinará la posición de la matriz correspondiente al peso de similitud entre los valores correspondientes a esos identificadores por (i,j) si i<j o por (j,i) si i>j ((i,j) si i>j o por (j,i) si i<j para una matriz triangular superior).Given two identifiers i and j for a attribute, the position of the corresponding array will be determined the weight of similarity between the values corresponding to those identifiers by (i, j) if i <j or by (j, i) if i> j ((i, j) if i> j or by (j, i) if i <j for a triangular matrix higher).

En la comparación de dos valores pertenecientes al mismo atributo, hay tres posibles casos de uso del almacén de comparaciones.In comparing two values belonging to the same attribute, there are three possible use cases for the comparisons.

Primero, si los identificadores i y j de dos valores son iguales no hará falta comparar puesto que la similitud será total y se asignará el máximo peso de similitud.First, if the identifiers i and j of two values are equal, no need to compare since the similarity it will be total and the maximum weight of similarity will be assigned.

Segundo, si los identificadores i y j son distintos y alguno de ellos tiene un valor superior a D, no existe una posición asociada en el almacén de similitudes. Por tanto, se hace necesario acceder al diccionario para obtener los valores correspondientes a los identificadores y realizar su comparación.Second, if the identifiers i and j are different and some of them have a value greater than D, there is no an associated position in the similarity store. Therefore, makes it necessary to access the dictionary to get the values corresponding to the identifiers and make your comparison.

Tercero, si los identificadores i y j son distintos y ambos tienen un valor inferior a D, existe una posición asociada en el almacén de similitudes. Existen dos casos: (i) si el valor de esta posición es nulo, se hace necesario acceder al diccionario para obtener los valores correspondientes para los identificadores y realizar su comparación y se almacena el resultado de la comparación en la posición asociada del almacén de similitudes; (ii) si el valor de esta posición es no nulo, se entenderá que ese es el peso de similitud asociado a los dos valores y, por tanto, no será necesario compararlos.Third, if the identifiers i and j are different and both have a value less than D, there is a position associated in the similarity store. There are two cases: (i) if the value of this position is null, it is necessary to access the dictionary to obtain the corresponding values for the identifiers and make their comparison and the result of the comparison in the associated position of the warehouse similarities; (ii) if the value of this position is non-zero, you will understand that this is the weight of similarity associated with the two values and therefore do not need to be compared.

State of the art

No se tiene constancia de que se haya propuesto un procedimiento de este tipo para ahorrar comparaciones de valores. En todo caso, en el ámbito de Record Linkage se han propuesto distintas técnicas que reducen el número de comparaciones entre registros, mediante técnicas de agrupación, ordenación o filtrado. Un ejemplo de técnicas de este estilo es blocking, donde los registros se ordenan por un criterio de ordenación y sólo se comparan aquellos que coinciden en el valor del criterio de ordenación. Estas técnicas persiguen objetivos muy distintos que los del procedimiento aquí propuesto. La técnica de blocking persigue una reducción en el número de comparaciones de registros, realizando en cada una de ellas la comparación individual de los valores de todos sus atributos. En cambio, el procedimiento propuesto persigue evitar repetir el mayor número posible de comparaciones de valores, almacenando los pesos de similitud en el almacén de similitudes. Esto último es independiente de las comparaciones entre registros realizadas. Al igual que con blocking, lo mismo ocurre con técnicas como sliding window, canopy clustering, bigram indexing u otras técnicas basadas en la comparación de registros.No such procedure is known to have been proposed to save value comparisons. In any case, in the field of Record Linkage , different techniques have been proposed that reduce the number of comparisons between records, by means of grouping, ordering or filtering techniques. An example of techniques of this style is blocking , where records are ordered by a sort order and only those that match the value of the sort order are compared. These techniques pursue very different objectives than those of the procedure proposed here. The blocking technique seeks to reduce the number of record comparisons, performing in each one of them the individual comparison of the values of all its attributes. Instead, the proposed procedure seeks to avoid repeating the largest possible number of value comparisons, storing the similarity weights in the similarity store. The latter is independent of the comparisons between records made. As with blocking , the same happens with techniques such as sliding window, canopy clustering, bigram indexing or other techniques based on the comparison of records.

Blocking se propuso en H. Newcombe, Record Linking the design of efficient systems for linking records into individuals and family histories, American Journal of Human Genetics 19 no. 3, 1967. Sliding Window, se propuso en S. Stolfo, M. Hernandez, The Merge/Purge problem for Large Databases, ACM SIGMOD conference, pp. 127-138, 1995. Canopy clustering se propuso en A. McCallum, et al., Efficient clustering of high dimensional data sets with application to reference matching, ACM SIGKDD, pp. 169-178, 2000. Bigram indexing se propuso en P. Christen and T. Churches, Feberl: Freely extensible biomedical record linkage manual, release 0.2 edition, April 2003. Blocking was proposed in H. Newcombe, Record Linking the design of efficient systems for linking records into individuals and family histories, American Journal of Human Genetics 19 no. 3, 1967. Sliding Window , proposed in S. Stolfo, M. Hernandez, The Merge / Purge problem for Large Databases, ACM SIGMOD conference, pp. 127-138, 1995. Canopy clustering was proposed in A. McCallum, et al ., Efficient clustering of high dimensional data sets with application to reference matching, ACM SIGKDD, pp. 169-178, 2000. Bigram indexing was proposed in P. Christen and T. Churches, Feberl: Freely extensible biomedical record linkage manual, release 0.2 edition, April 2003.

Considerations

El procedimiento aquí descrito compara dos valores una sola vez y usa el peso de similitud guardado en el almacén de similitudes para posteriores comparaciones.The procedure described here compares two values only once and uses the similarity weight stored in the store of similarities for later comparisons.

El objetivo de la invención descrita es ahorrar tiempo de ejecución a cambio de almacenar valores de comparación para parejas de valores en un almacén de similitudes y un diccionario.The aim of the described invention is to save runtime in exchange for storing comparison values for pairs of values in a store of similarities and a dictionary.

La presente invención tiene como ventaja que puede usarse en cualquier proceso basado en la comparación de registros. Persigue evitar la repetición de la comparación de una misma pareja de valores, que tiene una posición asociada en el almacén de similitudes. Esto ahorra una gran cantidad de tiempo en la ejecución de los programas.The present invention has the advantage that can be used in any process based on the comparison of records. It seeks to avoid repeating the comparison of a same pair of values, which has an associated position in the store of similarities. This saves a great deal of time on the execution of the programs.

El procedimiento permite que el almacén de similitudes sea escalable en función del tamaño de memoria disponible y del número de atributos con un coste elevado de comparación de sus valores.The procedure allows the warehouse of similarities be scalable based on memory size available and the number of attributes with a high cost of comparison of their values.

EjemploExample

Use of the procedure in an environment of Statistical File Merger (FEF, hereinafter), Record Linkage , in English

El proceso de la FEF permite detectar distintas instancias de una misma entidad que aparecen repetidas en uno o más ficheros y que no se pueden detectar de forma trivial por errores en la codificación de los valores de los atributos que las componen u omisiones de datos.The FEF process allows detecting different instances of the same entity that appear repeated in one or more files and cannot be trivially detected due to errors in the coding of the values of the attributes that compose them or omissions of data.

En este caso se hace necesaria la comparación probabilística de valores mediante funciones de comparación aproximada, obteniendo un peso de similitud entre los valores.In this case the comparison is necessary probabilistic values using comparison functions approximate, obtaining a weight of similarity between the values.

La comparación de tuplas usada como hilo conductor en la explicación del procedimiento defendido en este documento, puede aplicarse a la FEF.The tuple comparison used as a thread driver in the explanation of the procedure defended in this document, can be applied to the FEF.

Claims

1. Procedure to reduce the number of individual comparisons of values in a process in which it is necessary to obtain two by two similarity distances between values or records characterized in that it comprises the steps of:

Creation of a dictionary of values, assigning an identifier for each different value depending on its appearance in reading all the tuples to compare.

Quantification of the frequency of appearance of each distinct value in the set of tuples to compare.

Rewriting the tuples to compare replacing each value with its identifier.

Dictionary sort by frequency appearance of each value.

Reassignment of identifiers replacing the previous identifier by the new identifier dependent on the frequency of occurrence of the value.

Initialization of the store of similarities that will store the similarity weights obtained from the comparison of pairs of values.

Tuples comparison pair to pair consulting, when possible, the store of similarities.

2. Procedure to reduce the number of individual comparisons of values in a process in which it is necessary to obtain two-to-two similarity distances between values or records according to claim 1, characterized in that the dictionary is implemented with any type of data structure and with any method of creating and maintaining dictionaries.

3. Procedure to reduce the number of individual comparisons of values in a process in which it is necessary to obtain distances of two by two similarity between values or records according to claim 1, characterized in that the store of similarities is implemented with any type of structure data and with any method of creating and maintaining data structures.

4. Method for reducing the number of individual comparisons of values in a process in which it is necessary to obtain distances of two by two similarity between values or records according to claim 1, characterized in that the data structures are implemented in software.

5. Procedure to reduce the number of individual comparisons of values in a process in which it is necessary to obtain distances of similarity two by two between values or registers according to claim 1, characterized in that the data structures are implemented in hardware or electronic circuits .

6. Procedure to reduce the number of individual comparisons of values in a process in which it is necessary to obtain distances of similarity two by two between values or records according to claim 3, characterized in that the store of similarities stores any fraction of the space of pairs of values.

7. Procedure to reduce the number of individual comparisons of values in a process in which it is necessary to obtain two-to-two similarity distances between values or records according to claim 1, characterized in that it is used in combination with any probabilistic or deterministic method of comparison of records.

8. Procedure to reduce the number of individual comparisons of values in a process in which it is necessary to obtain distances of two by two similarity between values or records according to claim 1, characterized in that it is used in Statistical Fusion of Files or Record processes Linkage , in English.

9. Procedure to reduce the number of individual comparisons of values in a process in which it is necessary to obtain two-to-two similarity distances between values or records according to claim 1, characterized in that it is used in any algorithmic method in which they have to compare values that appear with frequency greater than 1.

10. Procedure to reduce the number of individual comparisons of values in a process in which it is necessary to obtain two-to-two similarity distances between values or records according to claim 1, characterized in that it is implemented in a sequential computer, or with a single processor.

11. Procedure to reduce the number of individual comparisons of values in a process in which it is necessary to obtain two by two similarity distances between values or records according to claim 1, characterized in that it is implemented in a parallel computer, or with multiple processors , with shared or distributed memory
buida.

12. Procedure for reducing the number of individual comparisons of values in a process in which it is necessary to obtain distances of two-to-two similarity between values or records according to claim 1, characterized in that it is used in data processing computer applications.

13. Procedure to reduce the number of individual comparisons of values in a process in which it is necessary to obtain distances of two by two similarity between values or records according to claim 1, characterized in that it is implemented on any method of reducing the number of comparisons between records such as, for example, blocking, sliding window, canopy clustering or bigram indexing among others.