HK1211109B

HK1211109B - Profiling data with location information

Info

Publication number: HK1211109B
Application number: HK15111884.9A
Authority: HK
Inventors: 阿伦‧安德森
Original assignee: 起元科技有限公司
Priority date: 2012-10-22
Filing date: 2013-08-02
Publication date: 2020-02-07

Description

Profiling data using location information

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Application No. 61/716,766, filed on October 22, 2012, which is incorporated herein by reference.

BACKGROUND ART

This specification relates to profiling data using location information.

Stored data sets often include data whose various characteristics are unknown. The data in a data set can be organized as records that have values for different fields (also called "attributes" or "columns"). The values in a field can include strings, numbers, or any data encoded or formatted according to the data format information associated with that field (including values that may be invalid). In some cases, the data format information for a field is known, but the actual values that appear in the field may not be known. For example, the range of values or the typical values of a field across the records in a data set, the relationships between different fields of the records in a data set, or the dependencies among values in different fields may be unknown. Data profiling involves examining the source of a data set to determine such characteristics.

SUMMARY OF THE INVENTION

In one aspect, in general, a method for profiling data stored in at least one data storage system includes: accessing at least one collection of records stored in the data storage system over an interface coupled to the data storage system; and processing the collection of records to generate result information characterizing values appearing in one or more specified fields of the collection of records. The processing includes: generating, for a first set of distinct values appearing in a first set of one or more fields of the records in the collection, corresponding location information that identifies, for each distinct value in the first set of distinct values, every record in which the distinct value appears; generating, for the first set of one or more fields, a corresponding list of entries, with each entry identifying a distinct value from the first set of distinct values and the location information for the distinct value; generating, for a second set of one or more fields of the records in the collection different from the first set of one or more fields, a corresponding list of entries, with each entry identifying a distinct value from a second set of distinct values appearing in the second set of one or more fields; and generating result information characterizing values appearing in one or more specified fields of the collection of records, based at least in part on: locating at least one record of the collection of records using the location information for at least one value appearing in the first set of one or more fields, and determining at least one value appearing in the second set of one or more fields of the located record.

These aspects may include the following features.

Each entry further identifies a count of the number of records in which a distinct value appears in a set of one or more fields.

The processing further includes sorting the entries in each list by the identified count.

The processing further includes generating, for the second set of distinct values, corresponding location information that identifies, for each distinct value in the second set of distinct values, every record in which the distinct value appears. For the list corresponding to the second set of one or more fields, each entry identifying a distinct value from the second set of distinct values includes the location information for that distinct value.

The processing further includes generating, for a set of distinct pairs of values, corresponding location information, where the first value in each pair appears in the first set of one or more fields of the records and the second value in each pair appears in the second set of one or more fields of the records, the location information identifying, for each distinct pair of values, every record in which the distinct pair of values appears.

Generating location information for a distinct pair of values from the set of distinct pairs of values includes determining an intersection between the location information for a first distinct value from the first set of distinct values and the location information for a second distinct value from the second set of distinct values.

Determining the intersection includes using the location information for the first distinct value to locate a record in the collection, and using the located record to determine the second distinct value.

The method further includes sorting a group of multiple lists by the number of distinct values identified in the entries of each list, the group of multiple lists including the list corresponding to the first set of one or more fields and the list corresponding to the second set of one or more fields.

The processing further includes: generating, for a set of distinct pairs of values, corresponding location information, where the first value in each pair appears in the first set of one or more fields of the records and the second value in each pair appears in the second set of one or more fields of the records, the second set of one or more fields being different from the first set of one or more fields, and the location information identifying, for each distinct pair of values, every record in which the distinct pair of values appears; and generating, for the set of distinct pairs of values, a corresponding list of entries, with each entry identifying a distinct pair of values from the set of distinct pairs of values and the location information for the distinct pair of values.

The location information identifies a unique index value for every record in which the distinct value appears.

The location information identifies a particular unique index value by storing that particular unique index value.

The location information identifies a unique index value by encoding the unique index value within the location information.

Encoding the unique index value includes storing a bit at a position within a vector that corresponds to the unique index value.

The collection includes a first subset of records with fields that include the first set of one or more fields, and a second subset of records with fields that include the second set of one or more fields.

The processing further includes generating information that provides a mapping between: (1) index values of a field of the first subset of records, which associates a unique index value with every record in the first subset; and (2) key values of a field of the second subset of records, which associates a key value with every record in the second subset; where the key values link records in the second subset with records in the first subset.

The location information identifies the unique index value of every record in which a distinct value appears.

In another aspect, a computer program, stored on a computer-readable medium, for profiling data stored in at least one data storage system includes instructions for causing a computing system to: access at least one collection of records stored in the data storage system over an interface coupled to the data storage system; and process the collection of records to generate result information characterizing values appearing in one or more specified fields of the collection of records. The processing includes: generating, for a first set of distinct values appearing in a first set of one or more fields of the records in the collection, corresponding location information that identifies, for each distinct value in the first set of distinct values, every record in which the distinct value appears; generating, for the first set of one or more fields, a corresponding list of entries, with each entry identifying a distinct value from the first set of distinct values and the location information for the distinct value; generating, for a second set of one or more fields of the records in the collection different from the first set of one or more fields, a corresponding list of entries, with each entry identifying a distinct value from a second set of distinct values appearing in the second set of one or more fields; and generating result information characterizing values appearing in one or more specified fields of the collection of records, based at least in part on: locating at least one record of the collection of records using the location information for at least one value appearing in the first set of one or more fields, and determining at least one value appearing in the second set of one or more fields of the located record.

In another aspect, a computing system for profiling data stored in at least one data storage system includes: an interface coupled to the data storage system and configured to access at least one collection of records stored in the data storage system; and at least one processor configured to process the collection of records to generate result information characterizing values appearing in one or more specified fields of the collection of records. The processing includes: generating, for a first set of distinct values appearing in a first set of one or more fields of the records in the collection, corresponding location information that identifies, for each distinct value in the first set of distinct values, every record in which the distinct value appears; generating, for the first set of one or more fields, a corresponding list of entries, with each entry identifying a distinct value from the first set of distinct values and the location information for the distinct value; generating, for a second set of one or more fields of the records in the collection different from the first set of one or more fields, a corresponding list of entries, with each entry identifying a distinct value from a second set of distinct values appearing in the second set of one or more fields; and generating result information characterizing values appearing in one or more specified fields of the collection of records, based at least in part on: locating at least one record of the collection of records using the location information for at least one value appearing in the first set of one or more fields, and determining at least one value appearing in the second set of one or more fields of the located record.

In another aspect, a computing system for profiling data stored in at least one data storage system includes: means for accessing at least one collection of records stored in the data storage system; and means for processing the collection of records to generate result information characterizing values appearing in one or more specified fields of the collection of records. The processing includes: generating, for a first set of distinct values appearing in a first set of one or more fields of the records in the collection, corresponding location information that identifies, for each distinct value in the first set of distinct values, every record in which the distinct value appears; generating, for the first set of one or more fields, a corresponding list of entries, with each entry identifying a distinct value from the first set of distinct values and the location information for the distinct value; generating, for a second set of one or more fields of the records in the collection different from the first set of one or more fields, a corresponding list of entries, with each entry identifying a distinct value from a second set of distinct values appearing in the second set of one or more fields; and generating result information characterizing values appearing in one or more specified fields of the collection of records, based at least in part on: locating at least one record of the collection of records using the location information for at least one value appearing in the first set of one or more fields, and determining at least one value appearing in the second set of one or more fields of the located record.

In another aspect, a method, computer-readable medium, and system for profiling data stored in at least one data storage system include: accessing at least one collection of records stored in the data storage system over an interface coupled to the data storage system; and processing the collection of records to generate result information characterizing values appearing in one or more specified fields of the collection of records. The processing includes: generating, for a first set of distinct values of a first set of two or more fields, a corresponding list of entries, with each entry identifying a distinct combination of values appearing in the first set of two or more fields of the records in the collection, and profile information for that distinct combination of values; and generating result information characterizing values appearing in one or more specified fields of the collection of records, based at least in part on: combining profile information from the list of entries for at least two distinct combinations of values appearing in the first set of two or more fields, and determining, based on the combined profile information, profile information for at least one value appearing in at least one of the one or more specified fields.

These aspects may include one or more of the following advantages.

Some data profiling procedures compute measures of the data quality of a data set by compiling a census of the distinct values in a domain of the records of the data set, where a "domain" consists of one or more fields, combinations of fields, or fragments of fields of the records of the data set. When a census is compiled for a domain, census data is stored that enumerates the set of distinct values for that domain and includes a count of the number of records having each distinct value. For example, the census data can be arranged as a list of value count entries for a selected domain, with each value count entry including a distinct value appearing in the selected domain and a count of the number of records in which that distinct value appears in the selected domain. In some implementations, each field is a separate domain. In some implementations, the census data is stored in a single data set, optionally indexed by field for fast random access, while in other implementations the census data may be stored in multiple data sets, for example, one for each field.

Measures of data quality may include the number and distribution of distinct values, the number and distribution of valid or invalid values according to specified validation rules, the number and distribution of values in one set of one or more fields when the values in another set of one or more fields are held fixed (also called "segmentation"), and the correlation between values in two or more fields (also called "functional dependency"). Each time a particular measure is to be computed, a suitable census can be taken by processing the data in the data set. However, in some cases, such as when computing data quality measures for combinations of fields, it is not necessary to process the full volume of data again; instead, the computation for the combination of fields can use stored census data already computed for the individual fields.

In some implementations, the census data for a selected domain includes location information that identifies, for each distinct value in the census data, every record in which that distinct value appears in the selected domain. The location information needs to be computed only once over the full volume of data. Subsequent evaluation of data quality measures involving combinations of fields (in particular, measures involving segmentation, correlation, or validation rules that span multiple fields) can be performed directly from the existing census data with location information, without returning to the source storing the records of the data set to compute new census data. This makes the computation of additional data quality measures much more efficient. In addition, census data with location information can be used to drill down into data quality results, that is, to return the underlying data records associated with a data quality result, such as invalid records or duplicate records in a primary key field. If domains of different data sets are being profiled, an index map can be used to avoid the need for a join operation to relate the records of the different data sets.

Other features and advantages of the invention will become apparent from the following description and from the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a system for profiling data.

FIG. 2A is a schematic diagram of operations and data for a data profiling procedure.

FIG. 2B is a flowchart of the data profiling procedure.

FIG. 3 is a schematic diagram of data generated for a functional dependency result.

FIG. 4 is a flowchart of a combined census procedure.

FIG. 5 is a schematic diagram of data generated for a combined census.

FIG. 6A is a flowchart of a procedure for determining functional dependency.

FIG. 6B is an example of a combined census with functional dependency information.

FIG. 7 is a schematic diagram of an edge drill-down procedure.

FIG. 8A is a schematic diagram of index map data.

FIG. 8B is a schematic diagram of functional dependency result data.

FIG. 9 is a schematic diagram of a node drill-down procedure.

FIG. 10 is a schematic diagram of data for a segment census and a segment combined census.

FIG. 11 is a schematic diagram of data for a segment cube.

DETAILED DESCRIPTION

FIG. 1 shows an exemplary data processing system 100 in which the data profiling techniques can be used. The system 100 includes a data source 102 that may include one or more sources of data, such as storage devices or connections to online data streams, each of which may store data in any of a variety of storage formats (e.g., database tables, spreadsheet files, flat text files, or a native format used by a mainframe). An execution environment 104 includes a profiling module 106 and a processing module 108. The execution environment 104 may be hosted on one or more general-purpose computers under the control of a suitable operating system, such as the UNIX operating system. For example, the execution environment 104 can include a multiple-node parallel computing environment, including a configuration of computer systems using multiple central processing units (CPUs) that are either local (e.g., multiprocessor systems such as SMP computers), or locally distributed (e.g., multiple processors coupled as clusters or MPPs), or remote, or remotely distributed (e.g., multiple processors coupled via a local area network (LAN) or wide area network (WAN)), or any combination thereof.

The profiling module 106 reads data from the data source 102 and stores profiling summary information in a profiling data store 110 that is accessible to the profiling module 106 and to the processing module 108. For example, the profiling data store 110 can be maintained within a storage device of the data source 102, or in a separate data storage system accessible from within the execution environment 104. Based on the profiling summary information, the processing module 108 can perform various processing tasks on the data in the data source 102, including cleansing the data, downloading the data to another system, or managing access to objects stored in the data source 102. Storage devices providing the data source 102 may be local to the execution environment 104, for example, stored on a storage medium (e.g., hard drive 112) connected to a computer running the execution environment 104, or may be remote to the execution environment 104, for example, hosted on a remote system (e.g., mainframe 114) that communicates with a computer running the execution environment 104 over a remote connection or service (e.g., provided by a cloud computing infrastructure).

The profiling module 106 is able to read data stored in the data source 102 and efficiently perform various types of analysis, including, for example, analysis useful for computing data quality measures based on functional dependency or segmentation. In some implementations, the analysis includes generating census data for each individual field of the records of a data set stored in the data source 102, and storing that census data in the profiling data store 110. As described above, the census data for a particular domain of the records in a particular data set includes not only an entry for each distinct value appearing in that domain, but can also include location information identifying the respective locations (e.g., in terms of record index values) of the records in the particular data set in which the distinct value appears. In one implementation, while generating the census entry for an associated value, a vector is populated with the unique record identifier of each record having the associated value. If the records in the original data of the data set do not have unique record identifiers, such record identifiers can be generated and added to the records as part of the profiling procedure, for example, by assigning a sequence of consecutive numbers to the records. The location information can then be included within the census entries and used to generate additional combined census data for computing functional dependency or segmentation, as described in more detail below.
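As a minimal sketch of a single-field census with location information, the following Python fragment builds a mapping from each distinct value to the list of record index values where it appears. The function name `build_census`, the dictionary representation, and the sample records are assumptions made for illustration only; they are not taken from the specification.

```python
from collections import defaultdict

def build_census(records, field):
    """Return {distinct value: [record index, ...]} for one field.

    Record indices are assigned here as consecutive integers starting at 1
    (an illustrative choice, standing in for generated record identifiers).
    """
    locations = defaultdict(list)
    for index, record in enumerate(records, start=1):
        locations[record[field]].append(index)
    return dict(locations)

# Hypothetical sample data, used only to show the shape of the result.
records = [
    {"state": "MA", "zip": "02139"},
    {"state": "MA", "zip": "02421"},
    {"state": "NY", "zip": "10001"},
    {"state": "MA", "zip": "02139"},
]
state_census = build_census(records, "state")
print(state_census)
```

Each key plays the role of a census entry's distinct value, and the associated list plays the role of its location information.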

Other implementations for storing the location information are possible, some of which may offer advantages in performance and/or reduced storage space. For example, a bit vector may be used instead of a vector of record identifiers. Each bit of the bit vector corresponds to a particular record identifier, and a bit is set if the record with the corresponding record identifier has the associated value. The correspondence between the bits of the bit vector and the record identifiers may be explicit or implicit. For example, there may be an explicit mapping, not necessarily one-to-one, that associates bit positions with corresponding record identifiers, or there may be an implicit mapping in which the position of each bit corresponds to a sequential ordering of record locations. In some implementations, the resulting bit vector is compressed for further savings in storage.
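The bit vector alternative can be sketched as follows, assuming an implicit mapping in which bit position i-1 corresponds to record index i, and using a Python integer as the bit vector. The helper names are hypothetical, and compression is not shown.

```python
def to_bit_vector(indices):
    """Encode 1-based record indices as an integer bit vector."""
    bits = 0
    for i in indices:
        bits |= 1 << (i - 1)  # set the bit for record index i
    return bits

def from_bit_vector(bits):
    """Decode an integer bit vector back into sorted 1-based record indices."""
    indices = []
    position = 1
    while bits:
        if bits & 1:
            indices.append(position)
        bits >>= 1
        position += 1
    return indices

def count_records(bits):
    """Number of records located by the bit vector (count of set bits)."""
    return bin(bits).count("1")

ma_bits = to_bit_vector([1, 2, 4])
print(bin(ma_bits), from_bit_vector(ma_bits), count_records(ma_bits))
```

One reason this representation can be attractive is that intersecting the locations of two values reduces to a bitwise AND of their bit vectors.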

The location information may also be stored in a vector of bit vectors. For example, each bit of a bit vector may correspond to an associated record identifier through a mapping between bit position and record identifier that is stored in a cross-reference file. The vector index of a bit vector entry within the vector of bit vectors can be used to implicitly encode supplementary information, such as the number of words in a field or the data partition in which the value appears (e.g., when the census data is processed in parallel across multiple data partitions). Explicit supplementary information can be specified in additional fields associated with the bit vector, or associated with a bit vector entry within the vector of bit vectors. This supplementary information can be used to distinguish sets of records containing the value for later use.

The example of FIG. 2A illustrates the operations that the profiling module 106 performs on one or more data sets from the data source 102 during a data profiling procedure, together with the data received and generated during that procedure. FIG. 2B is a flowchart of the procedure. Referring to FIGS. 2A and 2B, the profiling module 106 performs an indexing operation 200 to ensure that every record in each data set 201 to be profiled has an index value, which provides a well-defined location for each record that can be referenced by the location information to be generated. For example, the index values for a particular data set can be incrementing integers (e.g., starting at 1 and incrementing by 1) assigned to each record based on row number in a table, position in a delimited file, storage address, primary key value, or any other unique attribute of the record. The assigned index values can be explicitly added to each record to provide indexed data sets 203, for example, by adding the value as a field of each record in the original data sets in the data source 102, or by storing them as new data sets in the data source 102 or in the profiling data store 110. Where an original data set already includes a field that can serve as an index, the indexing operation 200 can be skipped, or performed only to verify that the field is suitable for use as an index. The indexing operation 200 can include generating an index map that provides a correspondence between the index of one data set and the index of another data set, as described in more detail below.
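The indexing step can be sketched as below, under the assumption that records are dictionaries and that an existing field qualifies as an index only when every record has a distinct, non-missing value for it. The function `index_records` and the field name `record_index` are hypothetical names chosen for this illustration.

```python
def index_records(records, candidate_field=None, index_field="record_index"):
    """Ensure every record has a usable index value.

    If candidate_field holds a unique, non-missing value in every record,
    it is verified and reused as the index; otherwise consecutive integers
    starting at 1 are added under index_field.
    Returns (records, name of the field that serves as the index).
    """
    if candidate_field is not None:
        values = [r.get(candidate_field) for r in records]
        if None not in values and len(set(values)) == len(values):
            return records, candidate_field
    indexed = [{**r, index_field: i} for i, r in enumerate(records, start=1)]
    return indexed, index_field

rows = [{"id": 10, "city": "Boston"}, {"id": 11, "city": "Albany"}]
same_rows, field = index_records(rows, "id")       # existing field reused

dupes = [{"id": 1, "city": "Boston"}, {"id": 1, "city": "Albany"}]
indexed, new_field = index_records(dupes, "id")    # new index field added
```

The sketch leaves the original records unmodified and returns new dictionaries, which loosely corresponds to storing the indexed data as a new data set rather than altering the source.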

The profiling module 106 performs a census operation 205 that computes census data with location information for each of a selected set of domains. In this example, each domain is a single field. Thus, in this example, the result of the census operation 205 is multiple sets of census data 207, each for a particular field of a particular data set. Each data set may have a set of fields that have been designated for profiling, or by default all fields of each data set may be profiled. In other examples, a domain can be a fragment of a field, or a combination of multiple fields or fragments of fields. Each set of census data for a particular domain (also called a "census") contains a list of entries that include a distinct value appearing in the domain, a count of the number of records in which that particular value appears, and the associated location information identifying where that particular value appears. In some implementations, the count is not explicitly included in the census, since it can be derived from the location information when needed (e.g., the sum of the bits in a bit vector locating the records in which the value appears yields the number of records in which the value appears). In some implementations, the profiling module 106 accumulates additional information to augment the location information, such as information that qualifies or characterizes the location of the value within the domain.

The profiling module 106 receives input (e.g., user input) specifying a desired multi-domain data profiling result. The input may be received a long time after the statistics data sets 207 have been generated. The input also specifies (explicitly or implicitly) the multiple domains involved in the computation. To compute the multi-domain data profiling result, the profiling module 106 selects a collection 209 of statistics for the respective fields of the specified domains. One type of multi-domain data profiling result is a "combined statistic," which specifies the unique tuples of values that appear together in the respective fields of the same record (i.e., at the same index). Other types of multi-domain data profiling results include functional dependency results and segmentation results, each of which begins with the computation of a combined statistic, as described in more detail below.

Optionally, the profiling module 106 can sort the collection 209 so that the statistics, and the entries within each statistic, are in an order that makes the computation of the current or later multi-domain data profiling results more efficient. In this example, each statistic is sorted (210) so that its entries are in descending order by count, with the most common values first. Additionally, in some embodiments, the entries can be sub-sorted by location information, so that a value whose first appearance in the dataset is earlier than that of another value with the same count appears earlier in the sorted statistic. This gives the entries of a statistic a well-defined ordering because, for two different values, the first appearance of one value (i.e., its smallest record index) must differ from the first appearance of the other. The collection 209 of statistics can also be sorted (220) by the number of distinct values in each statistic, and further sorted in descending order by the count of the most common value. This sorting produces a sorted collection 225 of sorted statistics, in which shorter statistics (in terms of the number of distinct values) appear earlier and, for two statistics with the same number of distinct values, the one whose most common value has the larger count appears earlier. For multi-domain data profiling results corresponding to functional dependency, a functional dependency is more likely to exist between fields with a relatively small number of distinct values. As the number of distinct values increases, the field is more likely to represent a unique attribute, such as a primary key, or an attribute with few repeated values, which may be spuriously correlated with other values. By ordering shorter statistics before longer ones, the fields more likely to be relevant to functional dependency analysis are processed sooner. In some cases, a stopping condition can even be identified, at which point the result can be computed without continuing to process all entries of all statistics in the sorted collection 225. For multi-domain data profiling results other than functional dependency, the full sorted collection 225 may need to be processed to compute the result, in which case sorting may not be needed. In this example, sorting 210 occurs before sorting 220, but in other examples sorting 220 can occur before sorting 210.
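
The two sort orders described above can be sketched as follows. This is an illustrative fragment only: the dictionary layout, the function names, and the sample statistics are assumptions, and each index list is assumed to be in ascending order so that its first element is the first appearance of the value.

```python
def sort_entries(statistic):
    """Sort 210: entries by count descending, then by first record index ascending."""
    return sorted(statistic.items(), key=lambda item: (-len(item[1]), item[1][0]))


def sort_statistics(statistics):
    """Sort 220: statistics by number of distinct values, then by top count descending."""
    sorted_stats = {name: sort_entries(stat) for name, stat in statistics.items()}
    return sorted(
        sorted_stats.items(),
        key=lambda item: (len(item[1]), -len(item[1][0][1])),
    )


# Hypothetical statistics: field name -> {distinct value: ascending record indices}.
statistics = {
    "g": {"a": [2, 3], "d": [1, 4], "b": [5], "c": [6]},
    "f": {"p": [2, 3, 6], "q": [1, 4, 5]},
}
ordered = sort_statistics(statistics)
```

In this sample the f statistic, with two distinct values, sorts ahead of the g statistic, with four, and ties on count within a statistic are broken by first appearance.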

The profiling module 106 performs a combined statistic generation operation 230, in which the entries of the statistics in the sorted collection 225 are read sequentially and combined with information from other statistic entries to produce a combined statistic 240. To combine information from different statistic entries efficiently, the profiling module 106 uses the location information to locate the records in the indexed dataset 203 that are relevant to the generation of the combined statistic 240, as described in more detail below. Multiple passes of the combined statistic generation operation 230 can be performed. For example, if the tuples of the combined statistic 240 include values from more than two fields, then when building the entry for such a tuple in the combined statistic 240, the profiling module 106 can combine two fields pairwise in a first pass, and in later passes a statistic entry can be combined with any entry in a previously formed version of the combined statistic 240.

Referring to FIG. 3, an example of the profiling procedure is described. An A source dataset 300 is indexed (in the indexing operation 200) to provide an A index dataset 310. The A source dataset 300 has three fields, corresponding to the three illustrated columns, and six records, corresponding to the six illustrated rows. The first record in the dataset has the values "d," "q," and "d8" in the three fields, respectively. A surrogate key field, holding an increasing integer value, is added at the start of each record as a location index that uniquely identifies each record in the A index dataset 310. In this example, the profiling module 106 computes a sorted A statistics set 320 for the first two fields of the A source dataset 300. The first field (i.e., the first column) is named "g" and the second field (i.e., the second column) is named "f". The name of the third field is irrelevant in this example because that field is not profiled. There are therefore two statistics in the A statistics set: one for the g field (called the g-statistic) and one for the f field (called the f-statistic). Each statistic in the A statistics set 320 includes a sorted list of entries, where each entry includes a value, a count of the number of records in which the value appears, and a record index vector representing the location information for the value. Thus, for the f-statistic, the first entry, shown in this example as the space-delimited string "q 3 A[1,4,5]", indicates that the value "q" appears 3 times in the A source dataset 300, in records 1, 4, and 5 of the A index dataset 310, as represented by the vector A[1,4,5]. The statistic for each field is sorted in descending order by count, and further sorted in ascending order by the first index in the location information vector. The statistics in the A statistics set 320 are also ordered by the number of distinct values in each statistic, with the shortest statistic first. In this example, that places the statistic for the f field before the statistic for the g field.
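
The indexing and statistic steps of this example can be sketched in Python as shown below. Only the first record ("d", "q", "d8") and the f-statistic entry "q 3 A[1,4,5]" come from the description; the remaining sample records, the tuple layout, and the function names are hypothetical values chosen to be consistent with those facts.

```python
def index_records(source):
    """Prepend a surrogate key (1, 2, 3, ...) to each record as its location index."""
    return [(i, *record) for i, record in enumerate(source, start=1)]


def field_statistic(indexed, column):
    """Return sorted (value, record indices) entries for one column of the indexed data."""
    locations = {}
    for record in indexed:
        locations.setdefault(record[column], []).append(record[0])
    # Count descending, then first record index ascending.
    return sorted(locations.items(), key=lambda kv: (-len(kv[1]), kv[1][0]))


def format_entry(value, indices, label="A"):
    """Render an entry as a space-delimited string such as 'q 3 A[1,4,5]'."""
    return f"{value} {len(indices)} {label}[{','.join(map(str, indices))}]"


# Hypothetical source records (g, f, third field); only the first row is from the text.
source = [
    ("d", "q", "d8"), ("a", "p", "x1"), ("a", "p", "x2"),
    ("d", "q", "x3"), ("b", "q", "x4"), ("c", "p", "x5"),
]
indexed = index_records(source)
g_statistic = field_statistic(indexed, 1)
f_statistic = field_statistic(indexed, 2)
first_f_entry = format_entry(*f_statistic[0])
```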

The profiling module 106 performs the combined statistic generation operation to compute a combined statistic 330. In this example, a tuple in the combined statistic is a pair of values. The first entry in the combined statistic, shown as the space-delimited string "g f d q 2 3 2 A[1,4]", indicates that the first field of the pair is "g", the second field of the pair is "f", the first field value is "d", the second field value is "q", the first value appears 2 times, the second value appears 3 times, and there are 2 records in which the first field holds the first value (i.e., a g value of "d") and the second field holds the second value (i.e., an f value of "q"). These are records 1 and 4 of the A index dataset 310, as represented by the vector "A[1,4]". The combined statistic 330 can then be used to compute a variety of data profiling analysis results, as described in more detail below.

In some embodiments, results based on the combined statistic can be displayed graphically in a user interface, such as the functional dependency result 340 in the example above. Each circle contains a distinct value and sits below the label of the field "g" or "f", and the count for the value is shown beside the circle (to the left for the g field and to the right for the f field). Each directed edge between circles indicates the pair of values at its two ends, and the count above the edge is the number of records that share that pair. From the various counts, an assessment of the correlation of individual values and of the pair of fields can be determined and displayed by the profiling module 106. In this example, the profiling module 106 displays the assessment "g determines f".

In some embodiments, the combined statistic generation operation 230 can produce a combined statistic of the pairs of distinct values that appear together (i.e., in the same record) in two fields, without forming the Cartesian product of the distinct values that appear in the individual statistics of those fields. The Cartesian product could be used to compute the combined statistic, for example by obtaining the location information for every pair of values formed by the Cartesian product and computing the intersection of the associated location information to locate the records that share both values. However, a procedure that uses the entire Cartesian product may be inefficient, because the location information of many pairs may not overlap. The flowchart of FIG. 4 and the schematic diagram of FIG. 5 describe steps that can be used for the combined statistic generation operation 230, which efficiently identifies and combines those pairs whose location information does overlap and avoids the pairs whose location information does not. In this example, the location information is shown and described as vectors of record indices, but the computations on the vectors can be performed using other representations of location information. For example, for f-value location information "f-A[]" and g-value location information "g-A[]", the intersection of "f-A[]" and "g-A[]" can be obtained by representing both vectors as bit vectors and performing a logical AND of the two bit vectors.
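
The bit-vector form of the intersection can be sketched as follows, using Python integers as bit vectors. The helper names are assumptions for illustration, and the index vectors are the ones used for the values "q" and "d" in this example.

```python
def to_bits(indices):
    """Convert a record index vector (1-based) into a bit vector held in an int."""
    bits = 0
    for i in indices:
        bits |= 1 << (i - 1)
    return bits


def to_indices(bits):
    """Convert a bit vector back into an ascending record index vector."""
    return [i + 1 for i in range(bits.bit_length()) if (bits >> i) & 1]


f_q = to_bits([1, 4, 5])   # location information for f value "q"
g_d = to_bits([1, 4])      # location information for g value "d"

shared = f_q & g_d         # records holding both values: f-A[] AND g-A[]
remaining = f_q & ~g_d     # records with this f value but another g value
```

The same AND and AND-NOT operations supply the shared set and the remaining set without enumerating pairs that never occur together.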

The steps in the flowchart of FIG. 4 begin after the A statistics set 320 has been prepared, as described above, for the fields f and g, and iterate over the sorted entries of the first statistic (the f-statistic in this example). The profiling module 106 reads (400) the next entry of the f-statistic (i.e., the first entry for the first iteration). Using the location information associated with the f value of the current entry, the first record in which that value appears is looked up (410) in the A index dataset 310 to find the paired g value that appears in the g field of that record. The profiling module 106 retrieves (420) the location information associated with that g value from the g-statistic. The resulting g-value location information (g-A[]) is combined with the f-value location information (f-A[]) to store (430) information identifying all records that share the pair (f-A[] AND g-A[]) and to store (440) updated location information for the remaining set of records that have the current f value but a different g value (f-A[] = f-A[] AND (NOT g-A[])). The pair of values is written (450) to the combined statistic 330. The location information of the remaining set is checked (460) to determine whether it is empty. If it is not empty, the first record in the remaining set is looked up (410) to find another g value paired with the current f value in the A index dataset 310. If the location information of the remaining set is empty, the next entry is read from the f-statistic for another iteration. After all entries in the f-statistic have been read and processed by complete iterations of these steps, the combined statistic 330 is complete.
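
The loop of FIG. 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the described implementation: location information is held as Python sets rather than bit vectors, the record layout and function name are hypothetical, and the sample data is invented apart from the values "q", "d", and "b" and their index vectors.

```python
def combine(records, f_stat, g_stat, g_field="g"):
    """Build combined entries (g value, f value, g count, f count, pair count, indices)."""
    g_locations = {value: set(indices) for value, indices in g_stat}
    combined = []
    for f_value, f_indices in f_stat:                 # read next f entry (400)
        remaining = set(f_indices)
        while remaining:                              # check remaining set (460)
            first = min(remaining)
            g_value = records[first][g_field]         # look up paired g value (410)
            g_indices = g_locations[g_value]          # retrieve g locations (420)
            shared = remaining & g_indices            # f-A[] AND g-A[] (430)
            remaining = remaining - g_indices         # f-A[] AND (NOT g-A[]) (440)
            combined.append(                          # write the pair (450)
                (g_value, f_value, len(g_indices), len(f_indices),
                 len(shared), sorted(shared))
            )
    return combined


# Hypothetical indexed records and their sorted statistics.
records = {
    1: {"g": "d", "f": "q"}, 2: {"g": "a", "f": "p"}, 3: {"g": "a", "f": "p"},
    4: {"g": "d", "f": "q"}, 5: {"g": "b", "f": "q"}, 6: {"g": "c", "f": "p"},
}
f_stat = [("q", [1, 4, 5]), ("p", [2, 3, 6])]
g_stat = [("d", [1, 4]), ("a", [2, 3]), ("b", [5]), ("c", [6])]
combined = combine(records, f_stat, g_stat)
```

Each pass through the inner loop removes at least the looked-up record from the remaining set, so the work is proportional to the number of pairs that actually occur rather than to the size of the Cartesian product.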

FIG. 5 shows an example of the procedure of FIG. 4, with arrows indicating the sequence of operations performed by the profiling module 106 up to the point at which a partially completed combined statistic 330' has been reached. The entries of the f-statistic in the A statistics set 320 are considered sequentially, starting with the first entry, which has the f value "q"; based on the earlier sorting, "q" is the most common f value. The first element "1" of the location information vector is used to look up (522) the g field value of the corresponding record in the A index dataset 310. In this example, the value of the g field in the corresponding record (i.e., the record at index position 1) is "d". The g-statistic entry for the value "d" is looked up to retrieve the full set of records having the value "d", as represented by its location information vector. The location information vector for the f value "q" is compared with the location information vector for the g value "d" to form the location information for two sets of records: a set 527 of records having the paired values "q" and "d" (g AND f), and a set 528 of records having the value "q" but not the value "d" (f AND NOT g). The location information for the shared values is stored (529) as a vector in the appropriate entry of the combined statistic 330'. In this example, an entry in the combined statistic holds the source of each value, each value and its count in its own statistic, the location information for all records that contain both values, and the number of such records. The entry "g f d q 2 3 2 A[1,4]" indicates that the g value "d" appears 2 times, the f value "q" appears 3 times, and 2 records have both values, corresponding to location indices 1 and 4.

The profiling module 106 then processes the remaining set 528 of records having the value "q". The first (and only) element "5" of the location information vector is used to look up (540) the g field value of the corresponding record in the A index dataset 310, which yields the value "b" from the g field. The g-statistic entry for the value "b" is looked up (544) to retrieve the location information A[5], which is compared with the set 528. The intersection of these sets is computed (546) and stored in the appropriate entry of the combined statistic 330'. The difference between the sets is empty in this example; if it were not, the procedure would iterate to find the other values of the g field that are paired with this particular value "q" of the f field. When the set of records having the f value "q" is exhausted, the procedure moves to the next entry in the f-statistic and repeats the process.

Referring to FIGS. 6A and 6B, the profiling module 106 can perform a procedure that determines the correlation (functional dependency) between pairs of field values and augments the combined statistic with correlation scores that quantify the functional dependency result. The combined statistic for the pair of fields to be analyzed for a potential functional dependency is computed (600) from the individual statistics of the fields, as described above. For each value in each pair of values contained in an entry of the combined statistic, a "correlation score" is computed (610), which is the ratio of the number of records having that pair of values (the "pair count") to the number of records having that value (the "value count"). The profiling module 106 stores the correlation scores ("1" and "2/3" in this example) in the entry to produce an augmented combined statistic 615. For example, the combined statistic entry "g f d q 2 3 2 A[1,4]" indicates that, in the A index dataset 310, 2 records have the value "d" in field g, 3 records have the value "q" in field f, and 2 records, at indices 1 and 4, have the paired values "d q". The correlation score for the g value "d" (paired with "q") is 2/2 = 1, and the correlation score for the f value "q" (paired with "d") is 2/3.
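
The correlation score computation for a single entry can be sketched as follows. The tuple layout and function name are assumptions for illustration; exact fractions are used so that the scores 1 and 2/3 from the example are reproduced without rounding.

```python
from fractions import Fraction


def correlation_scores(entry):
    """Return (g score, f score) as pair count divided by each value count."""
    _g_value, _f_value, g_count, f_count, pair_count, _locations = entry
    return Fraction(pair_count, g_count), Fraction(pair_count, f_count)


# Entry "g f d q 2 3 2 A[1,4]" as (g value, f value, g count, f count, pair count, indices).
entry = ("d", "q", 2, 3, 2, [1, 4])
g_score, f_score = correlation_scores(entry)
```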

The correlation score for each value is compared with a threshold to determine (620) which values are correlated at that threshold. For example, if the correlation score exceeds a threshold of 0.95, fewer than 5 in 100 instances of the value are paired with a value other than the current paired value. Here, "d" is correlated with "q" at the 0.95 threshold, but not the reverse: if the g value is "d", the corresponding f value is certain to be "q", but if the f value is "q", the chance that the corresponding g value is "d" is only 2/3.

The total number of records (in one of the fields) that are correlated at a given threshold can be counted (630) and divided by the total number of records to determine (640) the fraction of records in the entire dataset that are correlated at the given threshold. If this fraction exceeds a second, field correlation threshold, the entire field can be considered correlated with the other field. In some embodiments, the records counted toward the determination of field correlation can exclude those whose value has fewer instances than a threshold, or correlations based on such values can be reported as potentially spurious. This is because, if the number of instances is too small, the correlation may be accidental or trivial; for example, a value with only one instance is trivially correlated with its paired value.

The field correlation calculation is repeated with the other field to determine the fraction of records correlated in the other direction. In this example, every g value is individually correlated with an f value, so the total number of correlated records is 6, the total number of records is 6, and the fraction of correlated records is 6/6 = 1. The conclusion is that g determines f: knowing the g value guarantees knowledge of the corresponding f value (at the field correlation threshold). By contrast, if no f value exceeds the correlation threshold, the f field is not correlated with the g field.
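
The field-level calculation in both directions can be sketched as follows. This is an illustration under assumptions: the combined entries are hypothetical apart from the pairs "d q" and "b q", the function name and `side` parameter are invented, and a value's correlated records are taken to be the records holding the pair whose score exceeds the threshold.

```python
def correlated_fraction(combined, total_records, side, threshold=0.95):
    """Fraction of records whose `side` value ('g' or 'f') is correlated at the threshold."""
    correlated = 0
    for _g, _f, g_count, f_count, pair_count, _locations in combined:
        value_count = g_count if side == "g" else f_count
        if pair_count / value_count > threshold:
            correlated += pair_count  # records holding this correlated pair
    return correlated / total_records


# Hypothetical combined entries: (g value, f value, g count, f count, pair count, indices).
combined = [
    ("d", "q", 2, 3, 2, [1, 4]),
    ("b", "q", 1, 3, 1, [5]),
    ("a", "p", 2, 3, 2, [2, 3]),
    ("c", "p", 1, 3, 1, [6]),
]
g_fraction = correlated_fraction(combined, 6, "g")
f_fraction = correlated_fraction(combined, 6, "f")
```

With this sample data the g direction yields 6/6 = 1, matching the conclusion that g determines f, while no f value exceeds the threshold.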

Referring to FIG. 7, an edge drill-down (700) is performed from the functional dependency result 340 to an A-query result 710 (e.g., in response to user interaction with the displayed result 340 in a graphical user interface) to display more detailed information about the records represented by an edge in the displayed result 340, where the edge indicates the pair of values that appears in those records. The pair of values "d q" represented by the edge is looked up (725) in the combined statistic 330'' (which is the same as the combined statistic 330, sorted by frequency of occurrence and further sorted by the first index of the location information) to obtain the location information for the pair. The location information is then used to extract (735) the associated records from the A index dataset 310. These records are displayed (745) in the A-query result 710.
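
The edge drill-down lookup can be sketched as follows. The tuple layouts and function name are assumptions for illustration, and the sample records are hypothetical apart from the first record and the pair "d q" at indices 1 and 4.

```python
def edge_drill_down(combined, indexed, g_value, f_value):
    """Return the indexed records that share the (g value, f value) pair of an edge."""
    for entry in combined:
        if entry[0] == g_value and entry[1] == f_value:   # look up the pair (725)
            wanted = set(entry[5])                        # its location information
            return [record for record in indexed if record[0] in wanted]  # extract (735)
    return []


# Hypothetical combined entries and indexed records (index, g, f, third field).
combined = [("d", "q", 2, 3, 2, [1, 4]), ("b", "q", 1, 3, 1, [5])]
indexed = [
    (1, "d", "q", "d8"), (2, "a", "p", "x1"), (3, "a", "p", "x2"),
    (4, "d", "q", "x3"), (5, "b", "q", "x4"),
]
query_result = edge_drill_down(combined, indexed, "d", "q")
```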

In the preceding example, the correlation between two fields (g and f) in the same dataset was computed. The computation of the correlation between two fields in different datasets that are linked by a key field is shown in FIGS. 8A and 8B. An A source dataset 800 and a B source dataset 820 each have three fields, and they have one key field in common. The key values of the common key field are not necessarily unique. However, the key values are used to link corresponding records in the two datasets through the same key value in their respective key fields. A unique identifier for each record in the A source dataset 800 (called A-record_id) is added to each record as a field to produce an A index dataset 810. Similarly, a unique identifier for each record in the B source dataset 820 (called B-record_id) is added to each record as a field to produce a B index dataset 830. An index map 840 associates each A-record_id with the key value in the key field of the same record. The index map 840 is therefore a copy of the first two columns of the A index dataset 810. The index map 840 can be stored separately from the A index dataset 810, for example in a file in the profiling data store 110.

In this example, the key field is the primary key of the A source dataset 800 (shown in the first column of the A source dataset 800 in FIG. 8A) and a foreign key of the B source dataset 820 (shown in the second column of the B source dataset 820 in FIG. 8A). The A-record_id values can be chosen for mapping to the key field (rather than the B-record_id values) because the key field is the primary key of that dataset. However, the datasets do not need to have such a primary key/foreign key relationship, as long as some field in each dataset is designated as the key field. The index map 840 is useful because there may be duplicate key values in the A dataset, as in this example, in which both datasets have a key field value repeated in two different records. Using the index map 840, the profiling module 106 generates a new version of the B index dataset 830 with a new field containing A-record_id values, so that the two datasets have a common frame of reference for specifying location information. To do so, the profiling module 106 compares the key field values in the B index dataset 830 with the key values in the index map 840 to find matches to any number of corresponding A-record_id values. In this example, one record in the B index dataset 830 (with the foreign key value k4) matches two different A-record_id values, so when A-record_id values are added to the records of the B index dataset 830 to produce a B/A index dataset 850, the profiling module adds (847) two corresponding records (one with "A4" added as the index and the other with "A6" added as the index). Each of the other records of the B index dataset 830 matches a single A-record_id value in the index map 840, so each corresponds to a single record added to the B/A index dataset 850, with the corresponding A-record_id value added as the index.
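
The use of the index map to re-index the B dataset can be sketched as follows. The field names, the function name, and the sample records are hypothetical; only the key value k4 matching "A4" and "A6" is taken from the example.

```python
from collections import defaultdict


def add_a_record_ids(b_indexed, index_map, key_field="key"):
    """Emit one B/A record per matching A-record_id for each B record."""
    ids_by_key = defaultdict(list)
    for a_record_id, key in index_map:
        ids_by_key[key].append(a_record_id)
    b_a_indexed = []
    for record in b_indexed:
        for a_record_id in ids_by_key.get(record[key_field], []):
            b_a_indexed.append({"A_record_id": a_record_id, **record})
    return b_a_indexed


# Hypothetical index map (A-record_id, key value) and B index dataset.
index_map = [("A1", "k1"), ("A2", "k2"), ("A3", "k3"),
             ("A4", "k4"), ("A5", "k5"), ("A6", "k4")]
b_indexed = [
    {"B_record_id": "B1", "key": "k1", "value": "p"},
    {"B_record_id": "B2", "key": "k4", "value": "p"},
    {"B_record_id": "B3", "key": "k2", "value": "q"},
]
b_a_indexed = add_a_record_ids(b_indexed, index_map)
```

In this sketch the B record with key k4 appears twice in the output, once per matching A-record_id, which gives both datasets the same frame of reference for location information.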

Referring now to FIG. 8B, an A statistic 860 is computed for the third field of the A index dataset 810, with location information that references the A-record_id values (the first field), and a B statistic 870 is computed for the fifth field of the B/A index dataset 850, with location information that also references the A-record_id values (the first field) added using the index map 840. The location information of the two statistics is combined as described above to compute a combined statistic 880 for the pair of fields represented by these statistics (the field from the A statistic is labeled "A" and the field from the B statistic is labeled "B"). The combined statistic 880 is then used to display a functional dependency result 890, which concludes that the A and B fields are not correlated.

Referring to FIG. 9, a node drill-down (900) is performed from the functional dependency result 890 (e.g., in response to user interaction with the displayed result 890 in a graphical user interface) to display more detailed information about the records represented by the value shown in a node. The node showing the value "p" from the B field is selected, and the records corresponding to that value in both the A source dataset 800 and the B source dataset 820 are extracted and displayed in a node query result 910. The drill-down is carried out by first looking up (915) the value "p" in the combined statistic 880 to find each matching entry (one whose B field value is "p"). The location information of these entries is combined by union to produce the location information A[1,3,4,6] (relative to the A-record_id values). Each of these locations is then looked up in the A index dataset 810 and the B/A index dataset 850 to extract any records at those locations. The records extracted from the A index dataset 810 are displayed (935) in the node query result 910, labeled "A". The records extracted for the B source dataset 820 are compared to find any records that have the same B-record_id value, and these are de-duplicated (945) so that only one is displayed in the node query result 910. In this example, the records with A-record_id values A4 and A6 have the same B-record_id value, B2. The duplicated key field value in the A source dataset 800 records is evident in the node query result 910 from the multiple A source dataset 800 records that correspond to a single B source dataset 820 record.
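
The node drill-down with de-duplication can be sketched as follows. The dictionary layouts, the function name, and the sample data are hypothetical, chosen so that the union of locations for "p" is A[1,3,4,6] and the records at A4 and A6 share B-record_id B2, as in the example.

```python
def node_drill_down(combined, b_value, a_indexed, b_a_indexed):
    """Return (A records, de-duplicated B records) for one B field value."""
    locations = set()
    for entry in combined:                       # look up the value (915)
        if entry["b_value"] == b_value:
            locations |= set(entry["locations"])  # union of location information
    a_records = [r for r in a_indexed if r["A_record_id"] in locations]
    seen, b_records = set(), []
    for r in b_a_indexed:                        # de-duplicate by B-record_id (945)
        if r["A_record_id"] in locations and r["B_record_id"] not in seen:
            seen.add(r["B_record_id"])
            b_records.append(r)
    return a_records, b_records


# Hypothetical combined entries and indexed datasets.
combined = [
    {"a_value": "x", "b_value": "p", "locations": [1, 3]},
    {"a_value": "y", "b_value": "p", "locations": [4, 6]},
    {"a_value": "y", "b_value": "q", "locations": [2, 5]},
]
a_indexed = [{"A_record_id": i} for i in range(1, 7)]
b_a_indexed = [
    {"A_record_id": 1, "B_record_id": "B1"}, {"A_record_id": 2, "B_record_id": "B4"},
    {"A_record_id": 3, "B_record_id": "B3"}, {"A_record_id": 4, "B_record_id": "B2"},
    {"A_record_id": 5, "B_record_id": "B5"}, {"A_record_id": 6, "B_record_id": "B2"},
]
a_records, b_records = node_drill_down(combined, "p", a_indexed, b_a_indexed)
```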

FIG. 10 shows an example of computing statistics with segmentation. An A source dataset 1000 has three fields: f, g, and h. A unique identifier for each record in the A source dataset 1000 (called record_id) is added to each record as a field to produce an A index dataset 1010. A sorted statistics set 1020 is computed, with location information for each of the three fields. In some embodiments, the system 100 can enable a user to form queries that answer business questions, such as: what is the data profile (i.e., the distribution of values) of the h field when the data is segmented by restricting the combination of the f and g fields to particular values? For example, the f field might represent gender, with records having the possible values "f" or "m" in the f field, and the g field might represent "foreign" or "domestic", with records having the values "p" or "q" in the g field. A data profile segmented on the f and g fields (e.g., represented as a statistic) helps answer questions such as: what is the most common value of the h field for a "foreign male" or a "domestic female"?

The statistics set 1020 can be used to compute the segmented profile without processing all of the records in the A source dataset 1000. Segment statistics 1030 can be constructed as combined statistics from the f field and g field statistics, using the procedure described above for computing combined statistics. In some embodiments, each entry of the segment statistics 1030 is assigned a unique value (called segment_id) to make it easy to identify the segment associated with the entry. The procedure for computing combined statistics is applied again to form the segmented combined statistics 1040, which combine the h field statistics from the A statistics set 1020 with the segment statistics 1030. For example, the h-s1 entry of the segmented combined statistics 1040 is computed by first taking the s1 entry of the segment statistics 1030 and reading its associated position information A[1,4]. The first element of the position vector, "1", is looked up in the h field statistics of the A statistics set 1020 to find the h field value "d" and its corresponding position information A[1,4]. The position information A[1,4] of the entry labeled s1 in the segment statistics 1030 is compared with the position information A[1,4] of the "d" entry of the h statistics. Because all elements of the position information match between the two entries, the resulting combined statistics entry for segment s1 is "d 2 A[1,4]", and no positions remain in the s1 segment. This indicates that segment s1 consists of only a single h field value, "d". Continuing to construct combined statistics from the other segment values and the h statistics entries yields the complete segmented combined statistics 1040.
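The step of splitting one segment's positions across the values of the h field can be sketched as below. This is a minimal sketch, not the patented procedure itself: the function name `combine` and the dictionary layout are assumptions, and the h value "b" for record 5 is invented because the text does not state it. The positions for "d", "a", and "e" follow the values quoted in this description.

```python
# h field statistics: value -> record positions (relative to record_id).
# The entry for record 5 ("b") is a hypothetical placeholder.
h_stats = {"d": [1, 4], "a": [2], "e": [3, 6], "b": [5]}

# Position information of two segments, as quoted in the description.
segment_stats = {"s1": [1, 4], "s2": [2, 6]}


def combine(segment_positions, field_stats):
    """Split a segment's positions across the values of another field.

    Returns a list of (value, count, positions) entries.
    """
    remaining = set(segment_positions)
    entries = []
    while remaining:
        first = min(remaining)
        # Look up the value whose position information contains `first`.
        value, locations = next(
            (v, locs) for v, locs in field_stats.items() if first in locs
        )
        # Keep only the positions shared by the segment and this value.
        matched = remaining & set(locations)
        entries.append((value, len(matched), sorted(matched)))
        remaining -= matched
    return entries


for segment_id, positions in segment_stats.items():
    print(segment_id, combine(positions, h_stats))
```

Each pass consumes every position that shares the looked-up value, so the loop runs once per distinct value in the segment rather than once per record.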

FIG. 11 shows an example of a segment cube computation. When segmentation is based on a combination of values in multiple fields, as in the example of FIG. 10, a segment cube can be constructed in which the segmented statistics results are re-aggregated into statistics results for segments involving each combination of fewer fields. For the example of FIG. 10, the computed segmented combined statistics 1040 represent segments such as "foreign males" (i.e., segment s4) and "domestic males" (i.e., segment s1). A user may ask for a profile of the segment "foreign" or the segment "males". Rather than going back and repeating the calculation of FIG. 10 for the new segments, the previous segmentation results are combined with the h statistics to compute the other entries of the segment cube, as described below.

To form the segment cube 1120, every subset of the original segmentation fields is formed first. In this example, the full segmentation is based on the two segmentation fields f and g. A set of two fields yields two subsets: one consisting of f alone and one consisting of g alone. Each such set is called a segment cube field. If the original segmentation fields were the three fields f, g, and h, the segment cube fields would be the sets {f,g}, {f,h}, {g,h}, {f}, {g}, and {h}; that is, the segment cube fields are the members of the collection of non-empty subsets of the segmentation fields.
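Enumerating the segment cube fields can be sketched with Python's standard `itertools.combinations`. The function name `segment_cube_fields` is an assumption for illustration. The sketch excludes the full set of segmentation fields, which matches both examples given in the text (two fields yield two subsets, three fields yield six).

```python
from itertools import combinations


def segment_cube_fields(segmentation_fields):
    """Return the non-empty subsets with fewer fields than the full set."""
    fields = list(segmentation_fields)
    subsets = []
    # Larger subsets first, mirroring the order used in the description.
    for size in range(len(fields) - 1, 0, -1):
        for combo in combinations(fields, size):
            subsets.append(set(combo))
    return subsets


print(segment_cube_fields(["f", "g"]))
print(segment_cube_fields(["f", "g", "h"]))
```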

The entries of the segment cube 1120 consist of each distinct value (or combination of values) associated with each segment cube field. In some embodiments, for each value of a segment cube field, the set of segments containing that value is determined and stored as segment position information in the data structure holding the segment cube 1120, together with a count of those segments. An alternative is to combine, by union, the position information (relative to the record_id added to the A index dataset 1010) of each corresponding entry of the segment statistics 1030 (called a segment-statistics entry). Using segment position information (i.e., position information relative to segment_id) rather than A position information (i.e., position information relative to record_id) can be more efficient because there are typically far fewer segments than records, so the position information is more compact. In some embodiments, a field is added to each entry of the segment cube 1120 to label the entry.

In this example, the segment cube field f has the value "f" in segments s1 and s2, so the associated segment position information is S[s1,s2]. This forms the segment cube entry "c1 f f 2 S[s1,s2]". Here, c1 is the label of the segment cube entry, the first f is the segment cube field, and the second f is its value. The value appears in two segments, identified by the position information S[s1,s2]. Similarly, the segment cube field g has the value "q" in segments s1 and s4, so the associated segment position information is S[s1,s4]. This forms the segment cube entry "c4 g q 2 S[s1,s4]".
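Building segment cube entries from the segment definitions can be sketched as follows. The layout of `segments` and the function name `build_cube_entries` are assumptions. The values for s1, s2, and s4 follow the f and g values attributed to those segments in this passage and in the discussion of S[s2,s4]; the definition of s3 is hypothetical, and the entry labels (c1, c4, and so on) are left out.

```python
# Segment definitions: segment_id -> value of each segmentation field.
# s3 is a hypothetical placeholder; it is not described in the text.
segments = {
    "s1": {"f": "f", "g": "q"},
    "s2": {"f": "f", "g": "p"},
    "s3": {"f": "m", "g": "p"},
    "s4": {"f": "m", "g": "q"},
}


def build_cube_entries(segments, cube_field):
    """Return (field, value, segment count, segment ids) for each value."""
    segments_by_value = {}
    for segment_id, values in segments.items():
        # Record every segment in which this field takes this value.
        segments_by_value.setdefault(values[cube_field], []).append(segment_id)
    return [
        (cube_field, value, len(ids), ids)
        for value, ids in segments_by_value.items()
    ]


print(build_cube_entries(segments, "f"))
print(build_cube_entries(segments, "g"))
```

The segment position lists stay short because they index segments rather than records, which is the compactness advantage described above.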

The segmented combined statistics 1040 are combined with the segment cube 1120 to form the segment-cube A statistics set 1150 by the following procedure. Each entry of the segment cube 1120 contains segment position information that identifies which segments contain the associated segment cube field value. The union of the sets of segment-statistics entries in each referenced segment gives the set of statistics entries having that segment cube field value. For example, the c1 segment cube entry has segment position information S[s1,s2]. The c1 segment result is obtained by taking the union of the sets of statistics entries in the s1 and s2 results of the segmented combined statistics 1040. The s1 segment consists of the single entry "d 2 A[1,4]", while the s2 segment consists of the two entries "a 1 A[2]" and "e 1 A[6]". The union of these is the set of all three entries, which forms the c1 segment of the h statistics in the segment-cube A statistics set 1150. As the segment cube shows, the c1 segment consists of the records whose f field has the value "f". The c1 segment of the h statistics is therefore the h statistics segmented to records whose f field has the value "f". This can be verified by inspecting the A statistics set 1020 in FIG. 10: the f field has the value "f" in records A[1,2,4,6], while the h field has the value "d" in records A[1,4], the value "a" in record A[2], and the value "e" in record A[6].

If a value appears in more than one segment, the A position information of the result is obtained by taking the union of the A position information from each segment. This does not happen in the illustrated segment cube 1120, but if there were a segment cube entry with segment position information S[s2,s4], then because both segment s2 and segment s4 contain the h value "e", the A position information for the h value "e" in that segment cube statistics result would be A[3,6], the union of A[6] from the s2 segment and A[3] from the s4 segment.
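The union of per-segment statistics entries described above can be sketched as follows. The dictionary `segmented_h_stats` and the function name `cube_stats` are assumptions. The entries for s1 and s2 are those quoted in the text; s4 is assumed to contain only the "e" entry at A[3], since no other s4 entries are described.

```python
# Segmented combined statistics for the h field:
# segment_id -> {h value -> record positions}.
segmented_h_stats = {
    "s1": {"d": [1, 4]},
    "s2": {"a": [2], "e": [6]},
    "s4": {"e": [3]},  # assumed to be the only entry in s4
}


def cube_stats(segment_ids):
    """Union the statistics entries of the referenced segments."""
    merged = {}
    for segment_id in segment_ids:
        for value, locations in segmented_h_stats[segment_id].items():
            # A value seen in several segments gets the union of positions.
            merged.setdefault(value, set()).update(locations)
    return {value: sorted(locations) for value, locations in merged.items()}


print(cube_stats(["s1", "s2"]))  # the c1 entry, S[s1,s2]
print(cube_stats(["s2", "s4"]))  # the hypothetical entry S[s2,s4]
```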

A segment cube entry such as S[s2,s4] is an example of an entry in a more general segment cube formed from combinations of segments rather than simply combinations of fields. In some embodiments, such segment combinations are allowed. In such an entry, the allowed segmentation field values correspond to the values associated with each segment in the selected combination of segments. In this example, the segment cube entry S[s2,s4] corresponds to a segment in which the value of the f-g field pair is either "f p" or "m q". This makes it possible to form complex segments in which conditional combinations of fields and field values are allowed.

Multi-field validation rules can be computed from statistics with position information without processing all of the records in the original dataset. A multi-field validation rule applies conditions to the values of two or more fields, and a record must satisfy the conditions simultaneously to be considered valid. Records that do not satisfy the conditions are considered invalid. An example of a multi-field validation rule is: if the f value (gender) is "f", then the g value (foreign/domestic) must be "p". In some embodiments, validation rules are expressed in negative form; that is, a rule gives a combination of two or more field values that, if satisfied, marks the record as invalid. In this example, the rule identifying invalid records could be: if the f value is "f" and the g value is not "p", the record is invalid.

A data quality report may include counts of valid and invalid records for one or more validation rules. If the validation rules are specified before the initial statistics are collected, they can be checked while the statistics are being collected, and the associated valid and invalid records counted. Often, however, validation rules are proposed after the initial statistics, in response to values and combinations of values uncovered by the statistics. In that case, rather than collecting the statistics again and applying the new validation rule, the statistics with position information can be used to identify the valid and invalid records without recomputing the statistics. Because a multi-field validation rule is expressed as a conditional combination of field values, the statistics entries corresponding to each value in the rule can be combined, typically by applying Boolean operations to their position information, to check the rule. Any combination of values found to be invalid can be marked as invalid and counted toward the set of invalid records. The position information can also be used to drill down to the specific records that are valid or invalid under the validation rule.

Consider the validation rule: if the f value is "f", then the g value must be "p". The A statistics set 1020 can be used to evaluate the Boolean expression 'f = "f" and g = "p"'. The position information for records with f = "f" is A[1,2,4,6], and the position information for records with g = "p" is A[2,3,6]. The valid records are given by the intersection of these two sets of position information, A[2,6]. The invalid records are those satisfying the Boolean expression 'f = "f" and g != "p"', which are located by the vector A[1,4]. The resulting position information can then be used to retrieve the valid or invalid records. For example, records 2 and 6 can be retrieved from the A index dataset 1010 to return the two records having f = "f" and g = "p".
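The Boolean evaluation of this rule on position information can be sketched with Python sets. The function name `check_rule` is an assumption for illustration; the two position sets are the ones quoted above. Set intersection plays the role of "and", and set difference plays the role of "and not".

```python
# Position information taken from the statistics entries quoted above.
f_is_f = {1, 2, 4, 6}  # records with f = "f"
g_is_p = {2, 3, 6}     # records with g = "p"


def check_rule(antecedent, consequent):
    """Evaluate 'if antecedent then consequent' on position sets.

    Returns (valid, invalid) positions among records matching the antecedent.
    """
    valid = antecedent & consequent    # f = "f" and g = "p"
    invalid = antecedent - consequent  # f = "f" and g != "p"
    return sorted(valid), sorted(invalid)


valid, invalid = check_rule(f_is_f, g_is_p)
print("valid:", valid)
print("invalid:", invalid)
```

Records that do not match the antecedent (here, record 3) are not classified by this sketch, since the rule only constrains records with f = "f".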

The techniques described above can be implemented using a computing system executing suitable software. For example, the software may include procedures in one or more computer programs that execute on one or more programmed or programmable computing systems (which may be of various architectures, such as distributed, client/server, or grid), each including at least one processor, at least one data storage system (including volatile and/or non-volatile memory and/or storage elements), and at least one user interface (for receiving input using at least one input device or port, and for providing output using at least one output device or port). The software may include one or more modules of a larger program that, for example, provides services related to the design, configuration, and execution of dataflow graphs. The modules of the program (e.g., elements of a dataflow graph) can be implemented as data structures or other organized data conforming to a data model stored in a data repository.

The software may be provided on a tangible, non-transitory medium, such as a CD-ROM or other computer-readable medium (e.g., readable by a general or special purpose computing system or device), or delivered (e.g., encoded in a propagated signal) over a communication medium of a network to a tangible, non-transitory medium of a computing system where it is executed. Some or all of the processing may be performed on a special purpose computer, or using special purpose hardware such as coprocessors, field-programmable gate arrays (FPGAs), or application-specific integrated circuits (ASICs). The processing may be implemented in a distributed manner in which different parts of the computation specified by the software are performed by different computing elements. Each such computer program is preferably stored on or downloaded to a computer-readable storage medium (e.g., solid-state memory or media, or magnetic or optical media) of a storage device accessible by a general or special purpose programmable computer, for configuring and operating the computer when the storage medium or device is read by the computer to perform the processing described herein. The inventive system may also be considered to be implemented as a tangible, non-transitory medium configured with a computer program, where the medium so configured causes a computer to operate in a specific and predefined manner to perform one or more of the processing steps described herein.

A number of embodiments of the invention have been described. Nevertheless, it is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the following claims. Accordingly, other embodiments are also within the scope of the following claims. For example, various modifications may be made without departing from the scope of the invention. Additionally, some of the steps described above may be order independent and thus can be performed in an order different from that described.

Claims

1. A method for profiling data stored in at least one data storage system, the method comprising:

accessing, via an interface connected to the data storage system, at least one set of records stored in the data storage system; and

processing the set of records to generate result information characterizing values appearing in one or more specified fields of the set of records, the processing including:

generating, for a first set of distinct values appearing in a first set of one or more fields of the records in the set, corresponding location information that identifies, for each distinct value in the first set of distinct values, every record in which the distinct value appears;

generating, for the first set of one or more fields, a corresponding list of entries, each entry identifying a distinct value from the first set of distinct values and the location information for that distinct value;

generating, for a second set of one or more fields of the records in the set that is different from the first set of one or more fields, a corresponding list of entries, each entry identifying a distinct value from a second set of distinct values appearing in the second set of one or more fields;

generating the result information characterizing values appearing in the one or more specified fields of the set of records based at least in part on: locating at least one record in the set of records using the location information for at least one value appearing in the first set of one or more fields; and determining at least one value appearing in the second set of one or more fields of the located record; and

generating, for the second set of distinct values, corresponding location information that identifies, for each distinct value in the second set of distinct values, every record in which the distinct value appears;

wherein, for the list corresponding to the second set of one or more fields, each entry identifying a distinct value from the second set of distinct values includes the location information for that distinct value, and

wherein the processing further includes: generating, for a set of distinct pairs of values, corresponding location information, wherein a first value in each pair appears in the first set of one or more fields of the records and a second value in each pair appears in the second set of one or more fields of the records, the location information identifying, for each distinct pair of values, every record in which the distinct pair of values appears.

2. The method of claim 1, wherein each entry further identifies a count of the number of records in which a distinct value appears in a set of one or more fields.

3. The method of claim 2, wherein the processing further comprises sorting the entries in each list by the identified count.

4. The method of claim 1, wherein generating location information for a distinct pair of values from the set of distinct pairs of values comprises: determining the intersection between location information for a first distinct value from the first set of distinct values and location information for a second distinct value from the second set of distinct values.

5. The method of claim 4, wherein determining the intersection comprises: using the location information for the first distinct value to locate a record in the set, and using the located record to determine the second distinct value.

6. The method of claim 1, further comprising: sorting a plurality of lists according to the number of distinct values identified in the entries of each list, the plurality of lists including the list corresponding to the first set of one or more fields and the list corresponding to the second set of one or more fields.

7. The method of claim 1, wherein the processing further comprises:

generating, for a set of distinct pairs of values, corresponding location information, wherein a first value in each pair appears in the first set of one or more fields of the records, and a second value in each pair appears in the second set of one or more fields of the records, which is different from the first set of one or more fields, the location information identifying, for each distinct pair of values, every record in which the distinct pair of values appears; and

generating, for the set of distinct pairs of values, a corresponding list of entries, each entry identifying a distinct pair of values from the set of distinct pairs of values and the location information for that distinct pair of values.

8. The method of claim 1, wherein the location information identifies a unique index value for every record in which a distinct value appears.

9. The method of claim 8, wherein the location information identifies a particular unique index value by storing that unique index value.

10. The method of claim 8, wherein the location information identifies a unique index value by encoding the unique index value within the location information.

11. The method of claim 10, wherein encoding the unique index value comprises storing a bit at a location within a vector corresponding to the unique index value.

12. The method of claim 1, wherein the set comprises a first subset of records and a second subset of records, the first subset of records having fields including one or more fields of the first group, and the second subset of records having fields including one or more fields of the second group.

13. The method of claim 12, wherein the processing further comprises generating information that provides a mapping between: 1) a field index value of the first subset of records, which associates a unique index value with every record in the first subset of records; and 2) a key value of a field of the second subset of records, which associates a key value with every record in the second subset of records; wherein the key value associates a record in the second subset of records with a record in the first subset of records.

14. The method of claim 13, wherein the location information identifies a unique index value for every record in which a distinct value appears.

15. A computer-readable medium storing a computer program for profiling data stored in at least one data storage system, the computer program comprising instructions for causing a computing system to perform the method of any one of claims 1-14.

16. A computing system for profiling data stored in at least one data storage system, the computing system comprising:

an interface connected to the data storage system and configured to access at least one set of records stored in the data storage system; and

at least one processor configured to process the set of records to generate result information characterizing values appearing in one or more specified fields of the set of records, the processing including:

generating, for a second set of one or more fields that is different from the first set of one or more fields, a corresponding list of entries, each entry identifying a distinct value from the distinct values appearing in the second set of one or more fields.

17. A computing system for profiling data stored in at least one data storage system, the computing system comprising:

means for accessing at least one set of records stored in the data storage system; and

means for performing the method of any one of claims 1-14.