CN111400305A - Traceable and visualized method based on feature engineering lineage

Info

Publication number: CN111400305A
Application number: CN202010103932.4A
Authority: CN
Inventors: 柴磊; 许靖; 许灿杰
Original assignee: Shenzhen Magic Digital Intelligent Artificial Intelligence Co ltd
Current assignee: Shenzhen Magic Digital Intelligent Artificial Intelligence Co ltd
Priority date: 2020-02-20
Filing date: 2020-02-20
Publication date: 2020-07-10
Anticipated expiration: 2040-02-20
Also published as: CN111400305B

Abstract

The invention discloses a traceable, visualizable method based on feature engineering lineage (literally "blood relationship"). The method comprises: step one, building the lineage of database wide-table integration; step two, building the lineage of data preprocessing; step three, integrating the two lineages; and step four, building visual interaction, in which a traceable, multi-level visual interface is built on the complete lineage. The invention addresses the difficulty, in current industry practice, of tracing feature engineering result variables back to their origin and of locating the source of erroneous data. It also addresses the disconnect between feature engineering experiments and production, where variable processing is hard to reproduce in a production environment, so that the whole process can be reproduced quickly from the lineage structure and applied quickly to production.

Description

Traceable and visualized method based on feature engineering lineage

Technical Field

The present invention belongs to the technical field of data processing by artificial intelligence, and in particular relates to using a computer to construct feature engineering lineage and to build traceback and visual interaction for features.

Background Art

As big data and artificial intelligence develop, the demands on model development speed and model performance keep rising, and with them the demands on the efficiency of data collection, aggregation, and processing.

Data integration and processing are the main factors limiting model development efficiency. Machine learning modeling is a process of repeatedly tuning model parameters and repeatedly adjusting the input data. It is therefore necessary to construct lineage for the processing flow of data features and to create a traceable, visualizable lineage structure.

Constructing feature lineage lets users quickly adjust a model's input features during modeling, reproduce the processing flow from source features to final features, and see a clear path to each feature's origin. It ultimately facilitates cross-platform data processing.

When a model goes live, a traceable lineage structure helps users trace erroneous data back, locate the problem promptly, and correct it quickly.

At present there are few solutions on the market, and the existing ones have the following problems:

1) Lineage is recorded only for part of the process rather than the whole process, so the path from source to result cannot be reproduced.

2) Lineage is built at table granularity and offers only a simple display; users cannot learn the details of a particular feature from what is displayed.

3) The underlying parameters of feature processing methods such as normalization and outlier correction are not fully incorporated into the lineage. Consequently, when feature engineering is reproduced, the generated data set can serve only as a training set, not as a test set.

Patent application 201610127589.0 discloses a method and device for determining a feature engineering strategy. The method obtains multiple feature values of a preset-dimension feature used to train a preset model; determines multiple quantile intervals according to the ordering of the feature values; obtains, for each quantile interval, the positive sample ratio, that is, the number of feature values that are positive samples divided by the number of all feature values in that interval; and calculates the positive sample change rate between the positive sample ratios of any two adjacent quantile intervals. From the positive sample change rates across all quantile intervals, a target feature engineering strategy for processing the preset-dimension feature can be determined.

As another example, patent application 201810669281.8 discloses a method and system for building a machine learning modeling process. The method includes: displaying the constructed machine learning modeling process in a graphical interface for building it; in response to a user operation for running at least one step of the process, running the at least one step; while the at least one step is running, receiving a modification operation from the user for modifying the process; and, in response to the modification operation, modifying the process, wherein, when execution reaches the modified part of the process, it runs according to the modified process.

Summary of the Invention

To solve the above problems, the present invention provides a traceable, visualizable method based on feature engineering lineage. After feature engineering has been performed, the method helps users trace result variables back to their source and presents the entire feature engineering process through visual interaction.

Another object of the present invention is to provide a traceable, visualizable method based on feature engineering lineage that completely records the data processing flow across the whole process, resolving users' confusion about where the final generated features come from. In addition, a clear lineage structure at feature granularity lets users, when selecting result variables useful for modeling, reproduce the logic through the lineage structure, develop models clearly and quickly, and deploy them across platforms.

To achieve the above objects, the technical solution of the present invention is as follows:

A traceable, visualizable method based on feature engineering lineage, the method comprising:

Step one, lineage construction for database wide-table integration: database wide-table integration is the process of integrating multiple tables into a wide table for modeling through mechanisms such as aggregation, association, extraction, and derivation, according to predefined relationships between tables. Aggregation applies aggregate calculations to the other feature columns, grouped by the unique key of the data table. Association joins tables horizontally on a common column. Extraction applies when the left table and the right table are in a one-to-many association: one record is extracted from the right table and associated with the left table. Derivation aggregates feature columns or computes combinations of columns according to business rules. For the fields and tables involved in these operations, lineage is constructed from the table each field belongs to and from the content of the operation;

Further, during wide-table integration, because the aggregation and derivation business rules differ from feature to feature, the aggregation and derivation rules and the upper-level feature of each variable are recorded at the granularity of each single feature, and a standard, traceable data structure is output.

Step two, lineage construction for data preprocessing: the data is the wide-table data produced from the database in step one. Data preprocessing applies common feature engineering operations to the features of that wide table, including but not limited to variable deletion, normalization, missing-value imputation, outlier correction, one-hot encoding, standardization, various kinds of binning, and custom derivation. Because the operations differ, the lineage in this step mainly covers the operation content and the underlying parameters of each operation;

Further, feature engineering in this system is performed as visual processing: according to the principle of each processing method, the underlying parameters of every operation involved are recorded in order, at the granularity of a single feature, and a standard, traceable data structure is output.

Step three, lineage integration: data processing is generally carried out in blocks. Taking the above as an example, the two data structures are finally matched and merged. The source variables of data preprocessing are the result variables of database aggregation and derivation, so the outcome is lineage structure data keyed by the result variables of data preprocessing.

Step four, visual interaction construction: based on the complete lineage output by step three, visual interaction is built at multiple levels and in multiple directions. The levels are table level (which table a target field comes from, showing the relationship between fields and tables), field level (which field a target field comes from, showing the source relationship between fields), and record level (which record a target record comes from, showing the relationship between records). The directions are head to tail (finding the target field from the source field) and tail to head (finding the source field from the result field). The user selects the level interactively.

The present invention addresses the difficulty, in current industry practice, of tracing feature engineering result variables back and of finding the source of erroneous data. It also addresses the disconnect between feature engineering experiments and production, where variable processing is hard to reproduce and display in a production environment. As a result, the whole process can be displayed and reproduced quickly from the lineage structure and applied quickly to production.

The steps of the solution implemented by the present invention are as follows:

Step one, lineage construction for database wide-table integration. Specifically, the data originates in the individual tables of the database, and the tables are constructed atomically, so a wide table must be obtained by aggregating and associating multiple tables. This includes:

101. When aggregation and association are used, the association relationships between the tables and the business relationships between features must be defined in advance. Association relationships include but are not limited to inner joins, left joins, and right joins; feature business relationship types include but are not limited to transaction flow, call record, and SMS record types;

102. Database aggregation is used to summarize the data. The main ways of producing aggregated features are sum, count, maximum, minimum, standard deviation, and mean aggregation; taking the customer as the dimension, these give, for a given feature of a given customer, the sum, non-null count, maximum, minimum, standard deviation, and mean, respectively;

103. Database extraction is used only when the left table is associated with the right table and the right table's records are not unique with respect to the association field; in that case one right-table record is extracted and associated with the left table;

104. When database derivation is used, features are derived according to preset rules for interactive calculation between features, forming new features that carry business meaning and have practical significance in the industry;

When database aggregation and association are performed, each feature is processed differently, so the lineage of each feature is constructed separately. Each current feature is obtained by an aggregation operation on some feature of an upper-level table, so both the upper-level table and the upper-level feature must be recorded. This record is the lineage, and it is what later allows the feature to be reproduced by other means. The lineage includes, for every feature, the upper-level table, the upper-level feature, the aggregation type, the association field of the upper-level table, and the upper-level associated table. It includes but is not limited to these items; the guiding requirement is that the lineage be traceable and visualizable in the production environment.
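The per-feature record described above might look like the following minimal Python sketch. The table name `transactions`, the field names, and the record keys (`parent_table`, `parent_feature`, `aggregation_type`, `join_field`) are illustrative assumptions, not a format given in the source; the use of the population standard deviation is likewise an assumption.

```python
from statistics import mean, pstdev

# Hypothetical source table: one row per transaction.
transactions = [
    {"customer_id": "c1", "amount": 10.0},
    {"customer_id": "c1", "amount": 30.0},
    {"customer_id": "c2", "amount": 5.0},
    {"customer_id": "c2", "amount": None},  # null amount, skipped by the aggregations
]

AGGREGATIONS = {
    "sum": sum,
    "count": len,    # non-null count, because nulls are filtered out first
    "max": max,
    "min": min,
    "std": pstdev,   # population standard deviation (an assumption)
    "mean": mean,
}


def aggregate_with_lineage(rows, table, key, column):
    """Aggregate `column` per `key`; return (wide_rows, lineage)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], [])
        if row[column] is not None:
            groups[row[key]].append(row[column])
    wide, lineage = {}, {}
    for agg_name, func in AGGREGATIONS.items():
        feature = f"{column}_{agg_name}"
        # One lineage record per result feature: where it came from and how.
        lineage[feature] = {
            "parent_table": table,
            "parent_feature": column,
            "aggregation_type": agg_name,
            "join_field": key,
        }
        for k, values in groups.items():
            wide.setdefault(k, {key: k})[feature] = func(values) if values else None
    return list(wide.values()), lineage


wide_table, step1_lineage = aggregate_with_lineage(
    transactions, table="transactions", key="customer_id", column="amount"
)
```

Each wide-table column such as `amount_sum` gets its own entry in `step1_lineage`, which matches the idea of keeping lineage at feature granularity rather than table granularity.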

Step two, wide-table feature data preprocessing and lineage construction, including:

105. The wide table produced by step one is the input of step two. The wide table processed from the database is obtained and split by feature column;

The split by feature column is mainly for parallel processing. The splitting rule keeps the running time of each parallel task as uniform as possible while ensuring that the server can run the parallel tasks simultaneously without crashing or running out of memory. The guiding principles are to speed up processing and to even out processing time.
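The source does not say how the split is computed. One simple way to even out running time, shown here only as an assumed sketch, is a greedy partition: columns are assigned, most expensive first, to whichever task currently has the smallest total. The per-column cost estimates and the function name `split_columns` are hypothetical, and `n_tasks` stands in for however many parallel tasks the server can safely run; memory limits are not modeled.

```python
import heapq


def split_columns(column_costs, n_tasks):
    """Greedy partition: assign each column, largest estimated cost first,
    to the task with the smallest running total."""
    heap = [(0.0, i) for i in range(n_tasks)]  # (total cost, task index)
    heapq.heapify(heap)
    tasks = [[] for _ in range(n_tasks)]
    for column, cost in sorted(column_costs.items(), key=lambda kv: -kv[1]):
        total, i = heapq.heappop(heap)
        tasks[i].append(column)
        heapq.heappush(heap, (total + cost, i))
    return tasks


# Hypothetical per-column cost estimates (for example, seconds from a dry run).
costs = {"age": 1.0, "income": 4.0, "city": 3.0, "gender": 1.0, "balance": 3.0}
batches = split_columns(costs, n_tasks=2)
```

With these example costs, both batches end up with the same estimated total, so the two parallel tasks would be expected to finish at about the same time.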

Further, for each feature column, the data preprocessing operations required for that feature are defined in advance. Specifically, the content of each operation is recorded per feature, building a preprocessing operation flow for each feature. Because the operation flow is only being predefined at this point, no preprocessing has started yet; each feature is recorded here as its own original feature, so that the original feature of every feature is known after preprocessing.

106. Feature preprocessing begins. While features are being preprocessed, the underlying parameters involved in the preprocessing operations are recorded in real time. When preprocessing completes, each feature's processing operations and parameter flow are consolidated to form that feature's current lineage structure.

Note in particular that, during data preprocessing, a single source feature may yield one or more intermediate features after a given preprocessing operation. When the operations and parameter flow of these intermediate features are recorded, they should inherit the operations and parameter flow of the single source feature and be recorded and built on that basis. For a result feature, the lineage record should be complete, from source to result.
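One-hot encoding is a typical case where one source feature produces several features. The sketch below, under assumed record keys (`source_feature`, `operations`, `op`, `params`) and an assumed most-frequent-value fill, shows each encoded feature inheriting the operation stream recorded for its parent before appending its own step.

```python
import copy


def fill_missing(values, lineage):
    """Fill None with the most frequent value and record it as a parameter."""
    present = [v for v in values if v is not None]
    fill_value = max(sorted(set(present)), key=present.count)
    lineage["operations"].append({"op": "fill_missing", "params": {"value": fill_value}})
    return [fill_value if v is None else v for v in values]


def one_hot(name, values, lineage):
    """Expand one feature into one 0/1 feature per category.
    Each new feature inherits the parent's operation stream."""
    out = {}
    for category in sorted(set(values)):
        child = copy.deepcopy(lineage)  # inherit source and earlier operations
        child["operations"].append({"op": "one_hot", "params": {"category": category}})
        out[f"{name}_{category}"] = ([int(v == category) for v in values], child)
    return out


gender = ["F", None, "M", "F"]
lineage = {"source_feature": "gender", "operations": []}
filled = fill_missing(gender, lineage)
encoded = one_hot("gender", filled, lineage)
```

Here the lineage of `gender_M` lists `fill_missing` followed by `one_hot`, so the record for the result feature runs from the source feature to the result without gaps.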

Step three, lineage integration. After the processing of steps one and two, each feature has two kinds of lineage; because the two processing methods differ, the recorded lineage information and structure differ as well. However, the wide table output by step one is the input of step two, so the original features of step two are the result features of step one. Using this relationship, the original feature in step two can be found from a step-two result feature, and because that original feature is a step-one result feature, its original feature in step one can be found in turn. Specifically:

107. By matching and splicing the result features of the two steps, the lineage of each feature is finally constructed, and the corresponding source feature can be found for the result feature of each step.

This lineage captures the whole path of a result feature, from a source feature in some table of the database to the final result feature, and is reasonably complete. The structure can be used to develop the visual display and to reproduce feature processing in production.
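The matching and splicing can be pictured as a join between the two lineage structures on the shared wide-table feature name. The following is a minimal sketch under assumed record layouts; the dictionaries `step1` and `step2` and the function `integrate` are hypothetical and are not taken from the source.

```python
# Hypothetical step-one lineage: wide-table feature -> database origin.
step1 = {
    "amount_sum": {
        "parent_table": "transactions",
        "parent_feature": "amount",
        "aggregation_type": "sum",
        "join_field": "customer_id",
    },
}

# Hypothetical step-two lineage: final feature -> wide-table feature plus operations.
step2 = {
    "amount_sum_norm": {
        "source_feature": "amount_sum",
        "operations": [{"op": "min_max", "params": {"min": 5.0, "max": 40.0}}],
    },
}


def integrate(step1, step2):
    """Join the two lineages: a step-two source feature is a step-one result feature."""
    full = {}
    for result, pre in step2.items():
        source = pre["source_feature"]
        if source not in step1:
            raise KeyError(f"{result}: no step-one lineage for {source!r}")
        full[result] = {
            "database": step1[source],
            "wide_table_feature": source,
            "preprocessing": pre["operations"],
        }
    return full


full_lineage = integrate(step1, step2)
```

The integrated record is keyed by the preprocessing result feature, which is the keying the text describes for the final lineage structure.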

Step four, feature lineage traceback and visual interaction. Because the feature lineage is complete and its structure is sound, it can be applied quickly to development and to visual interaction. Completeness and soundness mean that, for an original feature of some table, the result feature can be obtained without the database by applying to the original feature the processing defined in the lineage structure.

108. The processing logic can be reproduced quickly in any computer programming language, generating result features from original features and supporting traceback, and a multi-level architecture then provides the interactive visual display.

The interactive visual display is based mainly on the complete lineage from step three. It covers multiple levels and multiple orders, and switching between levels and orders is driven by user interaction. The levels are table level, field level, and record level; the orders are tracing back from a result field to its source fields and locating result fields from a source field.
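The two traversal orders and the switch between field level and table level can be illustrated with a small graph walk. This is a sketch only: the edge list, the `table.field` naming, and the function names are assumptions, and the record level is not modeled.

```python
# Hypothetical field-level edges: (source "table.field", result "table.field").
EDGES = [
    ("transactions.amount", "wide.amount_sum"),
    ("customers.age", "wide.age"),
    ("wide.amount_sum", "final.amount_sum_norm"),
    ("wide.age", "final.age_filled"),
]


def trace(field, direction):
    """Walk the lineage from `field`.
    direction="back": result to sources; direction="forward": source to results."""
    found, frontier = [], [field]
    while frontier:
        current = frontier.pop()
        for src, dst in EDGES:
            if direction == "back" and dst == current:
                nxt = src
            elif direction == "forward" and src == current:
                nxt = dst
            else:
                continue
            if nxt not in found:
                found.append(nxt)
                frontier.append(nxt)
    return found


def to_table_level(fields):
    """Collapse a field-level result to table level."""
    return sorted({f.split(".")[0] for f in fields})


sources = trace("final.amount_sum_norm", "back")
results = trace("customers.age", "forward")
tables = to_table_level(sources)
```

An interactive front end could call `trace` with the direction the user picks and apply `to_table_level` when the user switches from field view to table view.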

Because the lineage is kept per feature, a user who needs a particular feature can extract that feature's lineage on its own and reproduce the feature. The lineage is designed and stored in a structure that most computer programming languages can parse, which makes it easy to parse and reproduce in production in various languages.
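The source does not name a storage format. JSON is one structure that most languages can parse, so the sketch below assumes it; the record layout, the operation names, and the `OPS` registry are hypothetical. It extracts the lineage of a single feature and replays the recorded operations on raw values.

```python
import json

lineage_json = json.dumps({
    "age_norm": {
        "source": {"table": "customers", "feature": "age"},
        "operations": [
            {"op": "fill_missing", "params": {"value": 40.0}},
            {"op": "min_max", "params": {"min": 20.0, "max": 60.0}},
        ],
    },
    "city_code": {"source": {"table": "customers", "feature": "city"}, "operations": []},
})

# Hypothetical operation registry; each function takes (value, params).
OPS = {
    "fill_missing": lambda v, p: p["value"] if v is None else v,
    "min_max": lambda v, p: (v - p["min"]) / (p["max"] - p["min"]),
}


def reproduce(feature, raw_values, lineage_text):
    """Re-create one feature from raw values using only its own lineage."""
    record = json.loads(lineage_text)[feature]
    values = list(raw_values)
    for step in record["operations"]:
        values = [OPS[step["op"]](v, step["params"]) for v in values]
    return values


age_norm = reproduce("age_norm", [20.0, None, 60.0], lineage_json)
```

Only the `age_norm` entry is read, so the other features in the stored lineage do not need to be processed to reproduce this one.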

Compared with the prior art, the lineage constructed by this solution completely records the data processing flow across the whole process, resolving users' confusion about the origin of the final generated features and making it easy to trace them to their source. In addition, the interactive visual display lets users locate data problems quickly, and the clear lineage structure at feature granularity lets users, when selecting result variables useful for modeling, reproduce the feature processing logic through the visual interface and generate only the specific result variables they need, so that models can be developed clearly and quickly and deployed across platforms.

Description of the Drawings

FIG. 1 is a schematic diagram of field lineage construction as implemented by the present invention.

FIG. 2 is a basic architecture diagram of the lineage visualization interactive interface as implemented by the present invention.

FIG. 3 is a schematic diagram of reproducing feature engineering from original features to result features as implemented by the present invention.

Detailed Description

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

As shown in FIGS. 1 to 3, the solution mainly comprises four parts. The first part produces a wide table from the original database by aggregation and constructs the lineage related to database aggregation. The second part follows on from the first: it preprocesses the feature data of the wide table produced from the database and constructs the lineage related to preprocessing. The third part is lineage integration: once the first and second parts are complete, their lineages are merged into a lineage structure at per-feature granularity that is sound, visualizable, and traceable. The fourth part is the interactive visual display of the lineage together with fast feature reproduction and traceback based on it: on the basis of the lineage, the logical relationships are sorted out, users are given an intuitive lineage structure, variables can be traced to their source quickly, and the processing flow can be replicated quickly in the production environment according to the structure.

FIG. 1 shows the schematic diagram of field lineage construction. It shows the three steps of constructing field lineage and briefly presents several kinds of information, including tables, table features, and feature processing types.

Specifically, with reference to FIG. 2 and FIG. 3, the steps of the solution implemented by the present invention are as follows:

Step one, database aggregation and lineage construction.

First, the data originates in the individual tables of the database, and the tables are constructed atomically, so a wide table must be obtained by aggregating and associating multiple tables. When aggregation and association are used, the association relationships between the tables must be defined in advance; these include but are not limited to inner joins, left joins, and right joins. The business relationships between features must also be defined. These are numerous and fairly complex. For example, if one column holds the unit price of an item and another holds the quantity a customer purchased, multiplying them gives the total amount the customer spent. "Business relationship" is the general term for such combinations of features that carry business meaning; the types include but are not limited to transaction flow, call record, and SMS record types. Database aggregation is then used to summarize the data. The main ways of producing aggregated features are sum, count, maximum, minimum, standard deviation, and mean aggregation; taking the customer as the dimension, these give, for a given feature of a given customer (total price, quantity, or another feature), the sum, non-null count, maximum, minimum, standard deviation, and mean, respectively. The summary calculations include but are not limited to these.
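The unit price and quantity example can be written as a derivation rule whose inputs and expression are stored alongside the new column. This is an illustrative sketch; the `orders` table, the rule layout, and the lineage keys are assumptions rather than a format from the source.

```python
orders = [
    {"customer_id": "c1", "unit_price": 2.5, "quantity": 4},
    {"customer_id": "c2", "unit_price": 10.0, "quantity": 1},
]

# Hypothetical derivation rule: name, inputs, and the combining function.
RULE = {
    "feature": "total_price",
    "inputs": ["unit_price", "quantity"],
    "expression": "unit_price * quantity",
    "func": lambda unit_price, quantity: unit_price * quantity,
}


def derive(rows, table, rule):
    """Add the derived column in place and return its lineage record."""
    for row in rows:
        row[rule["feature"]] = rule["func"](*(row[c] for c in rule["inputs"]))
    return {
        rule["feature"]: {
            "parent_table": table,
            "parent_features": rule["inputs"],
            "derivation": rule["expression"],
        }
    }


derived_lineage = derive(orders, "orders", RULE)
```

Storing the expression as text next to the input column names keeps the business meaning of the derived feature readable when the lineage is displayed.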

When database aggregation and association are performed, each feature is processed differently, so the lineage of each feature is constructed separately. Each current feature is obtained by an aggregation operation on some feature of an upper-level table, so both the upper-level table and the upper-level feature must be recorded; only then can the feature be reproduced later by other means from the recorded information. For example, a customer's transactions are stored as multiple records (which can be thought of here as a table). After summation aggregation, the total number of records the customer has is known, and that total is a new feature formed by aggregating over multiple records. The lineage covers all features, both intermediate and final: an intermediate feature may be derived from an original feature or from an upper-level intermediate feature, and a final feature may be derived from an intermediate feature or from an original feature. For each feature the lineage includes the upper-level table, the upper-level feature, the aggregation type (that is, the aggregated-feature processing method described above), the association field of the upper-level table, and the upper-level associated table. It includes but is not limited to these items; the guiding requirement is that the lineage be traceable (it is clear how a given final feature was obtained) and visualizable in the production environment (the process can be reproduced from the lineage even outside the current development environment).

Step two, wide-table feature data preprocessing and lineage construction.

Viewed across the whole solution, the wide table produced by step one is a prerequisite for step two; without step one, step two cannot be carried out. For example, suppose step one produces a customer table in which each row represents a customer and each column (feature) holds basic customer information such as age and gender. If a customer's age is missing from this table (absent from the database), it must be imputed, which is one kind of data preprocessing: the ages of all customers are averaged and the mean is used to fill the gap.
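The age example can be sketched as follows. The point of recording the computed mean as a parameter, which the function and key names below only assume a shape for, is that a later batch can be filled with the same value instead of a newly computed one.

```python
from statistics import mean


def fit_mean_fill(values):
    """Return the record needed to reproduce the fill: the mean of the known values."""
    known = [v for v in values if v is not None]
    return {"op": "fill_missing", "params": {"strategy": "mean", "value": mean(known)}}


def apply_fill(values, record):
    """Replace None with the recorded fill value."""
    return [record["params"]["value"] if v is None else v for v in values]


train_ages = [30, None, 50, 40]
record = fit_mean_fill(train_ages)         # mean of 30, 50, 40
train_filled = apply_fill(train_ages, record)

new_ages = [None, 25]                      # later batch, such as test or production data
new_filled = apply_fill(new_ages, record)  # reuses the recorded value, no refit
```

This is the situation the background section points to: without the stored parameter, the reproduced pipeline could only recompute the mean on whatever data it is given.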

The wide table produced by step one is the input of step two. The wide table processed from the database is obtained and split by feature column. The split is mainly for parallel processing. The splitting rule keeps the running time of each parallel task as uniform as possible while ensuring that the server can run the parallel tasks simultaneously without crashing or running out of memory. The guiding principles are to speed up processing and to even out processing time.

Next, for each feature column, the data preprocessing operations required for that feature are defined in advance. The content of each operation is recorded per feature, building a preprocessing operation flow for each feature. As noted above, a feature can be filled; after filling, other operations such as outlier removal can follow, so each feature has its own operation flow. Because the operation flow is only being predefined at this point, no data preprocessing has started yet. Each feature is therefore recorded here as its own original feature, so that the original feature of every feature is known after preprocessing.
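
A minimal sketch of such a predefined operation flow is shown below. The record layout (`original_feature`, `operations`, `params`) and the operation names are assumptions made for illustration; the patent only requires that each feature carries its own ordered list of operations and a reference to its original feature.

```python
def define_operation_flows(plan):
    """Build a per-feature operation flow before any data is processed.

    Parameters are left as None because, at definition time, the
    data-dependent values are not yet known.
    """
    flows = {}
    for feature, op_names in plan.items():
        flows[feature] = {
            # No preprocessing has run yet, so each feature is its own original.
            "original_feature": feature,
            "operations": [{"op": name, "params": None} for name in op_names],
        }
    return flows


# Hypothetical plan: feature names and operation names are illustrative only.
plan = {"age": ["fill_missing", "remove_outliers"], "gender": ["one_hot_encode"]}
flows = define_operation_flows(plan)
```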

Feature preprocessing then begins. During preprocessing, the underlying parameters involved in each data preprocessing operation are recorded in real time according to the operation applied to the feature (the underlying parameters depend on the actual data, so they cannot be recorded at predefinition time). When data preprocessing is complete, the processing operations and parameter flow of each feature are summarized to form the current blood relationship structure of each feature.
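
The following sketch illustrates how data-dependent parameters might be captured while the operations run. The two operations (mean filling and min-max scaling), the function names, and the record layout are assumptions chosen for illustration and are not taken from the patent.

```python
from statistics import mean


def fill_missing(values):
    """Fill None with the mean of observed values; return data and parameters."""
    fill_value = mean(v for v in values if v is not None)
    return [fill_value if v is None else v for v in values], {"fill_value": fill_value}


def min_max_scale(values):
    """Scale to [0, 1]; the observed min and max are the underlying parameters."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero for a constant column
    return [(v - lo) / span for v in values], {"min": lo, "max": hi}


OPERATIONS = {"fill_missing": fill_missing, "min_max_scale": min_max_scale}


def run_flow(feature, values, op_names):
    """Apply the operations in order, recording each one with its parameters."""
    lineage = {"original_feature": feature, "result_feature": feature, "steps": []}
    for name in op_names:
        values, params = OPERATIONS[name](values)
        lineage["steps"].append({"op": name, "params": params})
    return values, lineage


result, lineage = run_flow("age", [20, None, 40, 60], ["fill_missing", "min_max_scale"])
```

Here the fill value and the scaling bounds only become known once the column has been read, which is why they are recorded at run time rather than in the predefined flow.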

Step 3: Blood relationship integration.

After passing through steps 1 and 2, each feature has two kinds of blood relationship. Because the two steps use different processing methods, the recorded blood relationship information and structure differ between them.

However, the wide table output by step 1 is the input of step 2, so the original features of step 2 are the result features of step 1. Based on this relationship, the original feature in step 2 can be found from a result feature of step 2; since that original feature is a result feature of step 1, its original feature in step 1 can be found in turn.

Based on the above relationship, the result features of the two steps are matched and spliced to build the blood relationship of each feature, so that the corresponding source feature can be found for the result feature of every step. The blood relationship captures the whole process of a result feature, from a source feature of a table in the database to the final result feature, and is therefore reasonably complete. The structure can be used to develop visual displays and to reproduce feature processing in production.
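
The matching and splicing can be sketched as a join on feature names: the original feature of a step 2 record is looked up among the result features of step 1. The record fields, table name, and feature names below are hypothetical sample data, not taken from the patent.

```python
def splice_lineage(step1_records, step2_records):
    """Join step 2 records to step 1 records via the shared feature name."""
    by_result = {rec["result_feature"]: rec for rec in step1_records}
    full = {}
    for rec in step2_records:
        # The original feature of step 2 is a result feature of step 1.
        upstream = by_result[rec["original_feature"]]
        full[rec["result_feature"]] = {
            "source_table": upstream["source_table"],
            "source_feature": upstream["source_feature"],
            "steps": [{"stage": "wide_table", "op": upstream["op"]}]
            + [{"stage": "preprocess", **step} for step in rec["steps"]],
        }
    return full


# Hypothetical records produced by step 1 and step 2.
step1 = [{"result_feature": "txn_amount_max", "source_table": "transactions",
          "source_feature": "amount", "op": "max"}]
step2 = [{"result_feature": "txn_amount_max_filled",
          "original_feature": "txn_amount_max",
          "steps": [{"op": "fill_missing", "params": {"fill_value": 120.0}}]}]
full_lineage = splice_lineage(step1, step2)
```

Each entry in `full_lineage` then runs from a source table and source feature through the wide-table operation to the preprocessing steps, which is the end-to-end structure the paragraph describes.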

Step 4: Feature blood relationship backtracking and visual interaction.

Because the feature blood relationship is complete and its structure is sound, it can be applied quickly to development and visual interaction. The completeness and soundness of the structure show in the following: for an original feature of a table, the processing defined by the blood relationship structure can be applied to the original feature, independently of the database, to obtain the result feature. The processing logic can therefore be reproduced quickly in any programming language to generate result features from original features, with support for backtracking, and a visual interactive display can then be provided in a multi-level architecture.
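
As an illustration of reproducing a result feature from the blood relationship alone, the sketch below replays recorded operations with their stored parameters on new raw values and walks the record backwards for tracing. The lineage layout, operation names, and parameter values are assumptions for illustration only.

```python
def replay(values, steps):
    """Re-apply recorded operations using stored parameters, with no database access."""
    for step in steps:
        params = step["params"]
        if step["op"] == "fill_missing":
            values = [params["fill_value"] if v is None else v for v in values]
        elif step["op"] == "min_max_scale":
            span = (params["max"] - params["min"]) or 1
            values = [(v - params["min"]) / span for v in values]
        else:
            raise ValueError(f"unknown operation: {step['op']}")
    return values


def backtrack(lineage):
    """List the path from the result feature back to its source feature."""
    ops = [step["op"] for step in reversed(lineage["steps"])]
    return [lineage["result_feature"], *ops, lineage["source_feature"]]


# Hypothetical lineage record with parameters captured during development.
lineage = {
    "source_feature": "age",
    "result_feature": "age_scaled",
    "steps": [
        {"op": "fill_missing", "params": {"fill_value": 40}},
        {"op": "min_max_scale", "params": {"min": 20, "max": 60}},
    ],
}
production_values = replay([None, 60], lineage["steps"])
path = backtrack(lineage)
```

Because the parameters are stored rather than recomputed, production data is transformed with the same values that were used during development.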

In summary, the blood relationship constructed by this scheme fully records the data processing flow across the whole process, resolves users' confusion about where the final generated features come from, and makes tracing to the source convenient. In addition, the clear blood relationship structure, with the feature as its dimension, lets users who select result variables useful for modeling reproduce the logic from the blood relationship structure and process only the specific result variables they need, so that model development is clear and fast and models can go live across platforms.

The above are preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims

1. A traceability and visualization method based on feature engineering blood relationship, characterized in that the method comprises:

Step 1. Blood relationship construction for database wide-table integration: database wide-table integration is the process of integrating multiple tables into a wide table usable for modeling, based on mechanisms such as aggregation, association, extraction, and derivation and according to predefined relationships between tables. Aggregation applies aggregate calculations to the other feature columns according to the unique key of the data table; association horizontally joins tables that share a common column; extraction, when the left table and the right table are associated one-to-many, extracts one record from the right table and associates it with the left table; derivation performs aggregation of feature columns or combined calculations between columns according to business rules. For the fields and tables involved in the above processing, the blood relationship is built from the table to which each field belongs and the content of the operation.

Step 2. Blood relationship construction for data preprocessing: the data is the wide-table data output by the database in step 1, and data preprocessing applies common feature engineering methods to the features of the wide table, including but not limited to variable deletion, normalization, missing-value filling, outlier correction, one-hot encoding, standardization, multi-class binning, and custom derivation. Because the operation contents differ, the blood relationship in this step mainly involves the operation content and the underlying parameters of the operation.

Step 3. Blood relationship integration: since data processing is generally carried out in blocks, as in the above, the two parts of the data structure are finally matched and summarized. The source variables of data preprocessing are the result variables of database aggregation and derivation, and the output is blood relationship structure data with the result variable of data preprocessing as the dimension.

Step 4. Visual interaction construction: based on the complete blood relationship output in step 3, the visual interaction is constructed at multiple levels and in multiple orders, including table level, field level, and record level, from beginning to end and from end to beginning; the level is selected through user interaction.

2. The traceability and visualization method based on feature engineering blood relationship as claimed in claim 1, characterized in that in step 1, during aggregation, association, extraction, and derivation, according to the differences in the aggregation, association, and derivation business rules of each feature, the aggregation, association, and derivation rules of each variable, together with its upper-level features, are recorded with each single feature as the dimension, and a standard traceable data structure is output.

3. The traceability and visualization method based on feature engineering blood relationship as claimed in claim 1, characterized in that in step 2, the feature engineering is processed in a visual manner, and the processing records, with the single feature as the dimension, each operation involved and its underlying parameters in sequence, and outputs a standard traceable data structure.

4. The traceability and visualization method based on feature engineering blood relationship as claimed in claim 3, characterized in that the specific steps of the method are as follows:

Step 1. Blood relationship construction for database wide-table integration, including:

101. When aggregation and association are used, the association relationships between the tables and the business relationships between features must be predefined; the association relationships include but are not limited to inner association, left association, and right association, and the feature business relationship types include but are not limited to transaction flow, call record, and SMS record types;

102. Data is aggregated by database aggregation; aggregate calculations include but are not limited to maximum, minimum, sum, count, standard deviation, and mean aggregation;

103. Database extraction is used only when the left table is associated with the right table and the records of the right table are not unique with respect to the association field;

104. When database derivation is used, features are derived into new features according to preset rules for interactive calculation between features;

Step 2. Wide-table feature data preprocessing and blood relationship construction, including:

105. Obtain the wide table produced by database processing, and split the data by feature column;

106. Begin feature preprocessing. During feature preprocessing, record in real time the underlying parameters involved in the data preprocessing operations. When data preprocessing is complete, summarize the processing operations and parameter flow of each feature to form the current blood relationship structure of each feature;

Step 3. Blood relationship integration, including:

107. By matching and splicing the result features of the two steps, the blood relationship of each feature is built, and the corresponding source feature can be found for the result feature of each step;

Step 4. Visual interactive display of feature blood relationship, including:

108. Quickly reproduce the processing logic in a computer programming language, generate result features from original features with support for backtracking, and then provide a visual interactive display in a multi-level architecture.

5. The traceability and visualization method based on feature engineering blood relationship as claimed in claim 4, characterized in that when database aggregation and association are performed, each current feature is obtained from a certain feature of an upper-level table after aggregation; this is the blood relationship. For every feature, the blood relationship includes but is not limited to the upper-level table, upper-level feature, aggregation type, upper-level table association field, and upper-level associated table.

6. The traceability and visualization method based on feature engineering blood relationship as claimed in claim 4, characterized in that in step 105, for each feature column, the data preprocessing operations required for that feature are predefined; the content of each operation is recorded by feature, and the preprocessing operation flow of each feature is constructed.

7. The traceability and visualization method based on feature engineering blood relationship as claimed in claim 4, characterized in that in step 106, during data preprocessing, a single source feature may produce one or more intermediate features after a given preprocessing operation; when the operations and parameter flows of these intermediate features are recorded, they shall inherit the operations and parameter flow of the single source feature, and recording and construction continue on that basis.

8. The traceability and visualization method based on feature engineering blood relationship as claimed in claim 4, characterized in that in step 107, the blood relationship captures the whole process of a result feature, from a source feature of a table in the database to the final result feature, and is reasonably complete; the structure can be used to develop visual displays and to reproduce feature processing in production.

9. The traceability and visualization method based on feature engineering blood relationship as claimed in claim 4, characterized in that in step 108, the visual interactive display is based mainly on the complete blood relationship from step 3; the interactive display covers multiple levels and multiple orders, and switching between levels and orders is determined by interaction; the levels include table level, field level, and record level; the visualization orders include tracing back from the result field to the source field and locating the result field from the source field.