CN115757525A

CN115757525A - Column operator lineage construction method, server and computer-readable storage medium

Info

Publication number: CN115757525A
Application number: CN202211526166.8A
Authority: CN
Inventors: 王明灿; 张乐; 徐保荣
Original assignee: Zhejiang Daying Technology Co ltd
Current assignee: Zhejiang Daying Technology Co ltd
Priority date: 2022-11-30
Filing date: 2022-11-30
Publication date: 2023-03-07
Anticipated expiration: 2042-11-30
Also published as: CN115757525B

Abstract

The present invention provides a column operator lineage construction method, applied in the field of computer technology, comprising the following steps: S1) parsing SQL with Antlr to generate a parse tree (ParseTree); S2) designing an abstract syntax tree (AST) for building the lineage links from input columns to output columns together with the intermediate processing logic; S3) recursively traversing the ParseTree obtained in S1) to construct the AST designed in S2); S4) traversing the AST constructed in S3) and extracting the column operator lineage model; S5) traversing the column operator lineage model, building vertex-edge relationships, and storing the column operator lineage in a graph database. The method provides the most fine-grained SQL parsing capability, helping users understand the operator-level lineage of data assets, including the direct sources of a column, the functions used, the processing criteria and the indirect influences on it, and thereby solves the problem that users cannot effectively empower their business because they do not understand their data assets.

Description

Column operator lineage construction method, server, and computer-readable storage medium

Technical Field

The present invention relates to the field of computer technology, and in particular to a column operator lineage construction method, a server, and a computer-readable storage medium.

Background

More and more enterprises need to undergo digital transformation and empower business development through data analysis; this is already an inevitable path for future enterprise development. However, as enterprises grow, their internal data processing methods and processing logic become increasingly complex. Over time, users often can no longer quickly and conveniently understand the processing logic behind historical data, and cannot respond quickly to increasingly agile business requirements. To keep data analysis agile, users have to write comments for processing scripts, or write a complete document, to preserve the reasoning behind the processing at the time, so that data analysts do not produce erroneous results because they misunderstand the data. This, however, depends on the users' own discipline, and such knowledge capture within an enterprise currently cannot be achieved at scale.

Data lineage makes it possible to query the upstream and downstream relationships of data assets. Existing techniques can only explore the upstream and downstream paths of a data asset, for example what lies upstream and downstream of a database, table or column. Current solutions essentially all parse the processing logic scripts of data assets to form table-level or column-level data lineage, so that users can see the sources and uses of data through visualization, providing a basis for subsequent data analysis. The shortcoming of this approach can be seen from the following code:

INSERT INTO T1(C1)

SELECT MAX(T2.C1+T3.C1) AS C1 FROM T2,T3 WHERE T2.C2>1;

Table-level lineage can only show the processing relationship between tables, for example that table T1 is derived from the upstream tables T2 and T3. Column-level lineage can show that T1.C1 is derived from T2.C1 and T3.C1, but it cannot reveal that T1.C1 is obtained by adding T2.C1 and T3.C1 and then applying the MAX function, nor that the data range of T1.C1 is affected by the filter on T2.C2.
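To make the difference concrete, the following Python sketch writes out by hand what each level of lineage would record for the statement above. Nothing is parsed here; the dictionary layout and the helper `sources_of` are illustrative assumptions, not a format defined by this patent.

```python
# Hand-written lineage records for:
#   INSERT INTO T1(C1)
#   SELECT MAX(T2.C1+T3.C1) AS C1 FROM T2,T3 WHERE T2.C2>1;
# The dictionary layout is a hypothetical illustration, not the patent's storage format.

# Table-level lineage: only which tables feed which table.
table_lineage = {"T1": ["T2", "T3"]}

# Column-level lineage: which columns feed which column, without the "how".
column_lineage = {"T1.C1": ["T2.C1", "T3.C1"]}

# Operator-level lineage: the expression applied and the filter that limits the rows.
operator_lineage = {
    "T1.C1": {
        "direct": [
            {"operator": "SELECT", "expression": "MAX(T2.C1+T3.C1)",
             "sources": ["T2.C1", "T3.C1"]},
        ],
        "indirect": [
            {"operator": "WHERE", "expression": "T2.C2>1", "sources": ["T2.C2"]},
        ],
    }
}


def sources_of(column, kind):
    """Return the sorted source columns reaching `column` through `kind` lineage."""
    found = set()
    for record in operator_lineage[column][kind]:
        found.update(record["sources"])
    return sorted(found)


print(sources_of("T1.C1", "direct"))    # columns that supply the value
print(sources_of("T1.C1", "indirect"))  # columns that only influence it
```

Only the operator-level record mentions T2.C2, which is the information that table-level and column-level lineage lose.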

Consequently, the processing logic of different data assets may be identical in terms of the sources used, for example using the same tables or columns, yet the real business logic of a data asset contains a large number of transformation rules. With existing approaches, users cannot intuitively see how the dependencies created by each transformation, calculation and movement affect the data. Their understanding remains incomplete, they still have to consult developers offline, and existing agile business requirements cannot be met. In other words, users know where a column comes from, but not how it was derived.

On the other hand, the direct data sources of a data asset and the data sources on which it depends indirectly through processing carry different data meanings and business meanings. With existing approaches, users find that the range of upstream sources and downstream applications expands enormously, and they cannot directly perceive the direct source of the data. Data analysis therefore cannot proceed normally, and the processing logic and the origin of the data currently have to be recorded by manually tracing query paths.

Furthermore, label propagation based solely on column-level lineage is inaccurate. As data is processed layer by layer, the meaning it carries changes; because the program cannot know what processing produced a column, propagating labels downward layer by layer greatly reduces the accuracy of the propagation.

Summary of the Invention

To solve the above technical problems, the present invention provides the most fine-grained SQL parsing capability, helping users understand the operator-level lineage of data assets, including the direct sources of a column, the functions used, the processing criteria and the indirect influences on it, and thereby solves the problem that users cannot effectively empower their business because they do not understand their data assets.

To solve the above main technical problems, the following technical solution is adopted:

A column operator lineage construction method, comprising the following steps:

S1) parsing SQL with Antlr to generate a parse tree (ParseTree);

S2) designing an abstract syntax tree (AST) used to build the lineage links from input columns to output columns and the intermediate processing logic;

S3) recursively traversing the ParseTree obtained in S1) to construct the AST designed in S2);

S4) traversing the AST constructed in S3) and extracting the column operator lineage model;

S5) traversing the column operator lineage model, building vertex-edge relationships, and storing the column operator lineage in a graph database.

Preferably, S2) comprises the following steps:

S21) based on relational algebra, dividing a complete SQL statement into at least one segment, each segment being a trunk; designing a tree structure, Scope, to abstract the corresponding trunks of the SQL and to establish the hierarchical relationships between trunks;

S22) dividing Scopes into two categories, input Scopes and output Scopes;

wherein input Scopes are divided into ProjectScope, JoinScope, UnionScope and ScanScope, and are used to abstract the input part of the SQL statement;

output Scopes are divided into CreateAsScope and InsertScope, and are used to abstract the output part of the SQL statement;

wherein each Scope contains its type, its alias and the fields it exposes externally; the parent-child relationships between Scopes are recorded by index: the set of outer trunks of a Scope is parentScopeList, i.e. the set of parent Scopes, and the set of inner trunks of a Scope is childrenScopeList, i.e. the set of child Scopes, thereby building the parent-child relationships between Scopes;

S23) defining the six kinds of Scope obtained in S22) so that they correspond to the abstract syntax tree AST;

S24) defining the externally exposed fields of each Scope as that Scope's held field set; the held field set of each Scope can be referenced by the parent Scopes in its parentScopeList;

S25) attaching a source attribute, ExpressionOrigin, to each held field in the held field set to record the source information of that held field, used to build the lineage links between input columns and output columns;

S26) establishing expressions to record the processing logic of a column and the source of the column;

S27) each Scope filling its own held field information from the held field information of the child Scopes in its childrenScopeList and then passing its held field information up to its parent Scope, thereby building the processing link from output columns to source columns and extracting the column operator lineage information.
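As one possible reading of S21) to S25), the Scope structure could be sketched in Python as below. The class and attribute names (`ScopeType`, `HeldField`, `add_child`, `is_output`, and the snake_case fields) are assumptions made for illustration, and `ExpressionOrigin` is reduced here to a (scope alias, field name) pair.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ScopeType(Enum):
    PROJECT = "ProjectScope"
    JOIN = "JoinScope"
    UNION = "UnionScope"
    SCAN = "ScanScope"
    CREATE_AS = "CreateAsScope"
    INSERT = "InsertScope"


# Output Scopes per S22); the remaining four types are input Scopes.
OUTPUT_TYPES = frozenset({ScopeType.CREATE_AS, ScopeType.INSERT})


@dataclass
class ExpressionOrigin:
    """Simplified stand-in: which held field of which child Scope a field comes from."""
    scope_alias: str
    field_name: str


@dataclass
class HeldField:
    name: str
    origins: List[ExpressionOrigin] = field(default_factory=list)


@dataclass(eq=False)  # identity comparison avoids recursing through parent/child links
class Scope:
    scope_type: ScopeType
    alias: str = ""
    hold_fields: List[HeldField] = field(default_factory=list)
    parent_scope_list: List["Scope"] = field(default_factory=list, repr=False)
    children_scope_list: List["Scope"] = field(default_factory=list, repr=False)

    def add_child(self, child: "Scope") -> "Scope":
        """Record the parent-child relationship in both directions."""
        self.children_scope_list.append(child)
        child.parent_scope_list.append(self)
        return child

    @property
    def is_output(self) -> bool:
        return self.scope_type in OUTPUT_TYPES


root = Scope(ScopeType.INSERT, alias="T1")
project = root.add_child(Scope(ScopeType.PROJECT))
scan = project.add_child(Scope(ScopeType.SCAN, alias="T2"))
scan.hold_fields.append(HeldField("C1"))
project.hold_fields.append(HeldField("C1", [ExpressionOrigin("T2", "C1")]))
```

Keeping both `parent_scope_list` and `children_scope_list` mirrors the text; an implementation could equally derive one from the other.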

Preferably, S3) comprises the following steps:

S31) recursively traversing the ParseTree generated in S1), starting from the root node;

S32) interpreting the root node as a CreateAsScope or an InsertScope according to the type of the SQL, and then traversing its child nodes;

S33) if a node is a query node (select xx from), interpreting it as a ProjectScope and building the expressions following SELECT, WHERE, GROUP BY, HAVING and ORDER BY; parsing the child node following FROM into the corresponding Scope and adding it to the childScopeList of the current ProjectScope; once the held field information of the Scopes in childScopeList is available, using it to fill in the source information of the expressions of the current ProjectScope, and then using the expression information to fill in the held fields of the current ProjectScope;

S34) if the child node following FROM is a physical table node, creating a ScanScope as the current Scope and filling in its held fields from the metadata;

if the child node following FROM is a subquery node, jumping to S33) for recursive processing;

if the child node following FROM contains one or more JOIN nodes, creating a JoinScope as the current Scope, building the expressions following all ON conditions, and jumping to S33) for each JoinItem on either side of each JOIN in turn for recursive processing; the held fields of the JoinScope are finally built by widening, i.e. concatenating the held fields of every child Scope in childScopeList, and the source information of all ON expressions is filled in;

if the child node following FROM contains one or more UNION nodes, creating a UnionScope as the current Scope and jumping to S33) for each UnionItem on either side of each UNION in turn for recursive processing; the held fields of the UnionScope are finally built by stacking the held fields of every child Scope in childScopeList;

S35) traversing the ParseTree through S32) to S34) to construct the AST designed in S2).
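The recursion in S31) to S35) can be sketched as follows. This is a minimal Python illustration over a hand-made, dictionary-based parse tree; it does not use Antlr, the node kinds (`insert`, `create_as`, `select`, `table`, `join`, `union`) and the functions `build_scope` and `outline` are hypothetical, and expressions and held fields are left out so that only the shape of the resulting Scope tree is shown.

```python
def build_scope(node):
    """Map one toy parse node to a Scope dict, recursing into its inputs."""
    kind = node["kind"]
    if kind in ("create_as", "insert"):        # S32: the root becomes an output Scope
        scope_type = "CreateAsScope" if kind == "create_as" else "InsertScope"
        return {"type": scope_type, "alias": node["table"],
                "children": [build_scope(node["query"])]}
    if kind == "select":                       # S33: a query node becomes a ProjectScope
        return {"type": "ProjectScope", "alias": node.get("alias", ""),
                "children": [build_scope(node["from"])]}
    if kind == "table":                        # S34: a physical table becomes a ScanScope
        return {"type": "ScanScope", "alias": node["name"], "children": []}
    if kind in ("join", "union"):              # S34: each JOIN / UNION item is recursed into
        scope_type = "JoinScope" if kind == "join" else "UnionScope"
        return {"type": scope_type, "alias": node.get("alias", ""),
                "children": [build_scope(item) for item in node["items"]]}
    raise ValueError(f"unsupported node kind: {kind}")


def outline(scope, depth=0):
    """Return the Scope tree as indented lines, one Scope type per line."""
    lines = ["  " * depth + scope["type"]]
    for child in scope["children"]:
        lines.extend(outline(child, depth + 1))
    return lines


# Toy parse tree: an INSERT whose query joins a table with a subquery.
parse_tree = {
    "kind": "insert", "table": "T1",
    "query": {"kind": "select",
              "from": {"kind": "join",
                       "items": [{"kind": "table", "name": "T2"},
                                 {"kind": "select", "alias": "s",
                                  "from": {"kind": "table", "name": "T3"}}]}},
}

print("\n".join(outline(build_scope(parse_tree))))
```

In a fuller version, each branch would also build its expressions and fill its held fields from the children it has just constructed, as S33) and S34) describe.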

Preferably, S4) comprises the following steps:

S41) extracting information from the Scopes to build the column operator lineage model, the model being divided into operators (Operator), physical columns (Column) and virtual columns (VirtualColumn);

Operators are divided into six types:

SELECT operator: the content of each projection item, i.e. the content after SELECT and before AS, extracted from ProjectScope;

WHERE operator: the content after WHERE, extracted from ProjectScope;

GROUP operator: the content after GROUP, extracted from ProjectScope;

HAVING operator: the content after HAVING, extracted from ProjectScope;

JOIN operator: the content after the ON condition, extracted from JoinScope;

UNION operator: used to combine multiple SELECT statements, extracted from UnionScope;

wherein the SELECT and UNION operators represent direct lineage and are used to trace the processing logic and processing links of a column, while the WHERE, GROUP, HAVING and JOIN operators represent indirect lineage and are used for impact analysis of a column;

Physical columns (Column) comprise the real input physical columns and output physical columns and are metadata; input physical columns are extracted from ScanScope, and output physical columns are extracted from CreateAsScope or InsertScope;

Virtual columns (VirtualColumn): the alias part of SELECT item AS alias in a subquery, extracted from ProjectScope.
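The operator classification above can be captured in a small lookup. The following Python sketch assumes hypothetical names (`OperatorType`, `SOURCE_SCOPE`, `lineage_kind`, `split_operators`); only the six operator types, their source Scopes and the direct/indirect split come from the text.

```python
from enum import Enum


class OperatorType(Enum):
    SELECT = "SELECT"
    WHERE = "WHERE"
    GROUP = "GROUP"
    HAVING = "HAVING"
    JOIN = "JOIN"
    UNION = "UNION"


# Which Scope each operator is extracted from, as listed above.
SOURCE_SCOPE = {
    OperatorType.SELECT: "ProjectScope",
    OperatorType.WHERE: "ProjectScope",
    OperatorType.GROUP: "ProjectScope",
    OperatorType.HAVING: "ProjectScope",
    OperatorType.JOIN: "JoinScope",
    OperatorType.UNION: "UnionScope",
}

# SELECT and UNION carry direct lineage; the other four carry indirect lineage.
DIRECT = frozenset({OperatorType.SELECT, OperatorType.UNION})


def lineage_kind(op):
    return "direct" if op in DIRECT else "indirect"


def split_operators(operators):
    """Group (operator type, expression text) pairs into direct and indirect lineage."""
    grouped = {"direct": [], "indirect": []}
    for op, text in operators:
        grouped[lineage_kind(op)].append((op.value, text))
    return grouped


example = [(OperatorType.SELECT, "MAX(T2.C1+T3.C1)"), (OperatorType.WHERE, "T2.C2>1")]
print(split_operators(example))
```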

In a second aspect, the present invention further provides a server comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the program.

In a third aspect, the present invention further provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the above method.

The beneficial effects of the present invention are as follows. Compared with the prior art, the column operator lineage construction method: 1) deeply parses the processing logic scripts of data assets to determine the detailed logic of all direct and indirect dependencies between data assets in an enterprise's complex data environment, and translates the real data processing code into a user-friendly representation, so that self-service data analysis is not blocked, and the response to agile business requirements is not impaired, by a lack of understanding of the data or by the expansion of data sources and applications;

2) column operator lineage is a finer-grained kind of lineage, from which the direct sources of a column, the functions used, the processing criteria and the indirect influences on it can be analyzed.

Brief Description of the Drawings

In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of the system of the present invention;

Fig. 2 is a system diagram of the six Scope structures of the present invention;

Fig. 3 is a diagram of the relationships among the six kinds of Scope of the present invention;

Fig. 4 is a flowchart of the recursive traversal in step 3) of the present invention;

Fig. 5 is a diagram of the relationship between Scope construction and lineage filling in the present invention;

Fig. 6 is a flowchart of extracting operators from Scopes in the present invention;

Fig. 7 is a schematic diagram comparing the operators extracted from Scopes with the Scope structure in the present invention;

Fig. 8 is a schematic diagram of vertex and edge storage in the graph database.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Embodiment 1

Referring mainly to Fig. 1, the column operator lineage construction method comprises the following steps:

S1) Parse SQL with Antlr to generate a parse tree (ParseTree). The ParseTree generated by Antlr is only an abstract representation of the node information of the original SQL, and column operator lineage information cannot be extracted from it directly. An abstract syntax tree (AST) therefore has to be redesigned on the basis of relational algebra and the basic syntax of SQL. This AST decomposes the SQL into several trunks and expressions and records the lineage relationships between trunks and between expressions, so that column operator lineage can be extracted.

Extracting column operator lineage requires the SQL to have an input part and an output part, i.e. a query part and an output part. The SQL from which this patent can extract column operator lineage therefore falls into two classes: INSERT INTO TABLE SELECT… and CREATE TABLE AS SELECT…. The input part is a DQL statement, and the output part is INSERT or CREATE.

The structure of the above AST is as follows:

1) Based on relational algebra, a complete SQL statement is divided into several segments, each segment being a trunk. This patent defines a new tree structure, Scope, to abstract the corresponding trunks of the SQL and to reflect the hierarchical relationships between trunks. Each Scope contains its type, its alias and the fields it exposes externally. The parent-child relationships between Scopes are recorded by index: the set of outer trunks of a Scope is parentScopeList, i.e. the set of parent Scopes, and the set of inner trunks of a Scope is childrenScopeList, i.e. the set of child Scopes, thereby building the parent-child relationships between Scopes;

2) As stated above, SQL that supports column operator lineage extraction contains an input part and an output part. Scopes are therefore likewise divided into input Scopes and output Scopes. Input Scopes are divided into ProjectScope, JoinScope, UnionScope and ScanScope and mainly abstract the input part of the SQL statement; output Scopes are divided into CreateAsScope and InsertScope and mainly abstract the two main kinds of output part of the SQL statement;

3) Referring mainly to Fig. 2, the six kinds of Scope declared in 2) are defined as follows. If the SQL is CREATE TABLE table_name AS, it is interpreted as a CreateAsScope. Because CREATE TABLE table_name AS is the output part at the start of the SQL, it corresponds to the root node of the Scope tree; the part after AS is interpreted as Scopes of other types, and the parent-child relationship is recorded by index. If the SQL is INSERT INTO table_name(col1,col2…) SELECT, the part before SELECT is interpreted as an InsertScope. Likewise, because INSERT INTO table_name(col1,col2…) is the output part at the start of the SQL, it corresponds to the root node of the Scope tree; the part from SELECT onward is interpreted as Scopes of other types, and the parent-child relationship is recorded by index.

ProjectScope abstracts the structure of a SELECT statement and contains its projection part, WHERE part, GROUP BY part, HAVING part and ORDER BY part. If a SELECT statement is followed by UNION SELECT…, everything from the first SELECT statement to the last SELECT statement is abstracted together as a UnionScope; each SELECT statement is a separate ProjectScope recorded by index in the childScopeList of that UnionScope.

Within a ProjectScope, the content after FROM and before WHERE is interpreted as a new Scope and recorded by index in the children Scope set of the current Scope, i.e. as an inner trunk. If FROM is followed by a physical table name table_name, that table name is interpreted as a ScanScope and the parent-child relationship is recorded by index; in the Scope tree designed by the present invention, all leaf nodes are ScanScopes. If the physical table name after FROM is followed by a JOIN part, the part after FROM and before WHERE is interpreted as a JoinScope whose alias is empty; the content on either side of each JOIN is further parsed into the corresponding Scope, and the ON condition of each JOIN is recorded. If FROM is followed by a subquery, (subQuery) AS alias, the subQuery is likewise interpreted as a ProjectScope with its alias recorded as alias, and the parent-child relationship is recorded by index. If FROM is followed by several subqueries combined by UNION, i.e. (subQuery UNION subQuery…) AS alias, that part is interpreted as a UnionScope, and so on.

Referring mainly to Fig. 3, in the designed Scope AST the parent-child relationships of the different Scopes are as follows:

For CreateAsScope and InsertScope, parentScopeList is always empty; childrenScopeList is never empty and contains either a ProjectScope or a UnionScope.

For ProjectScope, parentScopeList is not empty and contains one of CreateAsScope, InsertScope, JoinScope, ProjectScope or UnionScope; childrenScopeList is not empty and contains one of JoinScope, ProjectScope, UnionScope or ScanScope.

For JoinScope, parentScopeList is not empty and always contains a ProjectScope; childrenScopeList contains at least two Scopes, in any combination of ProjectScope, UnionScope and ScanScope.

For UnionScope, parentScopeList is not empty and contains one of CreateAsScope, InsertScope, JoinScope, ProjectScope or UnionScope; childrenScopeList contains at least two Scopes, in any combination of ProjectScope and JoinScope.

For ScanScope, parentScopeList is not empty and contains either a ProjectScope or a JoinScope; childrenScopeList is always empty.
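The childrenScopeList rules above lend themselves to a table-driven check. The Python sketch below encodes only the child-side rules exactly as stated (allowed child types and minimum child counts) and does not check the parent-side rules; the dictionary-based Scope representation and the function `check_children` are assumptions made for illustration.

```python
# Allowed child Scope types per parent type, taken from the childrenScopeList rules above.
ALLOWED_CHILDREN = {
    "CreateAsScope": {"ProjectScope", "UnionScope"},
    "InsertScope": {"ProjectScope", "UnionScope"},
    "ProjectScope": {"JoinScope", "ProjectScope", "UnionScope", "ScanScope"},
    "JoinScope": {"ProjectScope", "UnionScope", "ScanScope"},
    "UnionScope": {"ProjectScope", "JoinScope"},
    "ScanScope": set(),
}

# Minimum number of children; types not listed need at least one child.
MIN_CHILDREN = {"JoinScope": 2, "UnionScope": 2, "ScanScope": 0}


def check_children(scope, path="root"):
    """Return a list of rule violations found in a nested Scope dict."""
    problems = []
    scope_type = scope["type"]
    children = scope.get("children", [])
    if len(children) < MIN_CHILDREN.get(scope_type, 1):
        problems.append(f"{path}: {scope_type} has too few children")
    for index, child in enumerate(children):
        child_path = f"{path}/{child['type']}[{index}]"
        if child["type"] not in ALLOWED_CHILDREN[scope_type]:
            problems.append(f"{child_path}: not allowed under {scope_type}")
        problems.extend(check_children(child, child_path))
    return problems


valid_tree = {"type": "InsertScope", "children": [
    {"type": "ProjectScope", "children": [
        {"type": "JoinScope", "children": [
            {"type": "ScanScope"}, {"type": "ScanScope"}]}]}]}

print(check_children(valid_tree))
```

A check of this kind could run after the Scope tree is built, as a guard against parse nodes being mapped to the wrong Scope type.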

4) The externally exposed fields of each Scope are defined as that Scope's held field set, holdFields; the held field set of each Scope can be referenced by the Scopes in its parentScopeList. The held field set of a ScanScope is metadata: the set of column names of the physical table is the held field set of the ScanScope. For example, if table t1 has the three columns a, b and c, then {"t1":["a","b","c"]} is the held field set of that ScanScope. The held field set of a ProjectScope is the projection part of the SELECT statement. For example, for SELECT a,b,c AS c1 FROM … with the Scope alias alias (an empty string if there is no alias), the held field set of the ProjectScope is {"alias":["a","b","c1"]}. The held field set of a UnionScope is obtained by stacking the held fields of its child Scopes. For example, for (SELECT d,e,f) UNION (SELECT g,h,i), the held field set of the UnionScope is {"alias":["d","e","f"]}. The held fields of a JoinScope are the held field sets of all its child Scopes gathered and stored widened, i.e. side by side. For example, for t1 JOIN (SELECT d,e,f), where t1 has the three columns a, b and c, the held field set of the JoinScope is {"":["a","b","c","d","e","f"]}. The alias of a JoinScope is always empty, so the key of its held field set is the empty string.

The held field set of a CreateAsScope is the held field set of its child Scope. For example, for CREATE TABLE table_name AS SELECT a,b,c, the held field set of the Scope is {"table_name":["a","b","c"]}. The held field set of an InsertScope is the set of metadata column information of the target table. For example, for INSERT INTO table_name(a,b,c) SELECT d,e,f, the held field set of the Scope is {"table_name":["a","b","c"]}.
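The held-field rules for the six Scope types can be reproduced with a few small functions. In the Python sketch below, all function names are hypothetical, and taking the UnionScope field names from the first branch is an assumption inferred from the UNION example above rather than a rule the text states explicitly.

```python
def _names(held):
    """Flatten a held field set {alias: [names]} into a plain list of names."""
    return [name for names in held.values() for name in names]


def scan_fields(table, metadata):
    """ScanScope: the physical table's column names, keyed by the table name."""
    return {table: list(metadata[table])}


def project_fields(alias, output_names):
    """ProjectScope: the projection's output names, keyed by the Scope alias."""
    return {alias: list(output_names)}


def union_fields(alias, branches):
    """UnionScope: branches are stacked by position; names follow the first branch."""
    first = _names(branches[0])
    for branch in branches[1:]:
        if len(_names(branch)) != len(first):
            raise ValueError("UNION branches must expose the same number of fields")
    return {alias: first}


def join_fields(children):
    """JoinScope: children's fields side by side under the empty alias."""
    widened = []
    for child in children:
        widened.extend(_names(child))
    return {"": widened}


def create_as_fields(table, child):
    """CreateAsScope: the child's held fields, keyed by the new table name."""
    return {table: _names(child)}


def insert_fields(table, target_columns):
    """InsertScope: the target table's column list from the INSERT statement."""
    return {table: list(target_columns)}


metadata = {"t1": ["a", "b", "c"]}
print(join_fields([scan_fields("t1", metadata), project_fields("", ["d", "e", "f"])]))
```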

5) The Scopes designed above have parent-child relationships, i.e. source relationships, so the held fields of each Scope also have source relationships. A source attribute, ExpressionOrigin, is therefore attached to each held field in the held field set, recording which held field of which Scope in childrenScopeList the held field comes from. The held fields of the root Scope, i.e. the CreateAsScope or InsertScope, can then be traced down to the held fields of the physical-layer ScanScopes, building the lineage links between input columns and output columns.
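Tracing a root held field down to the ScanScopes amounts to following these source links recursively. The Python sketch below uses a flat mapping in place of ExpressionOrigin attributes; the scope identifiers, the tables t1, t2 and t3, and the function `trace_to_physical` are all hypothetical.

```python
# (scope id, held field) -> the (child scope id, held field) pairs it comes from.
# Models a hypothetical statement: INSERT INTO t1(c) SELECT MAX(a+b) AS c FROM t2 JOIN t3 ...
ORIGINS = {
    ("insert:t1", "c"): [("project", "c")],
    ("project", "c"): [("join", "a"), ("join", "b")],
    ("join", "a"): [("scan:t2", "a")],
    ("join", "b"): [("scan:t3", "b")],
}


def trace_to_physical(scope, field_name):
    """Follow origin links until reaching fields with no origin (ScanScope fields)."""
    origins = ORIGINS.get((scope, field_name))
    if not origins:
        return [(scope, field_name)]
    leaves = []
    for child_scope, child_field in origins:
        for leaf in trace_to_physical(child_scope, child_field):
            if leaf not in leaves:
                leaves.append(leaf)
    return leaves


print(trace_to_physical("insert:t1", "c"))
```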

6) Held fields alone do not show the processing logic of a column intuitively, so this patent defines an Expression to record the form in which a column is expressed. An expression can be a projection item. For example, in SELECT MAX(a+b) AS col1, c AS col2, 1 AS col3, MAX(a+b) is an expression, and c and 1 are each an expression as well. The content after WHERE is likewise an expression. For example, in WHERE c1>c2 AND c3=c4 OR c5+c6-MAX(c7), the whole of c1>c2 AND c3=c4 OR c5+c6-MAX(c7) is one expression.

Each sub-part after GROUP BY and ORDER BY is also an expression. An expression is stored as a tree: each node holds the type of the current expression, references to each of its sub-expressions, and an origin attribute ExpressionOrigin that stores the source information of the expression. This is the same attribute as the ExpressionOrigin of the held fields described above. The expression shows what the processing logic of a column is, and the source information of the expression shows where the column comes from. By passing this information upward layer by layer through the held fields, the processing link from the output columns to the source columns can be built, and the column operator lineage can then be extracted.
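The tree structure of an Expression can be illustrated with a short sketch. The class layout, the kind labels, and the use of a plain "scope.field" string for the origin are assumptions made for illustration; the patent only states that a node carries a type, references to its sub-expressions, and an ExpressionOrigin.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Expression:
    kind: str   # hypothetical labels: "function", "binary", "column", "literal"
    text: str   # the SQL text of this node
    children: List["Expression"] = field(default_factory=list)
    origin: Optional[str] = None  # stand-in for ExpressionOrigin ("scope.field")


def source_columns(expr: Expression) -> List[str]:
    """Collect the origins of all column leaves, depth first, without duplicates."""
    found: List[str] = []
    if expr.kind == "column" and expr.origin is not None:
        found.append(expr.origin)
    for child in expr.children:
        for col in source_columns(child):
            if col not in found:
                found.append(col)
    return found


# The projection item MAX(a+b) from the example above, with both columns read from t1.
col_a = Expression("column", "a", origin="t1.a")
col_b = Expression("column", "b", origin="t1.b")
a_plus_b = Expression("binary", "a+b", [col_a, col_b])
max_expr = Expression("function", "MAX(a+b)", [a_plus_b])
```

The text of the root node records the processing logic (MAX(a+b)), while source_columns(max_expr) recovers where the column comes from.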

S3) Recursively traverse the parse tree ParseTree obtained in S1) to construct the abstract syntax tree AST designed in S2). The process of constructing Scopes and Expressions is as follows. Building the column operator lineage between output columns and source columns can be reduced to building the column dependencies between the current Scope and the inner Scopes in its childrenScopeList. Furthermore, each held field can only come from its "subqueries", that is, from childrenScopeList. The problem of "tracing and building column operator lineage" can therefore be reduced to the problem of "filling in the fields held by subqueries". The specific process is as follows:

Please refer to Figures 4 and 5.

S33) If the node is a query node of the form select xx from, interpret it as a ProjectScope and construct the expressions after SELECT, WHERE, GROUP BY, HAVING, and ORDER BY. Then interpret the child node after FROM as the corresponding Scope and add it to the childScopeList of the current ProjectScope. Once the held-field information of the Scopes in childScopeList is available, use it to fill in the source information of the expressions of the current ProjectScope, and then use the expression information to fill in the held fields of the current ProjectScope;
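The filling step in S33) can be sketched as follows, assuming the child Scopes' held-field sets are already known. The function names and the tuple-based representation of projections are hypothetical, and resolving a column reference by alias and name is an assumed simplification of the actual lookup.

```python
from typing import Dict, List, Tuple

HoldFields = Dict[str, List[str]]
Origin = Tuple[str, str]  # (child Scope alias, field name)


def resolve_column(ref: str, child_hold_fields: List[HoldFields]) -> Origin:
    """Resolve 'alias.col' or 'col' against the held fields of the child Scopes."""
    qualifier, _, name = ref.rpartition(".")
    matches = [
        (alias, col)
        for hold in child_hold_fields
        for alias, cols in hold.items()
        for col in cols
        if col == name and (not qualifier or qualifier == alias)
    ]
    if len(matches) != 1:
        raise LookupError(f"cannot resolve {ref!r}: {len(matches)} candidates")
    return matches[0]


def fill_project_scope(
    alias: str,
    projections: List[Tuple[str, List[str]]],
    child_hold_fields: List[HoldFields],
) -> Tuple[HoldFields, Dict[str, List[Origin]]]:
    """Fill expression origins from the children, then the Scope's own held fields.

    projections: (output name, column references used by the expression).
    """
    origins = {
        out: [resolve_column(ref, child_hold_fields) for ref in refs]
        for out, refs in projections
    }
    hold_fields = {alias: [out for out, _ in projections]}
    return hold_fields, origins


# Inner query of the later example: SELECT a AS a, MAX(b+c) AS b FROM t1 ... AS tt1
children = [{"t1": ["a", "b", "c", "d"]}]
hold, origins = fill_project_scope("tt1", [("a", ["a"]), ("b", ["b", "c"])], children)
```

The order matters: origins are resolved from the children first, and only then are the held fields of the current ProjectScope exposed to its parent.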

Please refer to Figures 6 and 7. S4) Traverse the column lineage link model built in S3) to extract the column operator lineage. After the abstract syntax tree AST described above is obtained, it is traversed from the root Scope to extract the column operator lineage model. The model contains three kinds of content: the operator Operator, the physical column Column, and the virtual column VirtualColumn.

In this patent, the operators in the column operator lineage are named Operator, and there are six types:

SELECT operator: the content of each projection item, that is, the content after SELECT and before AS, extracted from ProjectScope;

WHERE operator: the content after WHERE, extracted from ProjectScope;

GROUP operator: the content after GROUP BY, extracted from ProjectScope;

HAVING operator: the content after HAVING, extracted from ProjectScope;

JOIN operator: the content after the ON condition, extracted from JoinScope;

UNION operator: used to aggregate multiple SELECT statements, extracted from UnionScope.

The SELECT and UNION operators belong to direct lineage and are used to trace the processing logic and processing link of a column. The WHERE, GROUP, HAVING, and JOIN operators belong to indirect lineage and are used for impact analysis of a column.
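The six operator types and their split into direct and indirect lineage can be written down as a small enumeration. The names OperatorType and is_direct are hypothetical and chosen only for this sketch.

```python
from enum import Enum


class OperatorType(Enum):
    SELECT = "SELECT"  # projection item, extracted from ProjectScope
    WHERE = "WHERE"    # filter condition, extracted from ProjectScope
    GROUP = "GROUP"    # GROUP BY content, extracted from ProjectScope
    HAVING = "HAVING"  # HAVING content, extracted from ProjectScope
    JOIN = "JOIN"      # ON condition, extracted from JoinScope
    UNION = "UNION"    # aggregation of SELECT statements, from UnionScope


# Direct lineage traces how a column is produced; the rest is indirect lineage.
DIRECT = frozenset({OperatorType.SELECT, OperatorType.UNION})


def is_direct(op: OperatorType) -> bool:
    """Return True for operators that belong to direct lineage."""
    return op in DIRECT
```

A traversal that only needs the processing link of a column can follow operators for which is_direct is true, while an impact analysis would also follow the indirect ones.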

The physical column Column is a real input physical column or output physical column and belongs to metadata information. The alias part of SELECT item AS alias in a subquery is defined as the virtual column VirtualColumn.

Please refer to Figure 7. For example: INSERT INTO t2(a,b) SELECT tt1.a AS a, tt1.b AS b FROM (SELECT a AS a, MAX(b+c) AS b FROM t1 WHERE d>1) AS tt1.

In the column operator lineage model, starting from the input physical columns, each layer of subquery outputs a group of operators and virtual columns, with each operator outputting to a virtual column. The operators of the outermost query output directly to the output physical columns rather than to virtual columns. The link of the column operator lineage model is as follows:

Input physical column-->(operator-->virtual column)-->(…)-->operator-->output physical column.

That is, Column-->(Operator-->VirtualColumn)-->(…)-->Operator-->Column.
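The link pattern above can be illustrated by assembling one chain for the example SQL. The helper names build_chain and render are hypothetical, and the operator labels are simply the projection text; how the patent labels operators internally is not stated here.

```python
from typing import List, Tuple

Node = Tuple[str, str]  # (node kind, label)


def build_chain(
    input_column: str,
    subquery_layers: List[Tuple[str, str]],
    outer_operator: str,
    output_column: str,
) -> List[Node]:
    """Column --> (Operator --> VirtualColumn)* --> Operator --> Column."""
    chain: List[Node] = [("Column", input_column)]
    for operator, virtual_column in subquery_layers:
        chain.append(("Operator", operator))
        chain.append(("VirtualColumn", virtual_column))
    # The outermost operator writes to the output physical column directly.
    chain.append(("Operator", outer_operator))
    chain.append(("Column", output_column))
    return chain


def render(chain: List[Node]) -> str:
    """Render a chain in the arrow notation used in the text."""
    return "-->".join(f"{kind}({label})" for kind, label in chain)


# t1.b flows through the inner SELECT MAX(b+c), the virtual column tt1.b,
# and the outer SELECT tt1.b into the output physical column t2.b.
chain_b = build_chain("t1.b", [("SELECT MAX(b+c)", "tt1.b")], "SELECT tt1.b", "t2.b")
```

Each additional subquery layer adds one (Operator, VirtualColumn) pair to subquery_layers; a query with no subquery passes an empty list.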

Please refer to Figure 8. S5) Traverse the operator lineage to build vertex-edge relationships, and store the column operator lineage in a graph database.

In the graph database, create three vertex types: physical column, operator, and virtual column. Traverse the column operator lineage model obtained in S4), build the corresponding vertex-edge relationships, and store them in the graph database. Take the following SQL as an example:

INSERT INTO t2(a,b) SELECT tt1.a AS a, tt1.b AS b FROM (SELECT a AS a, MAX(b+c) AS b FROM t1 WHERE d>1) AS tt1.

Figure 8 is a schematic diagram of how the column operator lineage of this SQL is stored in the graph database.
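As a rough illustration of S5), the sketch below turns lineage chains for the example SQL into vertices and directed edges held in memory. It does not use any graph database API, and the chains themselves are assumptions: in particular, attaching the WHERE operator to both virtual columns of the inner query is one plausible reading and may differ from what Figure 8 shows.

```python
from typing import List, Set, Tuple

Vertex = Tuple[str, str]      # (vertex type, identifier)
Edge = Tuple[Vertex, Vertex]  # directed: source --> target


def chains_to_graph(chains: List[List[Vertex]]) -> Tuple[Set[Vertex], List[Edge]]:
    """Merge lineage chains into a de-duplicated vertex set and edge list."""
    vertices: Set[Vertex] = set()
    edges: List[Edge] = []
    seen: Set[Edge] = set()
    for chain in chains:
        vertices.update(chain)
        for edge in zip(chain, chain[1:]):
            if edge not in seen:
                seen.add(edge)
                edges.append(edge)
    return vertices, edges


def col(name: str) -> Vertex:
    return ("Column", name)


def op(text: str) -> Vertex:
    return ("Operator", text)


def vcol(name: str) -> Vertex:
    return ("VirtualColumn", name)


chains = [
    [col("t1.a"), op("SELECT a"), vcol("tt1.a"), op("SELECT tt1.a"), col("t2.a")],
    [col("t1.b"), op("SELECT MAX(b+c)"), vcol("tt1.b"), op("SELECT tt1.b"), col("t2.b")],
    [col("t1.c"), op("SELECT MAX(b+c)"), vcol("tt1.b"), op("SELECT tt1.b"), col("t2.b")],
    # Indirect lineage (assumed attachment): the filter on d affects both outputs.
    [col("t1.d"), op("WHERE d>1"), vcol("tt1.a")],
    [col("t1.d"), op("WHERE d>1"), vcol("tt1.b")],
]
vertices, edges = chains_to_graph(chains)
```

The resulting vertex set and edge list could then be written to a graph database using whatever client that database provides, with one vertex label per type.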

Embodiment 2

The present invention provides a server comprising a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, the steps of the method described in Embodiment 1 are implemented.

Embodiment 3

The present invention provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the steps of the method described in Embodiment 1 are implemented.

Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, an apparatus, or a computer program product. Accordingly, the embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The embodiments described above are merely preferred embodiments and are not intended to limit the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements for some of their technical features. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A column operator lineage construction method, characterized by comprising the following steps:

S1) Parse SQL through Antlr to generate a parse tree ParseTree;

S2) Design the abstract syntax tree AST, which is used to construct the lineage link from the input column to the output column and the processing logic in the middle;

S3) Recursively traverse the parse tree ParseTree obtained in S1) and construct the abstract syntax tree AST designed in S2);

S4) Traverse the abstract syntax tree AST constructed in S3) and extract the column operator lineage model;

S5) Traversing the lineage model of the column operator and constructing a point-edge relationship, and storing the lineage of the column operator into the graph database.

2. The column operator lineage construction method as claimed in claim 1, characterized in that S2) comprises the steps of:

S21) Based on relational algebra, a complete SQL is divided into at least one paragraph, each paragraph is a trunk, and a tree structure Scope is designed to abstract the corresponding trunk of SQL, and establish a hierarchical relationship between the trunk and the trunk;

S22) Scope is divided into two types: input type Scope and output type Scope;

Among them: the input type Scope is divided into ProjectScope, JoinScope, UnionScope and ScanScope, which are used to abstract the input part of the SQL statement;

The output type Scope is divided into CreateAsScope and InsertScope, which are used to abstract the output part of the SQL statement;

Among them: each Scope includes the type of the Scope, the alias of the Scope, and the fields the Scope exposes externally. The parent-child relationships between different Scopes are recorded through indexes: the set of outer trunks of a Scope is parentScopeList, which is the set of parent Scopes, and the set of inner trunks of a Scope is childrenScopeList, which is the set of child Scopes, thereby building the parent-child relationships between different Scopes;

S23) Define the six kinds of Scope divided in S22), which correspond to the abstract syntax tree AST;

S24) Define the externally exposed fields of each Scope as the holding field set of the Scope, and the holding field set of each Scope can be referenced by the parent Scope in its parentScopeList;

S25) Adding an origin attribute ExpressionOrigin to each holding field in the holding field set to record the source information of the holding field, which is used to build a lineage link between the input column and the output column;

S26) Establish an expression for recording the processing logic of the column and the source of the column;

S27) Each Scope uses the held-field information of the child Scopes in its childrenScopeList to fill in the held-field information of the current Scope, and then passes the held-field information of the current Scope to its parent Scope, thereby constructing the processing link between the output columns and the source columns, and then extracting the lineage information of the column operators.

3. The column operator lineage construction method as claimed in claim 1, characterized in that S3) comprises the steps of:

S31) Recursively traverse the parse tree ParseTree generated in S1), starting from the root node;

S32) Interpret the root node as CreateAsScope or InsertScope according to the type of the SQL, and then traverse the child nodes;

S33) If the node is a query node of the form select xx from, interpret it as a ProjectScope and construct the expressions after SELECT, WHERE, GROUP BY, HAVING, and ORDER BY; interpret the child node after FROM as the corresponding Scope and add it to the childScopeList of the current ProjectScope; once the held-field information of the Scopes in childScopeList is available, use it to fill in the source information of the expressions of the current ProjectScope, and then use the expression information to fill in the held fields of the current ProjectScope;

S34) If the child node after FROM is a physical table node, create a ScanScope, set it as the current Scope, and fill in the held fields of the current Scope according to the metadata information;

If the child node after FROM is a subquery node, jump to S33) for recursive processing;

If the child node after FROM contains more than one JOIN node, create a JoinScope, set it as the current Scope, construct all the expressions after the ON conditions, and pass the JoinItems on both sides of each JOIN one by one to S33) for recursive processing; finally, construct the held fields of the JoinScope by widening the held fields of each child Scope in childScopeList, and fill in the source information of all ON expressions;

If the child node after FROM contains more than one UNION node, create a UnionScope, set it as the current Scope, and pass the UnionItems on both sides of each UNION to S33) for recursive processing; finally, construct the held fields of the UnionScope by superimposing the held fields of each child Scope in childScopeList;

S35) Traverse the parse tree ParseTree through S32)-S34), and construct the abstract syntax tree AST designed by S2).

4. The column operator lineage construction method as claimed in claim 1, characterized in that S4) comprises the steps of:

S41) Extracting information from Scope to build a column operator lineage model, and dividing the column operator lineage model into operator Operator, physical column Column, and virtual column VirtualColumn;

Divide operators into six types, including:

SELECT operator: the content of each projection item, that is, the content after SELECT and before AS, is extracted from ProjectScope;

WHERE operator: the content after WHERE is extracted from ProjectScope;

GROUP operator: the content after GROUP BY is extracted from ProjectScope;

HAVING operator: the content after HAVING is extracted from ProjectScope;

JOIN operator: the content after the ON condition is extracted from JoinScope;

UNION operator: used to aggregate multiple SELECT statements, extracted from UnionScope;

Among them, the SELECT and UNION operators belong to direct lineage and are used to trace the processing logic and processing link of a column, and the WHERE, GROUP, HAVING, and JOIN operators belong to indirect lineage and are used for impact analysis of a column;

The physical column Column includes the real input physical column and output physical column, which belongs to metadata information. The input physical column is extracted from ScanScope, and the output physical column is extracted from CreateAsScope or InsertScope;

Virtual column VirtualColumn: the alias part of SELECT item AS alias in a subquery, extracted from ProjectScope.

5. A server, characterized in that it comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the program, the steps of the method according to any one of claims 1-4 are implemented.

6. A computer-readable storage medium, wherein a computer program is stored thereon, and when the program is executed by a processor, the steps of the method according to any one of claims 1-4 are realized.