WO2025137522A1 - Development environment for automatically generating code using a multi-tiered metadata model - Google Patents
- Publication number
- WO2025137522A1 (application PCT/US2024/061392)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- dataset
- metadata
- data
- control
- values
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
- G06F16/2365—Ensuring data consistency and integrity
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2423—Interactive query statement specification based on a database schema
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2428—Query predicate definition using graphical user interfaces, including menus and forms
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/26—Visual data mining; Browsing structured data
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Definitions
- This disclosure relates to development environments, systems, and methods for automatically generating code or other logic using a multi-tiered metadata model. Specifically, this disclosure provides a development environment that visualizes items of metadata and enables controls to be defined based on that metadata through graphical or visual programming approaches. These defined controls are then used to automatically generate code for processing data that is related to the metadata according to a metadata model.
- Data governance involves the establishment of policies and procedures to ensure high data quality, security, and regulatory compliance.
- data governance has relied on manual processes in which a data steward specifies the requirements for data, and developers implement these requirements using program code. While effective in smaller environments, these manual approaches often become inefficient as data volumes and complexities increase, leading to inconsistencies and errors that can compromise the integrity of governance efforts.
- a method implemented by a data processing system for improving data governance by defining a single control based on a semantic meaning of data and enabling the single control to be automatically applied to multiple, disparate data elements associated with the semantic meaning to govern the data elements including: storing, in a data store, a metadata model including one or more first items of metadata and one or more second items of metadata, with at least one of the one or more first items of metadata specifying a semantic meaning associated with at least one of the one or more second items of metadata, wherein the metadata model specifies a relationship between the at least one of the one or more first items of metadata and the at least one of the one or more second items of metadata; receiving, by a data processing system, a control defined based on the at least one of the one or more first items of metadata specifying the semantic meaning; updating, by a data processing system, the metadata model to include a third item of metadata representing the control; specifying, by a data processing system, a relationship between the third item of metadata representing the control
- operations of the method include rendering, by a data processing system, a user interface including one or more visualizations of the one or more first items of metadata; receiving, by a data processing system and from the user interface, selection data specifying selection of at least one of the one or more visualizations and one or more operations to be applied to data associated with the at least one of the one or more visualizations, the at least one of the one or more visualizations corresponding to the at least one of the one or more first items of metadata specifying the semantic meaning; and generating, by a data processing system and based on the selection data, the control defined based on the at least one of the one or more first items of metadata specifying the semantic meaning.
- operations of the method include receiving, by a data processing system, a specification to process the one or more data elements; responsive to the specification, identifying, based on the metadata model, the at least one of the one or more second items of metadata associated with the one of the one or more data elements; identifying, based on the metadata model, the at least one of the one or more first items of metadata related to the at least one of the one or more second items of metadata; identifying, based on the metadata model, the third item of metadata representing the control defined based on the at least one of the one or more first items of metadata specifying the semantic meaning; and generating instructions for applying the control to the one or more data elements; and executing the instructions to apply the control to the one or more data elements.
- operations of the method include applying, by the data processing system, the control to the one or more data elements by accessing data specifying one or more characteristics of the one or more data elements or one or more datasets including the one or more data elements; based on the data specifying the one or more characteristics, generating instructions for applying the control to the one or more data elements, and executing the instructions to apply the control to the one or more data elements.
- generating the instructions for applying the control to the one or more data elements includes generating first instructions for accessing, from a data store, one or more values of the one or more data elements; generating second instructions for applying the control to the one or more values of the one or more data elements, the second instructions including at least one operation to be performed on the one or more values of the one or more data elements based on the data specifying the one or more characteristics; and generating third instructions for storing the one or more values of the one or more data elements to which the control is applied.
- the control is defined based on at least two of the one or more first items of metadata specifying the semantic meaning
- the method including: applying, by a data processing system, the control, by: identifying, based on the metadata model, one of the one or more second items of metadata related to each of the at least two of the one or more first items of metadata; accessing data specifying a correlation between the one of the one or more second items of metadata related to each of the at least two of the one or more first items of metadata; based on the data specifying the correlation, generating instructions for applying the control to a data element associated with the one of the one or more second items of metadata related to each of the at least two of the one or more first items of metadata, and executing the instructions to apply the control to the data elements.
- generating the instructions based on the data specifying the correlation includes: based on the data specifying the correlation, generating instructions for joining a data element associated with the one of the one or more second items of metadata related to each of the at least two of the one or more first items of metadata; and generating instructions for applying the control to the joined data elements.
- updating the metadata model to include the third item of metadata representing the control includes generating an instance of a data structure that includes the metadata representing the control and wherein the relationship between the third item of metadata representing the control and the at least one of the one or more first items of metadata is specified by the instance of the data structure including a reference to another instance of a data structure associated with the at least one of the one or more first items of metadata, or the other instance of the data structure associated with the at least one of the one or more first items of metadata including a reference to the instance of the data structure.
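The two ways of recording the control-to-metadata relationship described above (the control instance referencing the other instance, or the reverse) can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the names `BdeNode`, `ControlNode`, `add_control`, and the dictionary-based `model` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BdeNode:
    """Hypothetical instance for a first item of metadata (a semantic meaning)."""
    name: str
    controls: list = field(default_factory=list)  # used when the BDE refers to the control


@dataclass
class ControlNode:
    """Hypothetical instance for the third item of metadata (the control)."""
    expression: str
    bdes: list = field(default_factory=list)  # used when the control refers to the BDE


def add_control(model, expression, bde_names, back_reference=False):
    """Add a control instance to the model and link it to the named BDE instances."""
    control = ControlNode(expression)
    for name in bde_names:
        bde = model["bdes"][name]
        if back_reference:
            bde.controls.append(control)   # the BDE instance holds the reference
        else:
            control.bdes.append(bde)       # the control instance holds the reference
    model["controls"].append(control)
    return control


model = {"bdes": {"Date of Birth": BdeNode("Date of Birth")}, "controls": []}
control = add_control(model, "Date of Birth > 1/1/1900", ["Date of Birth"])
```

Either direction is enough to find the control from the semantic meaning or the reverse, provided the lookup code knows which side holds the reference.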
- the instructions for applying the control to the one or more data elements are code.
- control is applied to the one or more data elements in accordance with the characteristics of: the one or more data elements, or one or more datasets including the one or more data elements, to which the control is applied.
- the metadata model incorporates the data specifying the one or more characteristics, such as into links between items of the second items of metadata.
- the data specifying the one or more characteristics includes data specifying the data types of the one or more data elements or the one or more datasets including the one or more data elements.
- the data specifying one or more characteristics of the one or more data elements, or of one or more datasets including the one or more data elements, includes information about intra- and/or inter-dataset relationships.
- the applying of the control to one or more data elements includes applying the control across multiple data elements within the same or different one or more datasets.
- the one or more characteristics include information identifying at least one of a primary key, a record format, a data type of a field, or a primary-foreign key relationship with another dataset.
- executing of the instructions includes: compiling the instructions to produce executable code; and executing the executable code to apply the control to the one or more data elements.
- a method implemented by a data processing system for using a development environment to automatically generate code from a multi-tiered metadata model including: receiving, by a data processing system, a specification to process at least a portion of a dataset; responsive to the specification, accessing, by a data processing system, one or more characteristics of the dataset; and identifying, by a data processing system, one or more controls received from a development environment to be applied to one or more values of a field of the dataset in accordance with a metadata model, by: accessing a first instance of a data structure storing an identifier that corresponds to the dataset; based on a reference to a second instance of a data structure stored in the first instance of the data structure, accessing the second instance of the data structure associated with the field of the dataset; based on a reference to a third instance of a data structure stored in the second instance of the data structure, accessing the third instance of the data structure associated with metadata that describes one or more values of the field of the
- the one or more characteristics of the dataset include a primary-foreign key relationship with another dataset.
- the reference stored in each of the first instance of the data structure, the second instance of the data structure, and the third instance of the data structure is a respective pointer to a memory location at which the instance of the data structure referred to by the reference is stored.
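The chain of references from a dataset instance to a field instance to a metadata instance, and from there to a control, can be sketched as below. This is an illustration under stated assumptions: all class and function names are hypothetical, Python object references stand in for memory pointers, and the final hop from the metadata instance to the control is assumed rather than taken from the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ControlInstance:
    name: str
    check: Callable[[object], bool]


@dataclass
class MetadataInstance:
    """Third instance: metadata describing the values of a field."""
    meaning: str
    control: Optional[ControlInstance] = None  # assumed link to the control


@dataclass
class FieldInstance:
    """Second instance: associated with a field of the dataset."""
    field_name: str
    metadata: Optional[MetadataInstance] = None


@dataclass
class DatasetInstance:
    """First instance: stores an identifier that corresponds to the dataset."""
    dataset_id: str
    fields: List[FieldInstance]


def find_control(dataset, field_name):
    """Follow the references dataset -> field -> metadata -> control."""
    for fld in dataset.fields:
        if fld.field_name == field_name and fld.metadata and fld.metadata.control:
            return fld.metadata.control
    return None


non_negative = ControlInstance("non_negative", lambda v: v >= 0)
balance = MetadataInstance("Account Balance", control=non_negative)
accounts = DatasetInstance(
    "accounts",
    [FieldInstance("bal_amt", balance), FieldInstance("misc")],
)
found = find_control(accounts, "bal_amt")
```

Each hop is a direct reference lookup, so no search over the whole model is needed once the dataset instance is in hand.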
- generating the code for applying the control to the one or more values of the field of the dataset includes generating first code for accessing the one or more values of the field of the dataset from a data store; generating second code for applying the control to the one or more values of the field of the dataset, the second code including at least one operation to be performed on the one or more values of the field of the dataset based on the determined one or more characteristics of the dataset; and generating third code for storing the one or more values of the field of the dataset to which the control is applied.
- the at least one operation comprises an operation to transform a data type of the one or more values of the field of the dataset.
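The three-part generation described above (access code, control code with a characteristic-dependent operation, storage code) can be sketched as a small source-code generator. This is a hedged illustration only: `generate_code`, the dictionary-based `store`, and the `_passed` output key are hypothetical, the control expression is assumed to come from a trusted source, and Python's built-in `compile` and `exec` stand in for whatever compilation step a real system would use.

```python
def generate_code(field, data_type, control_expr):
    """Return Python source for a function that applies control_expr to one field."""
    lines = [
        "def run(store):",
        # First code: access the field's values from the data store.
        f"    values = store[{field!r}]",
    ]
    if data_type == "string":
        # Operation chosen from the dataset characteristics: transform the data type.
        lines.append("    values = [int(v) for v in values]")
    # Second code: apply the control to each value.
    lines.append(f"    results = [({control_expr}) for value in values]")
    # Third code: store the outcome next to the original field.
    lines.append(f"    store[{field!r} + '_passed'] = results")
    lines.append("    return results")
    return "\n".join(lines)


source = generate_code("age", "string", "value >= 18")
namespace = {}
exec(compile(source, "<generated>", "exec"), namespace)  # compile, then execute

store = {"age": ["17", "42"]}
print(namespace["run"](store))  # [False, True]
```

The same control expression yields different generated code for an integer field, where the conversion line is simply omitted.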
- the at least one operation comprises an operation to join the one or more values of the field of the dataset with one or more values of a field of another dataset.
- control is defined based on the metadata that describes the one or more values of the field of the dataset and second metadata that describes one or more values of a field of another dataset.
- generating the code for applying the control to one or more values of the field of the dataset includes generating code for joining the one or more values of the field of the dataset with the one or more values of the field of the other dataset; and generating code for applying the control to the joined one or more values of the field of the dataset and the one or more values of the field of the other dataset.
- operations of the method include: segmenting the metadata model; and identifying, based on the segmented metadata model, the one or more controls to be applied to the one or more values of the field of the dataset.
- executing the code includes: compiling the code to produce executable code; and executing the executable code to apply the control to the one or more values of the field of the dataset.
- applying the identified control to the one or more values of the field of the dataset includes executing code on the one or more values of the field of the dataset, to which the control is applied, in accordance with the characteristics of the datasets to which the control is applied.
- the one or more characteristics include information about intra- and/or inter-dataset relationships
- applying the identified control to the one or more values of the field of the dataset includes applying executable code across multiple fields within the same or different one or more datasets.
- a data processing system includes one or more processors and memory storing instructions executable by the one or more processors to perform the method of any of the first through nineteenth aspects.
- one or more non-transitory computer-readable storage media store instructions executable by one or more processors to perform the method of any of the first through nineteenth aspects.
- in a thirtieth aspect, an apparatus includes one or more processors and memory storing instructions executable by the one or more processors to perform the method of any of the first through nineteenth aspects.
- One or more of the above aspects may provide one or more of the following advantages.
- the techniques described here provide a development environment that enables a non-technical user to define metadata controls, rules, and other logic at a logical level. These controls are then automatically propagated down to data, including existing data and new data added into the system after the control has been defined. As such, new controls do not need to be defined for each new dataset that is added to a system.
- the system described here automatically applies those controls to new and existing datasets, thereby making data governance efficient. This is because controls tend to stabilize over time. Once the controls have been defined, the system can automatically apply these controls to new datasets without new controls having to be defined. In this way, the techniques described here perform data governance more efficiently and with less resource consumption relative to systems that perform governance by defining controls individually for each dataset.
- the techniques described here also improve the accuracy and robustness of metadata-based data governance and other data processing by using a metadata model that incorporates characteristics of datasets into the links between items of technical metadata, which represent data, and items of logical metadata, which give meaning to the technical metadata.
- the metadata model can include metadata specifying the data types, scope (e.g., system or application), and other attributes of datasets or their data elements, which allows top-level controls to be transformed into executable logic in a way that accounts for the physical level characteristics of the underlying data.
- the metadata model can include information about intra- and inter-dataset relationships, which enables top-level controls to be defined across multiple data elements within the same or different datasets. In this manner, arbitrarily complex controls, rules, and other logic can be defined at a logical level and then automatically and accurately applied to both new and existing data at the physical level.
- a templated control is described in which the control is defined without reference to any specific item of data. In this manner, the templated control only needs to be defined once before being applied across some or all of the data described in a metadata model, thereby increasing the efficiency with which the data is governed.
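One way to picture a templated control is as logic written once, with no dataset or field named in it, and then applied to every data element that a model describes. The sketch below assumes a simple "no missing values" template; `not_null_template`, `apply_template`, and the nested-dictionary layout are hypothetical and are not taken from the disclosure.

```python
def not_null_template(values):
    """Templated control: refers to no specific dataset or field."""
    return all(v is not None for v in values)


def apply_template(template, datasets):
    """Apply one template to every field of every dataset."""
    return {
        (dataset_name, field_name): template(values)
        for dataset_name, fields in datasets.items()
        for field_name, values in fields.items()
    }


datasets = {
    "customers": {"id": [1, 2], "email": ["a@example.com", None]},
    "orders": {"order_id": [10, 11]},
}
report = apply_template(not_null_template, datasets)
```

Adding a new dataset to `datasets` requires no new control definition; the same template is picked up on the next run.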
- anomaly detection controls are described that are configured to identify anomalies in defined segments of a metadata model, thereby facilitating the identification of a root cause of data quality issues.
- a data structure which has multiple connected instances.
- the inventors have recognized that using these multiple instances of the data structure, each preferably linked via a pointer in memory, the computer can be controlled along its path to the desired control code to be applied to a dataset in a way that is particularly computationally efficient.
- FIG. 1 is a diagram of an example metadata model.
- FIGS. 2A and 2B are diagrams of example systems for generating and applying metadata controls for data processing.
- FIGS. 3A and 3B are diagrams of example systems for generating metadata controls.
- FIGS. 4A and 4B are diagrams of example systems for applying metadata controls for data processing.
- FIG. 4C is a diagram of an example interface visualizing the results of applying metadata controls.
- FIG. 5 is a diagram of an example system for applying metadata controls to new data.
- FIGS. 6A and 6B are diagrams of example systems for applying metadata controls across multiple datasets.
- FIGS. 7A and 7B are diagrams of example systems for applying templated controls.
- FIG. 8 is a diagram of an example system for applying controls for anomaly detection.
- FIGS. 9 and 10 are flow diagrams of example processes for generating and applying metadata controls for data processing.
- FIG. 11 is a diagram of an example computing system.
- Modern data processing systems store overwhelming volumes of complex data that needs to be governed or otherwise managed.
- a data processing system of a large organization may store millions of datasets (e.g., tables or files), with each dataset containing multiple data elements (e.g., columns or fields) that need to be governed.
- This data is dynamic, and the amount of data continuously grows over time.
- controls (e.g., rules or other logic)
- new controls would need to be defined to govern that new dataset or data element, creating a perpetual cycle of constantly defining controls.
- a data processing system can store technical metadata that specifies the names of the data elements within its data store(s).
- the data processing system can also store logical metadata specifying the logical concepts that describe or give meaning to the data elements.
- Manual and/or automatic processes can then be performed to link each item of technical metadata to a corresponding item of logical metadata.
- data governance controls can be specified with respect to an item of logical metadata and automatically applied to linked items of technical metadata.
- directly linking technical metadata and logical metadata may not provide sufficient information to govern the underlying data effectively.
- the data stored by an organization often resides in several different systems or applications, each having different capabilities. This data can have a wide range of different data types, data formats, and other characteristics.
- the data may have specific relationships, both within a single dataset and across multiple datasets. Using metadata to link data elements to logical concepts, without more, may not provide this contextual information, leading to subpar data governance.
- a control defined with respect to that logical concept may operate as intended when executed against a data element containing data of one data type (e.g., integer), but may produce errors when executed against another data element containing data of a different data type (e.g., string).
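The data type problem described above is easy to reproduce. In this minimal sketch, the hypothetical `control` and `try_apply` functions show one and the same check succeeding on integer values and failing on string values:

```python
def control(value):
    """A control written with numeric data in mind."""
    return value > 0


def try_apply(values):
    """Apply the control and report a type failure instead of crashing."""
    try:
        return [control(v) for v in values]
    except TypeError as exc:
        return f"error: {type(exc).__name__}"


integer_result = try_apply([3, -1])      # integer field
string_result = try_apply(["3", "-1"])   # same concept stored as strings
```

Here `integer_result` is a list of pass/fail flags, while `string_result` is an error report, because Python cannot compare a string with an integer.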
- the present disclosure describes techniques for improved metadata-based data governance using a multi-tiered metadata model that incorporates characteristics of datasets into the flow among items of technical metadata, which represent data, and items of logical metadata, which give meaning to the technical metadata.
- the metadata model can include metadata specifying the data types, scope (e.g., system or application), and other attributes of datasets or their data elements, which allows top-level controls to be resolved in a way that takes the physical level characteristics of the underlying data into account.
- the metadata model can include information about intra- and inter-dataset relationships, which enables top-level controls to be defined across multiple data elements within the same or different datasets.
- a metadata model is a structured representation of metadata and relationships among the metadata.
- the metadata model can be an object model or a data structure (e.g., a schema) that includes nodes representing items of metadata (e.g., technical or logical metadata) and edges representing relationships among the items of metadata.
- technical metadata includes metadata that describes attributes of stored data, such as its technical name (e.g., dataset name, field name, etc.).
- Logical metadata includes metadata that gives meaning or context to data, such as its semantic or business name.
- a node can be a data object or other data structure that includes values for attributes of the item of metadata that it represents.
- the attributes included in a node can depend on the type or class of metadata that the node represents.
- a node representing a dataset can include a dataset name attribute that is populated with the name of the dataset that the node represents.
- An edge can be a reference, a pointer, a data object, or another data structure that specifies a relationship between nodes.
- an edge can represent a hierarchical relationship between nodes (e.g., a parent-child relationship), such as a relationship between a dataset node (the parent) and a node of a technical data element it contains (the child).
- an edge can represent an associative relationship between nodes, such as a relationship between a technical data element node and a business data element node that describes or gives meaning to the technical data element node.
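The node and edge structure described above can be sketched as a small labeled graph. The `Node` class, the `link` and `neighbors` helpers, and the relation labels `contains` and `described_by` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str    # e.g. "dataset", "tde", or "bde"
    attrs: dict  # attribute values for the item of metadata
    edges: list = field(default_factory=list)  # (relation, target node) pairs


def link(source, relation, target):
    """Add an edge from source to target labeled with a relation."""
    source.edges.append((relation, target))


def neighbors(node, relation):
    """Return the nodes reachable from node over edges with the given relation."""
    return [target for rel, target in node.edges if rel == relation]


dataset = Node("dataset", {"name": "customers"})
tde = Node("tde", {"name": "cust_dob"})
bde = Node("bde", {"name": "Date of Birth"})

link(dataset, "contains", tde)   # hierarchical (parent-child) edge
link(tde, "described_by", bde)   # associative edge giving the TDE its meaning
```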
- metadata model 100 is a multi-tiered metadata model that includes several tiers or layers corresponding to different types of metadata, with nodes in a given layer representing an item of metadata of that type.
- metadata model 100 has a dataset layer that includes nodes 102a, 102b specifying metadata for datasets that are stored in a data store or other storage device.
- Each of nodes 102a, 102b can specify values for one or more attributes of a dataset, such as the name of the dataset.
- Although only nodes 102a, 102b are shown in this example, other examples can include many more nodes (e.g., thousands or millions) representing all of the datasets stored in a data processing system.
- TDE: technical data element
- each of nodes 104a, . . ., 104f can specify a name of the corresponding TDE, among other attributes.
- nodes 104a, . . ., 104c represent TDEs of the dataset corresponding to node 102a
- nodes 104d, . . ., 104f represent TDEs of the dataset corresponding to node 102b.
- each of TDE nodes 104a, . . ., 104c can be connected to dataset node 102a by an edge (e.g., a reference, a pointer, a data object, etc.), and each of TDE nodes 104d, . . ., 104f can be connected to dataset node 102b by an edge.
- Metadata model 100 also includes a business data element (BDE) layer.
- nodes 108a, 108b in the BDE layer specify logical names, terms, or other metadata that describes or gives meaning to TDEs and their underlying data.
- semantic discovery processes can be used in which a series of statistical checks on a TDE and its associated data are performed in order to discover, classify, and label the TDE (and its data) with a BDE representing their semantic meaning. Additional details regarding the semantic discovery processes are described in U.S. Patent No. 11,704,494, titled “Discovering a semantic meaning of data fields from profile data of the data fields,” the entire content of which is incorporated herein by reference.
- TDE nodes 104a, 104e each relate to the same logical concept represented by BDE node 108a
- TDE nodes 104b, 104f each relate to the same logical concept represented by BDE node 108b.
- BDE node 108a describes or gives meaning (e.g., a semantic meaning) to each of TDE nodes 104a, 104e
- BDE node 108b describes or gives meaning to each of TDE nodes 104b, 104f.
- BDE nodes 108a, 108b can also describe or give meaning to other TDE nodes, as depicted in FIG. 1.
- TDE nodes 104a, 104e are linked (e.g., by edges) to BDE node 108a
- TDE nodes 104b, 104f are linked to BDE node 108b.
- valuable context can be provided to the cryptic names of TDEs (e.g., X152) that convey little about the type of information they represent.
- controls defined with respect to a BDE can be automatically propagated to multiple linked TDEs, thereby reducing the number of controls that need to be defined.
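Propagation of one BDE-level control to every linked TDE can be sketched as a lookup followed by a loop. The mapping tables, dataset and field names, and sample values below are all hypothetical; only the idea of defining the control once against the BDE comes from the text.

```python
from datetime import date

# BDE name -> linked TDEs, each identified by (dataset, field).
BDE_TO_TDES = {
    "Date of Birth": [("customers", "dob"), ("patients", "birth_dt")],
}

# One control per BDE, defined once.
CONTROLS = {
    "Date of Birth": lambda value: value > date(1900, 1, 1),
}

# Illustrative values for each TDE.
DATA = {
    ("customers", "dob"): [date(1985, 3, 2), date(1899, 12, 31)],
    ("patients", "birth_dt"): [date(2001, 7, 9)],
}


def apply_controls():
    """Apply every BDE-level control to all TDEs linked to that BDE."""
    results = {}
    for bde, check in CONTROLS.items():
        for tde in BDE_TO_TDES[bde]:
            results[tde] = [check(value) for value in DATA[tde]]
    return results


results = apply_controls()
```

Linking a third TDE to "Date of Birth" only requires a new entry in `BDE_TO_TDES`; the control itself is untouched.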
- a controls layer that includes a node 110 specifying metadata that defines a control (e.g., a rule or other logic) for governing data.
- a control may refer to rules or other logic, which may be compiled into executable logic or executable code that is executable to apply the control to data.
- the node 110 specifies metadata that defines a control with respect to the BDEs represented by nodes 108a, 108b, and therefore is linked to each of BDE nodes 108a, 108b by edges.
- a control is defined with respect to one or more BDEs (or other logical elements in the metadata model) by referencing the BDE as a parameter in the control logic (e.g., Date of Birth > 1/1/1900, where “Date of Birth” corresponds to a BDE).
- Controls can be manually or automatically created and can be of various types, such as monitoring controls that monitor data against criteria without altering the data, preventative controls that reject data that does not satisfy certain criteria, and corrective controls that correct data according to certain criteria, among others.
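The difference between monitoring, preventative, and corrective controls can be sketched with one function that treats failing values differently per type. `ControlType`, `apply_control`, and the `fix` callback are hypothetical names; the behavior for each type follows the descriptions above.

```python
from enum import Enum


class ControlType(Enum):
    MONITORING = "monitoring"      # report failures, leave data unchanged
    PREVENTATIVE = "preventative"  # reject failing data
    CORRECTIVE = "corrective"      # correct failing data


def apply_control(values, check, control_type, fix=None):
    """Return (output values, failing input values) for the given control type."""
    if control_type is ControlType.CORRECTIVE and fix is None:
        raise ValueError("a corrective control needs a fix function")
    output, failures = [], []
    for value in values:
        if check(value):
            output.append(value)
            continue
        failures.append(value)
        if control_type is ControlType.MONITORING:
            output.append(value)       # data passes through unaltered
        elif control_type is ControlType.CORRECTIVE:
            output.append(fix(value))  # data is corrected
        # PREVENTATIVE: the failing value is dropped from the output
    return output, failures


values = [5, -1, 3]
is_non_negative = lambda v: v >= 0
monitored = apply_control(values, is_non_negative, ControlType.MONITORING)
prevented = apply_control(values, is_non_negative, ControlType.PREVENTATIVE)
corrected = apply_control(values, is_non_negative, ControlType.CORRECTIVE, fix=abs)
```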
- Metadata model 100 also includes a dataset characteristics layer.
- the dataset characteristics layer specifies characteristics of datasets and their TDEs.
- the dataset characteristics layer can include metadata specifying the data types, relationships (e.g., primary-foreign key relationships), record format, scope (e.g., system or application), and/or other attributes of datasets and their TDEs.
- the dataset characteristics layer includes nodes 106a, 106b with metadata specifying respective characteristics for the datasets represented by nodes 102a, 102b and their TDEs represented by nodes 104a, . . ., 104f.
- the information provided by the characteristics layer enables controls defined with respect to one or more BDEs to be resolved (e.g., transformed into executable instructions) in a way that accounts for the characteristics of datasets and TDEs, thereby ensuring that the top-level controls accurately execute against the physical-level data.
- the characteristics layer can be or include a definition of an expanded view dataset that specifies a base dataset (e.g., a dataset corresponding to one of the nodes 102a, 102b) and datasets related to the base dataset, and also specifies logic for generating an expanded view dataset (e.g., a wide record) that includes the data from the base dataset and the related datasets.
- node 106a in the dataset characteristics layer can include a definition of an expanded view dataset that includes logic for joining (e.g., based on primary-foreign key relationships) a base dataset corresponding to node 102a with a related dataset corresponding to node 102b.
- node 106b in the dataset characteristics layer can include a definition of an expanded view dataset that includes logic for joining a base dataset corresponding to node 102b with a related dataset corresponding to node 102a. Additional details regarding the expanded view dataset definition are described below and in U.S. Patent Application No. 18/492,904, titled “Logical Access for Previewing Expanded View Datasets,” the entire content of which is incorporated herein by reference.
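An expanded view dataset can be pictured as a wide record built by joining a base dataset to a related dataset on a primary-foreign key relationship. The sketch below assumes in-memory lists of dictionaries; `expanded_view`, the `rel_` prefix, and the sample data are hypothetical and do not reflect the definition format used in the referenced application.

```python
def expanded_view(base_rows, related_rows, foreign_key, primary_key):
    """Join each base row to its related row and return wide records."""
    index = {row[primary_key]: row for row in related_rows}
    wide = []
    for row in base_rows:
        related = index.get(row[foreign_key], {})
        # Prefix related fields so they cannot overwrite base fields.
        wide.append({**row, **{f"rel_{k}": v for k, v in related.items()}})
    return wide


orders = [{"order_id": 1, "cust_id": 10}, {"order_id": 2, "cust_id": 99}]
customers = [{"cust_id": 10, "name": "Ada"}]
wide_records = expanded_view(orders, customers, "cust_id", "cust_id")
```

A control that spans both datasets can then be evaluated against each wide record as if it were a single row. Base rows with no match keep only their own fields.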
- the expanded view dataset definition (and/or other characteristics specified in the dataset characteristics layer) are used to improve the accuracy and efficiency of metadata-based data governance, as described herein.
- the example metadata model 100 illustrated in FIG. 1 depicts a particular set of layers
- additional or alternative layers can be used in some examples without departing from the scope of the present disclosure.
- one or more additional logical layers such as a business term layer and/or a business term group layer
- the dataset characteristics layer may be combined with the dataset layer (e.g., by including the characteristics specified in the dataset characteristics layer in the dataset nodes in the dataset layer).
- the metadata model 100 depicts a particular number of nodes in each layer, additional (or fewer) nodes can be included in some examples without departing from the scope of the present disclosure.
- system 200 for generating and applying metadata controls for data processing.
- system 200 includes a data processing system 202 having a metadata control engine 204, which in turn includes a guided expression editor 206, a control generator 208, and a control identifier 210.
- Data processing system 202 also includes an execution engine 212.
- system 200 also includes metadata repository 214 that stores, among other things, a metadata model 216 (which may be the same as or similar to metadata model 100 shown in FIG. 1).
- Guided expression editor 206 is configured to interact with a development environment 218 to provide a user interface that guides a user of the development environment 218 in generating, testing, and approving a control in an intuitive (e.g., no-code) manner.
- the guided expression editor 206 interacts with the metadata repository 214 (e.g., the metadata model 216) to identify valid parameters and operators that can be used in creating a control based on the current control state. Additional details regarding the guided expression editor 206 are described below with reference to FIG. 3B.
- the guided expression editor 206 transmits information about the control to control generator 208.
- Control generator 208 is configured to incorporate the control into metadata model 216 by, for example, adding a node to the metadata model 216 that specifies the control, and adding edges to link the control to other nodes (e.g., BDE nodes).
- a client device 220 (which may be the same as or different from the development environment 218) transmits data processing instructions to execution engine 212.
- the client device 220 can transmit a specification to the execution engine 212 that includes instructions for generating and/or executing a computer program (e.g., a dataflow graph) to perform operations on data.
- the execution engine 212 communicates with the control identifier 210, which is configured to traverse nodes and edges of the metadata model as described herein to identify controls and dataset characteristics that are applicable to the data to be processed. This information is passed to the execution engine 212, which generates an executable computer program that implements the controls based in part on the dataset characteristics.
- the execution engine 212 then executes the computer program on data retrieved from one or more storage systems 222a, . . . , 222n, and stores the governed output data in a storage system 224.
- system 200’ is shown, which is a version of system 200.
- System 300 is shown for generating and applying metadata controls for data processing.
- System 300 is a version of system 200, and some of the reference numbers in FIG. 3A are as described previously with reference to FIG. 2A.
- Metadata model 302 includes a node 304a specifying metadata for a dataset “Cust_Contr,” and a node 304b specifying metadata for a dataset “Service_Agrmt.”
- nodes 304a, 304b can specify a name of the respective dataset.
- Metadata model 302 also includes nodes 306a, . .
- nodes 306a, 306b, and 306c specify names of TDEs “st dt,” “en dt,” and “cid” that are part of the dataset “Cust Contr,” and nodes 306d, 306e, 306f specify names of TDEs “uid,” “fromdt,” and “todt” that are part of the dataset “Service Agrmt.”
- each of nodes 306a, 306b, 306c is connected to node 304a via an edge.
- each of nodes 306d, 306e, 306f is connected to node 304b via an edge.
- each of TDE nodes 306a, 306b, and 306c may include a reference or a pointer to a memory location of dataset node 304a, and/or dataset node 304a may include a reference or a pointer to a memory location of each of TDE nodes 306a, 306b, 306c.
- each of TDE nodes 306a, 306b, 306c may include a reference to a unique identifier of dataset node 304a, and/or dataset node 304a may include a reference to a unique identifier of each of TDE nodes 306a, 306b, 306c. Similar techniques can be used to implement the edges between other nodes (e.g., TDE nodes 306d, 306e, 306f and dataset node 304b, among others).
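- By way of a non-limiting illustration, the identifier-based edges described above can be sketched as follows. The class names `Node` and `MetadataModel`, the `kind` tag, and the choice to store the reference on both ends of an edge are assumptions made for this sketch and are not taken from the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the metadata model (dataset, TDE, BDE, ...)."""
    node_id: str   # unique identifier, e.g. "306a"
    kind: str      # hypothetical layer tag: "dataset", "tde", "bde"
    name: str
    edges: set = field(default_factory=set)  # identifiers of linked nodes


class MetadataModel:
    """Hypothetical container that resolves edges through unique identifiers."""

    def __init__(self):
        self.nodes = {}

    def add_node(self, node_id, kind, name):
        self.nodes[node_id] = Node(node_id, kind, name)

    def add_edge(self, a, b):
        # Store the reference on both ends, one reading of the "and/or" wording.
        self.nodes[a].edges.add(b)
        self.nodes[b].edges.add(a)

    def neighbors(self, node_id, kind=None):
        # Follow each stored identifier to the node it refers to.
        linked = (self.nodes[n] for n in sorted(self.nodes[node_id].edges))
        return [n for n in linked if kind is None or n.kind == kind]


model = MetadataModel()
model.add_node("304a", "dataset", "Cust_Contr")
for node_id, name in [("306a", "st_dt"), ("306b", "en_dt"), ("306c", "cid")]:
    model.add_node(node_id, "tde", name)
    model.add_edge("304a", node_id)

tde_names = [n.name for n in model.neighbors("304a", kind="tde")]
```

- In this sketch, following the edges of dataset node 304a amounts to a lookup of each stored identifier; a pointer-based variant would hold object references in `edges` instead of identifier strings.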
- Metadata model 302 also includes a node 310a specifying metadata for a BDE “Contract Start Date,” and a node 310b specifying metadata for a BDE “Contract End Date.”
- nodes 310a, 310b can specify a name of the respective BDE, among other attributes (e.g., a description).
- BDE node 310a (“Contract Start Date”) describes or gives meaning to TDE node 306a (“st_dt”) and TDE node 306e (“fromdt”)
- BDE node 310b (“Contract End Date”) describes or gives meaning to TDE node 306b (“en dt”) and TDE node 306f (“todt”).
- Such a relationship can be determined by, for example, performing semantic discovery.
- data associated with TDE nodes 306a, . . . , 306f can be analyzed by a data processing system to generate a data profile for each TDE.
- the data profile can include information representing statistical attributes for data values of the TDE, such as a minimum length of the data values of the TDE, a maximum length of the data values of the TDE, a most common data value of the TDE, a least common data value of the TDE, a maximum data value of the TDE, and/or a minimum data value of the TDE, among others.
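- As a minimal sketch of such a data profile, the statistical attributes listed above can be computed for the values of a single TDE as shown below. The function name `profile_tde`, the dictionary keys, and the sample values are hypothetical and are used only for illustration.

```python
from collections import Counter


def profile_tde(values):
    """Compute a few of the statistical attributes named above for one TDE."""
    texts = [str(v) for v in values]
    counts = Counter(texts).most_common()  # sorted from most to least frequent
    return {
        "min_length": min(len(t) for t in texts),
        "max_length": max(len(t) for t in texts),
        "most_common": counts[0][0],
        "least_common": counts[-1][0],
        "min_value": min(values),
        "max_value": max(values),
    }


# Hypothetical values for a date-like field, written so they sort lexically.
sample = ["2021-01-05", "2021-01-05", "2022-02-02", "2020-11-30"]
profile = profile_tde(sample)
```

- A profile of this kind (for example, a fixed length of ten characters and a date-like value range) is the sort of evidence that classification tests could weigh when associating a TDE with a BDE.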
- the data profiles can then be processed to discover, classify, and associate each TDE with a BDE having a term or label representing the semantic meaning of the TDE.
- a plurality of classification tests e.g., a pattern analysis, a business term analysis, a fingerprint analysis, and a keyword search, among others
- the metadata model 302 can be updated to include an edge that links the BDE and the TDE.
- nodes 308a, 308b specifying characteristics of datasets and their TDEs.
- node 308a specifies characteristics of the dataset “Cust Contr” corresponding to node 304a and its TDEs corresponding to nodes 306a-306c.
- node 308a can specify that the “cid” field serves as a primary key for dataset “Cust_Contr” and the “st_dt” and “en_dt” fields within the dataset “Cust_Contr.” In other words, node 308a specifies that values of the “cid” field uniquely identify records containing values for the fields (e.g., “st_dt” and “en_dt”) within the dataset “Cust_Contr.” Node 308a can also specify, for example, that values of the “st_dt” field are of a date data type, and that values of the “en_dt” field are of a date data type.
- node 308a specifies a definition of an expanded view dataset that includes “Cust_Contr” (as the base dataset) and “Service_Agrmt” (as a related dataset).
- node 308a can specify that the “Cust Contr” and “Service Agrmt” datasets are related to one another according to a primary-foreign key relationship, with the “cid” field in “Cust_Contr” serving as the primary key, and the “uid” field in “Service_Agrmt” serving as the foreign key.
- Node 308a can also include logic for joining “Cust_Contr” and “Service Agrmt” based on the primary-foreign key relationship (e.g., by joining records of “Cust_Contr” and “Service_Agrmt” where values of “cid” match “uid”).
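- For illustration only, the joining logic described above can be sketched as a key-matching join that produces wide records. The function name `expand_view`, the inner-join behavior (base records without a match are dropped), and the sample records are assumptions of this sketch; the disclosure does not fix a particular join type.

```python
def expand_view(base_rows, related_rows, base_key, related_key):
    """Join base records to related records where base_key equals related_key."""
    by_key = {}
    for row in related_rows:
        by_key.setdefault(row[related_key], []).append(row)
    wide = []
    for base in base_rows:
        # Inner join (an assumption): unmatched base records produce no output.
        for related in by_key.get(base[base_key], []):
            wide.append({**base, **related})
    return wide


# Hypothetical sample records for the two example datasets.
cust_contr = [
    {"cid": 2001, "st_dt": "2021-01-05", "en_dt": "2022-01-05"},
    {"cid": 2002, "st_dt": "2022-02-02", "en_dt": "2022-02-02"},
]
service_agrmt = [
    {"uid": 2001, "fromdt": "1/5/2021", "todt": "1/5/2022"},
]

wide_records = expand_view(cust_contr, service_agrmt, "cid", "uid")
```

- Each resulting wide record carries the fields of both datasets, so a control can be evaluated against fields that originate in either one.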
- node 308a can also include characteristics for datasets related to “Cust_Contr” (e.g., “Service_Agrmt”) and their TDEs.
- Metadata model 302 further includes node 308b specifying characteristics of the dataset “Service Agrmt” corresponding to node 304b and its TDEs corresponding to nodes 306d-306f.
- node 308b can specify that the “uid” field serves as a primary key for dataset “Service Agrmt” and the “fromdt” and “todt” fields within the dataset “Service Agrmt.”
- node 308b specifies that values of the “uid” field uniquely identify records containing values for the fields (e.g., “fromdt” and “todt”) within the dataset “Service_Agrmt.”
- Node 308b can also specify that values of the “fromdt” field are of a string data type, and that values of the “todt” field are of a string data type.
- node 308b specifies a definition of an expanded view dataset that includes “Service_Agrmt” (as the base dataset) and “Cust_Contr” (as a related dataset).
- node 308b can specify that the “Service_Agrmt” and “Cust_Contr” datasets are related to one another according to a primary-foreign key relationship, with the “uid” field in “Service_Agrmt” serving as the primary key, and the “cid” field in “Cust_Contr” serving as the foreign key.
- Node 308b can also include logic for joining “Service_Agrmt” and “Cust_Contr” based on the primary-foreign key relationship (e.g., by joining records of “Service_Agrmt” and “Cust_Contr” where values of “uid” match “cid”).
- node 308b can also include characteristics for datasets related to “Service Agrmt” (e.g., “Cust Contr”) and their TDEs, such as the characteristics described above. Referring to FIG. 3B, an example of generating a metadata control for data processing is shown.
- guided expression editor 206 retrieves from metadata repository 214 a list of BDEs 350 that can be used (e.g., as source values) in generating a control. For example, guided expression editor 206 can query the metadata model stored in the metadata repository 214 for all BDE nodes to retrieve the list of BDEs 350. In this example, the list of BDEs 350 includes “Contract Start Date” and “Contract End Date.”
- guided expression editor 206 generates user interface (UI) data 352 based in part on the BDEs 350 and transmits the UI data 352 to development environment 218.
- Development environment 218 uses the UI data 352 to render a graphical user interface (GUI) 354 that enables a user to select one or more items of metadata (e.g., a BDE) and guides a user through a series of selections to specify one or more conditions, rules, or other logic with respect to the selected item(s) of metadata, thereby generating the metadata control.
- GUI 354 includes a first portion 354a that enables a user to select a BDE from the list of BDEs 350 to be used as a source value or parameter in the control.
- the BDE “Contract Start Date” is selected (at T3) as the source value, as shown by the checkmark in portion 354a.
- GUI 354 also includes a second portion 354b that enables a user to select an operator from a set of operators to be applied to the source value.
- the set of operators presented in portion 354b is selected based on the particular source value selected in portion 354a.
- the operator “is less than” is selected (at T3), as shown by the checkmark in portion 354b.
- development environment 218 transmits selection data 356 specifying the selected source value (e.g., “Contract Start Date”) and the selected operator (e.g., “is less than”) to the guided expression editor 206.
- guided expression editor 206 determines, based on the selection data 356, one or more additional values (e.g., BDEs) and/or operators, if any, that can be used in generating the control.
- guided expression editor 206 can query the metadata repository 214 (e.g., the metadata model) for additional values and/or operators based on the selection data 356.
- guided expression editor 206 generates additional UI data 358 based in part on the determined additional values and/or operators and transmits the UI data 358 to development environment 218.
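- One possible way to determine the valid next choices from the current control state is sketched below. The operator table, the logical data types assigned to each BDE, the BDE “Customer Name,” and the function name `next_choices` are all hypothetical; the disclosure states only that the offered operators and values depend on the selections made so far.

```python
# Hypothetical operator table keyed by the data type of the selected value.
OPERATORS_BY_TYPE = {
    "date": ["is less than", "is greater than", "is equal to", "is not null"],
    "string": ["is equal to", "matches pattern", "is not null"],
}

# Hypothetical catalog of BDEs and their logical data types.
BDE_TYPES = {
    "Contract Start Date": "date",
    "Contract End Date": "date",
    "Customer Name": "string",  # invented BDE, present only to show filtering
}

# Operators that take no second operand (an assumption of this sketch).
UNARY_OPERATORS = {"is not null"}


def next_choices(source=None, operator=None):
    """Return the valid choices for the next step of the control definition."""
    if source is None:
        return sorted(BDE_TYPES)                      # step 1: pick a source BDE
    if operator is None:
        return OPERATORS_BY_TYPE[BDE_TYPES[source]]   # step 2: pick an operator
    if operator in UNARY_OPERATORS:
        return []                                     # nothing further to select
    # step 3: offer only other BDEs whose type is comparable with the source
    return [b for b in sorted(BDE_TYPES)
            if b != source and BDE_TYPES[b] == BDE_TYPES[source]]
```

- Because only valid choices are ever returned, a control such as “Contract Start Date is less than Contract End Date” can be assembled without the user typing an expression.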
- Development environment 218 uses the UI data 358 to render an updated GUI 354’.
- Updated GUI 354’ includes a third portion 354c that enables a user to select a second BDE to be used in the control.
- the BDE “Contract End Date” is selected (at T6), as shown by the checkmark in portion 354c.
- a user interface element 360 can be selected to approve the control definition and transmit (at T7) additional selection data 362 specifying the selections.
- the control definition can be tested against data before approval to determine whether the control is working as intended.
- guided expression editor 206 guides a user in defining a control at a logical level (BDE level) without the need for the user to understand or access the underlying data, and without requiring the user to write code (e.g., by presenting valid choices for defining the control, rather than requiring the user to write or edit the control’s underlying code), thereby avoiding syntax errors.
- guided expression editor 206 transmits control data 364 specifying the control definition (e.g., “Contract Start Date is less than Contract End Date,” according to selections 356, 362) to control generator 208.
- control generator 208 generates control data 366 (which may be the same or different from control data 364) that includes instructions to add the control to metadata model 302’.
- control data 366 can include instructions to add node 368 to metadata model 302’ representing the control “Contract Start Date < Contract End Date.”
- Control data 366 can also include instructions to add edges (e.g., references, pointers, etc.) linking node 368 to nodes 310a, 310b representing the BDEs “Contract Start Date” and “Contract End Date.”
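- As a hedged sketch of these instructions, a control can be incorporated by adding one node and one edge per referenced BDE. The dictionary layout of the model and the function name `add_control` are assumptions made for illustration.

```python
def add_control(model, control_id, expression, bde_ids):
    """Add a control node and link it to the BDE nodes it references."""
    missing = [b for b in bde_ids if b not in model["nodes"]]
    if missing:
        raise KeyError(f"unknown BDE nodes: {missing}")
    model["nodes"][control_id] = {"kind": "control", "expression": expression}
    for bde_id in bde_ids:
        # An edge is stored as an unordered pair of node identifiers.
        model["edges"].add(frozenset((control_id, bde_id)))
    return model


# Hypothetical in-memory form of the BDE layer of the example model.
model = {
    "nodes": {
        "310a": {"kind": "bde", "name": "Contract Start Date"},
        "310b": {"kind": "bde", "name": "Contract End Date"},
    },
    "edges": set(),
}

add_control(model, "368", "Contract Start Date < Contract End Date",
            ["310a", "310b"])
```

- The control is linked only to BDE nodes; no dataset or field name appears in it, which is what allows the same control to reach every TDE associated with those BDEs.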
- client device 220 transmits a specification 400 to execution engine 212.
- the specification 400 can be transmitted in response to user input, at (pre-) determined times, or in response to various triggering events, such as changes to the metadata model 302’.
- the specification 400 includes instructions for generating and/or executing a computer program (e.g., a dataflow graph) to perform operations on data.
- the specification 400 can include instructions to access data from one or more source systems, optionally transform the data, and store the (transformed) data in one or more destination systems.
- the specification 400 includes instructions to access the “Cust_Contr” dataset from storage system 222a and the “Service_Agrmt” dataset from storage system 222n, and store governed (e.g., cleansed and conformed) versions of these datasets in storage system 224 (e.g., as “Cust_Contr_Cleansed” and “Service_Agrmt_Cleansed”).
- the specification 400 can be a pipeline object that includes a data object or other data structure specifying actions to be performed in ingesting data, such as described in U.S. Application No. 18/496,543, titled “Metadata Driven Data Ingestion and Data Processing,” the entire content of which is incorporated herein by reference.
- Upon receipt of the specification 400, execution engine 212 transmits the specification 400 to control identifier 210 with a request for applicable controls.
- the request for applicable controls can also include a request for characteristics of the data to which the controls are to be applied.
- control identifier 210 identifies the items of data that are to be processed in accordance with the specification. For example, control identifier 210 can parse the specification 400 to extract technical metadata (e.g., dataset names, field names, etc.) representing the items of data that are to be accessed or otherwise processed in accordance with the specification. Control identifier 210 can then send a query 402 to metadata repository 214 for controls and characteristics associated with the extracted technical metadata.
- the query 402 can include a request for controls associated with the “Cust Contr” and “Service Agrmt” datasets.
- the query 402 can include a request for controls associated with the “cid,” “st_dt,” and “en_dt” fields of the “Cust_Contr” dataset, and with the “uid,” “fromdt,” and “todt” fields of the “Service_Agrmt” dataset.
- the metadata model 302’ is traversed to determine the controls and characteristics that are applicable to the items of data to be processed in accordance with the specification 400.
- a data processing system e.g., the data processing system 202 or another data processing system associated with the metadata repository 214.
- accessing the node 304a can include, for example, accessing from hardware storage a data object or data structure that the node represents.
- the edges associated with dataset node 304a can be followed to identify related nodes, such as the dataset characteristics node 308a.
- dataset node 304a may include references to dataset characteristics node 308a, such as by including a unique identifier for dataset characteristics node 308a.
- following the edges can include identifying and accessing the dataset characteristics node 308a associated with the respective reference (e.g., unique identifier).
- dataset node 304a can include pointers to memory locations (e.g., memory addresses) for dataset characteristics node 308a, and following the edges can include accessing the dataset characteristics node 308a at the specified memory location.
- the metadata stored in the dataset characteristics node 308a can be read to obtain the characteristics for the corresponding dataset (e.g., “Cust Contr”) and its TDEs.
- the dataset characteristics node 308a can specify a definition (or specify characteristics used to create a definition) of an expanded view dataset that includes “Cust_Contr” (as the base dataset) and “Service_Agrmt” (as a related dataset).
- the dataset characteristics node 308a can include instructions for creating an expanded view dataset (e.g., a wide record) by joining “Cust_Contr” and “Service_Agrmt” using the keys “cid” and “uid.”
- the dataset characteristics node 308a provides logical access to characteristics, such as data types and intra- and inter-dataset relationships, for the “Cust_Contr” dataset, its related dataset(s) (e.g., “Service_Agrmt”), and their TDEs.
- edges associated with dataset node 304a can be followed to identify the linked TDE nodes 306a, 306b, and 306c.
- dataset node 304a (or a separate edge data structure or object referenced by dataset node 304a) may include references to TDE nodes 306a, 306b, and 306c, such as by including unique identifiers for each of TDE nodes 306a, 306b, 306c.
- following the edges can include identifying and accessing the TDE nodes 306a, 306b, and 306c associated with the respective references (e.g., unique identifiers).
- dataset node 304a can include pointers to memory locations (e.g., memory addresses) for each of TDE nodes 306a, 306b, 306c, and following the edges can include accessing the TDE nodes 306a, 306b, 306c at the specified memory locations.
- Similar processes can be followed to traverse other nodes in the metadata model 302’ and identify the applicable controls and characteristics.
- the edge associated with TDE node 306a can be followed to identify and access BDE node 310a (e.g., “Contract Start Date”). From here, the edge associated with the BDE node 310a is followed to control node 368.
- the data processing system determines that the associated control (e.g., “Contract Start Date < Contract End Date”) is applicable to the items of data to be processed in accordance with the specification 400.
- This control is also identified through traversal of the “Cust_Contr”-“en_dt”-“Contract End Date,” “Service_Agrmt”-“fromdt”-“Contract Start Date,” and “Service_Agrmt”-“todt”-“Contract End Date” paths of metadata model 302’, as shown by the bolded lines with arrows.
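- The traversal described above can be sketched as a walk along dataset, TDE, BDE, and control edges. In the following sketch, the edge table mirrors the example model, while the function name `applicable_controls`, the directed representation of edges, and the shape of the returned mapping are assumptions made for illustration.

```python
# Edges of the example model, stored as references by unique identifier and
# directed from the physical level toward the control (a simplification).
EDGES = {
    "304a": ["306a", "306b", "306c"],    # Cust_Contr -> st_dt, en_dt, cid
    "304b": ["306d", "306e", "306f"],    # Service_Agrmt -> uid, fromdt, todt
    "306a": ["310a"], "306e": ["310a"],  # TDEs described by Contract Start Date
    "306b": ["310b"], "306f": ["310b"],  # TDEs described by Contract End Date
    "310a": ["368"], "310b": ["368"],    # BDEs referenced by control node 368
}


def applicable_controls(dataset_ids):
    """Follow dataset -> TDE -> BDE -> control edges.

    Returns a mapping from each control reached to the TDEs it applies to.
    """
    found = {}
    for dataset_id in dataset_ids:
        for tde_id in EDGES.get(dataset_id, []):
            for bde_id in EDGES.get(tde_id, []):
                for control_id in EDGES.get(bde_id, []):
                    found.setdefault(control_id, []).append(tde_id)
    return found


controls = applicable_controls(["304a", "304b"])
```

- In this sketch, the key fields “cid” (306c) and “uid” (306d) have no BDE edge and therefore do not reach the control, whereas the four date fields reach control node 368 along the four paths described above.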
- controls data 404 specifying that the control “Contract Start Date < Contract End Date” is to be applied to TDEs “st_dt,” “en_dt,” “fromdt,” and “todt,” and dataset characteristics 406 including the characteristics (e.g., data types, relationships, etc.) of, for example, “st_dt,” “en_dt,” “fromdt,” and “todt” are returned to control identifier 210 in response to the query 402.
- control identifier 210 transmits the controls data 404 and dataset characteristics 406 to execution engine 212.
- query 402 can be an entity query, such as described in U.S. Patent No. 11,921,710, titled “Systems and methods for accessing data entities managed by a data processing system,” the entire content of which is incorporated herein by reference.
- execution of the entity query 402 can effectively traverse the metadata model 302’ to identify the relevant controls and characteristics such that the results of query 402 include the controls data 404 and dataset characteristics 406.
- Using the specification 400, the controls data 404, and the dataset characteristics 406, the execution engine 212 generates instructions 408.
- the execution engine 212 can include a code generator 212a that uses a plurality of stored modules (e.g., dataflow graph components or other software components) to transform the specification 400, the controls data 404, and/or the dataset characteristics 406 into the instructions 408.
- the code generator 212a of the execution engine 212 can then generate instructions 408 to implement the control specified in control data 404 (e.g., “Contract Start Date < Contract End Date”) on the accessed data.
- code generator 212a generates instructions 408 to compare values of “st_dt” with values of “en_dt” on “cid” (as opposed to, e.g., comparing “st_dt” with “todt,” which also represents “Contract End Date”).
- code generator 212a generates instructions 408 to compare “st_dt” and “en_dt” using the less than operator without further transformation (e.g., without casting the data).
- the control is specified as a preventative control in which data that does not satisfy the criteria or condition “Contract Start Date < Contract End Date” is rejected.
- code generator 212a generates instructions 408 to reject any records having a value in the “st_dt” field that is not less than a corresponding value in the “en_dt” field in the “Cust_Contr” dataset.
- code generator 212a generates instructions 408 to compare values of “fromdt” with values of “todt” on “uid” (as opposed to, e.g., comparing “fromdt” with “en_dt,” which also represents “Contract End Date”).
- code generator 212a determines to transform values of “fromdt” and “todt” to, e.g., date data types before comparison using the less than operator, as comparing strings with the less than operator may produce unintended results. Accordingly, code generator 212a generates instructions 408 to cast “fromdt” and “todt” as dates, and then reject any records having a value in the “fromdt” field that is not less than a corresponding value in the “todt” field in the “Service_Agrmt” dataset.
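- A minimal sketch of type-aware resolution of the control follows. It assumes that string dates use a month/day/year layout (suggested by the example value 2/2/2022, but otherwise an assumption); the function names `to_date`, `build_check`, and `apply_preventative`, and the sample records, are hypothetical.

```python
from datetime import date, datetime


def to_date(value, fmt="%m/%d/%Y"):
    """Cast a value to a date; the M/D/YYYY string layout is an assumption."""
    if isinstance(value, date):
        return value
    return datetime.strptime(value, fmt).date()


def build_check(left, right, left_type, right_type):
    """Return a predicate for 'left < right' that casts only string fields."""
    cast_left = to_date if left_type == "string" else (lambda v: v)
    cast_right = to_date if right_type == "string" else (lambda v: v)
    return lambda rec: cast_left(rec[left]) < cast_right(rec[right])


def apply_preventative(records, check):
    """Keep records that satisfy the control and reject the rest."""
    passed = [r for r in records if check(r)]
    rejected = [r for r in records if not check(r)]
    return passed, rejected


# Hypothetical Service_Agrmt records whose date fields are stored as strings.
service_agrmt = [
    {"uid": 1, "fromdt": "9/1/2021", "todt": "10/1/2021"},
    {"uid": 2, "fromdt": "2/2/2022", "todt": "2/2/2022"},
]

check = build_check("fromdt", "todt", "string", "string")
passed, rejected = apply_preventative(service_agrmt, check)

# Comparing the raw strings would wrongly fail the first record ("9" > "1").
naive_result = "9/1/2021" < "10/1/2021"
```

- In this sketch the first record passes once both fields are cast to dates, although a plain string comparison would have failed it, and the second record is rejected because its two dates are equal.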
- Code generator 212a also generates instructions 408 to write or store the datasets governed by the control as “Cust Contr Cleansed” and “Service Agrmt Cleansed.” To do so, code generator 212a may store one or more modules (e.g., dataflow graph components or other software components) specifying instructions to write data, and may supplement these instructions based on the specification 400 to write each of the generated “Cust Contr Cleansed” and “Service Agrmt Cleansed” datasets to a specified storage system. Additional details on operations performed by execution engine 212 in generating the instructions are described in U.S. Patent No. 11,423,083, titled “Transforming a Specification into a Persistent Computer Program,” the entire content of which is incorporated herein by reference.
- a compiler 212b of the execution engine 212 can transform (e.g., compile) the instructions 408 into executable instructions, such as an executable computer program (e.g., an executable dataflow graph).
- an interpreter can be used instead of or in addition to the compiler 212b.
- execution engine 212 executes the executable instructions (e.g., computer program) described with reference to FIG. 4A in order to ingest the “Cust_Contr” dataset 452a from storage system 222a and the “Service_Agrmt” dataset 452b from storage system 222n, process the datasets in accordance with the control, and store the resultant “Cust_Contr_Cleansed” dataset 452a and the “Service_Agrmt_Cleansed” dataset 452b to storage system 224. As shown in visualization 454, execution engine 212 first reads the “Cust_Contr” and “Service_Agrmt” datasets.
- execution engine 212 checks whether “st_dt” is less than “en_dt” for each record in the “Cust_Contr” dataset, and whether “fromdt” (cast as a date) is less than “todt” (cast as a date) for each record in the “Service_Agrmt” dataset.
- the record associated with “cid” 2002 in the “Cust_Contr” dataset has failed the control, because the value of “st_dt” (2/2/2022) is not less than the value for “en_dt” (also 2/2/2022).
- the failed record is rejected (e.g., removed) from the “Cust Contr Cleansed” dataset, though other actions can be taken in some examples.
- the “Cust_Contr_Cleansed” dataset 452a and the “Service_Agrmt_Cleansed” dataset 452b are stored in the storage system 224. In this manner, a single control defined at a logical level in the metadata model is automatically applied to multiple datasets from disparate sources and having different characteristics.
- Execution engine 212 provides metadata 456 resulting from the execution to metadata repository 214 for storage.
- the metadata 456 can be stored in or otherwise associated with the corresponding control node 368 in the metadata model 302”.
- the metadata 456 specifies that three records passed the control while one record failed, and further specifies the reason for the failure. This information can be displayed to a user to enable the user to understand the results of executing the control and identify records having data quality (or other) issues. In addition, this information can be used as the basis for further controls.
- control 368 (or another control linked to the control 368) can specify rules or logic that are conditioned upon the metadata 456 resulting from execution, such as a rule to send an alert to a designated user and/or cease execution of the control 368 in response to detecting a specified number of failed records.
- metadata resulting from execution of a control is collected over time to derive statistics about execution of the control (e.g., total number of failed records, average percent of failed records, etc.). This cumulative or statistical information can be used as the basis for further controls, such as the anomaly detection controls described herein.
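- By way of example only, the execution metadata and the statistics derived from it can be sketched as follows. The function names, the dictionary keys, and all record keys other than 2002 are hypothetical; the threshold rule is one possible form of a control conditioned on execution results.

```python
def summarize_run(results):
    """Summarize one execution; `results` maps record key -> True if passed."""
    failed = [key for key, ok in results.items() if not ok]
    return {
        "passed": len(results) - len(failed),
        "failed": len(failed),
        "failed_keys": failed,
    }


def should_alert(summary, max_failed):
    """Rule conditioned on execution metadata: alert at or above a threshold."""
    return summary["failed"] >= max_failed


def average_failure_rate(history):
    """Fraction of failed records across all collected run summaries."""
    total = sum(s["passed"] + s["failed"] for s in history)
    return sum(s["failed"] for s in history) / total if total else 0.0


# Three records pass and one fails, as in the example above; key 2002 is the
# failed record and the remaining keys are invented for illustration.
run = summarize_run({2001: True, 2002: False, 2003: True, 2004: True})
```

- Summaries collected across runs can be passed to `average_failure_rate` to obtain a baseline against which a later run might be flagged as anomalous.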
- Execution engine 212 also provides metadata 458 for the new datasets “Cust_Contr_Cleansed” and “Service_Agrmt_Cleansed” to metadata repository 214.
- metadata model 302 is updated to include nodes representing the new datasets, their fields, and their characteristics, as well as edges linking the nodes, as shown by the bolded portions of metadata model 302”.
- the new nodes representing TDEs of the new datasets are linked (e.g., via edges) to existing BDEs that represent the semantic meaning of the TDEs.
- interface 460 visualizing the results of applying metadata controls.
- interface 460 includes a metadata model portion 462 that visualizes a version of the metadata model (e.g., the metadata model 302”).
- a user can interact with the metadata model portion 462 to select one or more nodes of the metadata model in order to view further details about the execution of controls associated with the selected node(s), among other information.
- a user selects control node 368.
- results of execution of the control node 368 are shown in a control execution results portion 464 of the interface 460.
- control execution results portion 464 includes an execution results summary that provides information about the execution of control 368, such as the time of execution, the number of records that passed the control, the number of records that failed the control, and a reason for failure for applicable records.
- Control execution results portion 464 can also include a results table 466 that enables a user to view the results of executing the control 368 on a record-by-record basis.
- the results table 466 identifies (e.g., through highlighting or another indicator) the records that failed the control 368.
- a new dataset “Sales Contr” 500 is added to storage system 222n, and metadata 502 for the new dataset (which can be discovered as described herein) is provided to the metadata repository 214.
- the metadata model 302”’ is updated to incorporate the new dataset, as shown by the bolded portions of metadata model 302”’.
- metadata model 302’” can be updated with a dataset node 304c representing the “Sales Contr” dataset, and TDE nodes 306g, 306h, 306i representing the “tid,” “end,” and “start” fields of the “Sales_Contr” dataset, respectively.
- the metadata model 302”’ is also updated with edges to link dataset node 304c and TDE nodes 306g, 306h, 306i to indicate that the “tid,” “end,” and “start” are TDEs (e.g., fields) of the “Sales_Contr” dataset.
- Metadata model 302”’ is also updated with node 308c specifying characteristics of the dataset “Sales_Contr” corresponding to node 304c and its TDEs corresponding to nodes 306g-306i.
- node 308c can specify that the “tid” field serves as a primary key for dataset “Sales_Contr” and the “end” and “start” fields within the dataset “Sales Contr.”
- node 308c specifies that values of the “tid” field uniquely identify records containing values for the fields (e.g., “end” and “start”) within the dataset “Sales_Contr.”
- node 308c specifies a definition of an expanded view dataset that includes “Sales_Contr” (as the base dataset) and “Cust_Contr” and “Service_Agrmt” (as related datasets).
- node 308c can specify that the “Sales Contr” and “Cust Contr” datasets are related to one another according to a primary-foreign key relationship, with the “tid” field in “Sales_Contr” serving as the primary key and the “cid” field in “Cust_Contr” serving as the foreign key.
- node 308c can specify that the “Sales_Contr” and “Service_Agrmt” datasets are related to one another according to a primary-foreign key relationship, with the “tid” field in “Sales_Contr” serving as the primary key and the “uid” field in “Service_Agrmt” serving as the foreign key.
- Node 308c can also include logic for joining “Sales_Contr,” “Cust_Contr,” and “Service_Agrmt” based on the primary-foreign key relationships.
- node 308c can also include characteristics for datasets related to “Sales Contr” (e.g., “Cust Contr” and “Service Agrmt”) and their TDEs.
- Metadata model 302”’ can also be updated to include nodes 308a’ and 308b’ that include the relationship between “Sales Contr” and each of “Cust_Contr” and “Service_Agrmt.”
- metadata model 302”’ is also updated to include edges linking the TDE nodes 306g, 306h, and 306i representing items of technical metadata for the new dataset 500 to BDE nodes 310a, 310b that specify a semantic meaning for the TDEs.
- semantic discovery processes can be performed as described herein to determine the semantic meaning of fields corresponding to TDE nodes 306g, 306h, and 306i.
- it is determined via semantic discovery that the “start” field of the “Sales Contr” dataset represents a “Contract Start Date.”
- an edge is added to metadata model 302’” to link the TDE node 306i with BDE node 310a through the characteristics node 308i .
- Metadata model 600 includes a node 604a specifying metadata for a dataset “Cust Contr Short” (similar to “Cust Contr” represented in metadata model 302) and a node 604b specifying metadata for a dataset “Service Agrmt Short” (similar to “Service Agrmt” represented in metadata model 302).
- Metadata model 600 includes nodes 606a, 606b specifying metadata (e.g., field names) for TDEs “st dt” and “cid” that are part of dataset “Cust Contr Short,” and nodes 606c, 606d specifying metadata for TDEs “uid” and “todt” that are part of dataset “Service Agrmt Short.”
- Metadata model 600 also includes a node 610a specifying metadata for a BDE “Contract Start Date,” and a node 610b specifying metadata for a BDE “Contract End Date.”
- BDE node 610a (“Contract Start Date”) describes or gives meaning to TDE node 606a (“st dt”)
- BDE node 610b (“Contract End Date”) describes or gives meaning to TDE node 606d (“todt”).
- metadata model 600 includes nodes 608a, 608b specifying correlations among datasets and fields, among other characteristics.
- nodes 608a, 608b specify a dataset correlation between the datasets “Cust Contr Short” and “Service Agrmt Short” corresponding to nodes 604a, 604b (as well as their TDEs).
- Dataset correlation can specify, for example, a primary-foreign key relationship between datasets “Cust Contr Short” and “Service_Agrmt_Short,” such as by specifying that the “cid” field serves as a primary key for the dataset “Cust Contr Short” that relates to foreign key field “uid” in “Service_Agrmt_Short” (and vice versa, where “uid” is the primary key and “cid” is the foreign key).
- dataset correlation includes instructions for generating a wide record (or an expanded view dataset) that includes the data from a base dataset (e.g., one of “Cust Contr Short” and “Service Agrmt Short”) and related datasets (e.g., the other of “Cust Contr Short” and “Service Agrmt Short”).
- Node 608a can also specify a field correlation between TDEs “st_dt” and “cid” corresponding to nodes 606a, 606b.
- node 608a specifies that the “cid” field serves as a primary key for the “st_dt” field within the dataset “Cust Contr Short” such that values of the “cid” field uniquely identify records containing values for the “st_dt” field.
- Node 608b can specify a field correlation between TDEs “uid” and “todt” corresponding to nodes 606c, 606d.
- node 608b can specify, for example, that the “uid” field serves as a primary key for the “todt” field within the dataset “Service Agrmt Short” such that values of the “uid” field uniquely identify records containing values for the “todt” field.
- metadata model 600 can also include other characteristics for the datasets and TDEs, such as record format or data type (not shown).
- execution engine 212 receives a specification 614 that includes instructions to access the “Cust Contr Short” dataset from storage system 222a, and store a governed (e.g., cleansed and conformed) version of this dataset in storage system 224 (e.g., as “Cust_Contr_Short_Cleansed”).
- execution engine 212 transmits the specification to control identifier 210 with a request for applicable controls (not shown).
- the metadata model 600 is traversed to determine the controls and characteristics that are applicable to the items of data to be processed in accordance with the specification 614.
- a data processing system (e.g., the data processing system 202 or another data processing system associated with the metadata repository 214) can first access node 604a representing the dataset “Cust_Contr_Short.”
- the edges associated with dataset node 604a can be followed to identify and access the linked dataset characteristics node 608a to collect characteristics (e.g., field and dataset correlations) for use in applying any applicable controls. Similar processes can be followed to traverse other nodes in the metadata model 600 and identify the applicable controls. For example, the edges associated with dataset node 604a can be followed to identify and access TDE nodes 606a, 606b.
- the edge associated with TDE node 606a can be followed to identify and access BDE node 610a (e.g., “Contract Start Date”). From here, the edge associated with the BDE node 610a is followed to control node 612.
- the data processing system determines that the associated control (e.g., “Contract Start Date < Contract End Date”) is applicable to the items of data to be processed in accordance with the specification.
- the data processing system also determines that a “Contract End Date” that corresponds to the identified “Contract Start Date” (e.g., “st dt” of “Cust Contr Short”) is needed to evaluate the control.
- the data processing system determines, based on the characteristics obtained from dataset characteristics node 608a, that “todt” of “Service_Agrmt_Short” corresponds to (e.g., correlates to) “st_dt” of “Cust_Contr_Short.” Accordingly, the data processing system traverses back down the metadata model 600 along the “Contract End Date”-“tod
- controls data 616 specifies that the control “Contract Start Date < Contract End Date” is to be applied to TDEs “st dt” of “Cust Contr Short” and “todt” of “Service Agrmt Short,” and wide record instructions 618 including the identified dataset and field correlations and instructions for generating a wide record that includes the data from
- control identifier 210 transmits the controls data 616 and the wide record instructions 618 to execution engine 212.
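The traversal described above (dataset node to TDE node to BDE node to control node) can be sketched roughly as follows. This is a minimal illustration in Python, not the patent's implementation: the `Node` class, the `link` helper, and `find_controls` are hypothetical names, and the model is reduced to the handful of nodes named in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical metadata-model node; `links` holds outgoing edges."""
    kind: str  # "dataset", "tde", "bde", or "control"
    name: str
    links: list = field(default_factory=list)


def link(src, dst):
    """Add a directed edge from src to dst."""
    src.links.append(dst)


def find_controls(dataset):
    """Walk dataset -> TDE -> BDE -> control; return (control, TDE) name pairs."""
    found = []
    for tde in (n for n in dataset.links if n.kind == "tde"):
        for bde in (n for n in tde.links if n.kind == "bde"):
            for ctrl in (n for n in bde.links if n.kind == "control"):
                found.append((ctrl.name, tde.name))
    return found


# Nodes loosely mirroring 604a, 606a, 606b, 610a, and 612 in the description.
cust = Node("dataset", "Cust_Contr_Short")
st_dt = Node("tde", "st_dt")
cid = Node("tde", "cid")
start = Node("bde", "Contract Start Date")
control = Node("control", "Contract Start Date < Contract End Date")

link(cust, st_dt)
link(cust, cid)
link(st_dt, start)
link(start, control)

print(find_controls(cust))
```

Because the control hangs off the BDE rather than the dataset, any other dataset whose field is linked to the same BDE would pick up the same control through the same walk.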
- execution engine 212 uses the specification 614, the controls data 616, and the wide record instructions 618 to generate executable instructions to access the “Cust Contr Short” dataset 620a, implement the control “Contract Start Date < Contract End Date”, and store a governed version of this dataset in storage system 224 (e.g., as “Cust Contr Short Cleansed”).
- execution engine 212 uses wide record instructions 618 for generating executable instructions to access both the “Cust Contr Short” dataset 620a and the “Service Agrmt Short” dataset 620b and join records of the datasets 620a, 620b where the value of the primary key “cid” matches the value of the foreign key “uid,” thereby creating a temporary wide record 622 that includes the data needed to evaluate the control.
- the bolded records of “Cust Contr Short” dataset 620a and the “Service_Agrmt_Short” dataset 620b have matching key values “2002” and “2003” that are joined to produce the wide record 622 when the instructions are executed.
- Execution engine 212 also generates instructions to write or store the dataset governed by the control as “Cust Contr Short Cleansed.” Referring to FIG. 6B, an example of applying a metadata control to multiple datasets is shown. In this example, execution engine 212 executes the executable instructions (e.g., computer program) described with reference to FIG.
- execution engine 212 first reads the “Cust Contr Short” and “Service Agrmt Short” datasets.
- execution engine 212 joins the two datasets on matching values of “cid” and “uid” to produce a temporary wide record (e.g., “Wide_record.dat”).
- Execution engine 212 checks whether “st_dt” is less than “todt” for each record in the wide record dataset.
- the record associated with “cid” value 2002 has failed the control, because the value of “st_dt” (2/2/2022) is not less than the value for “todt” (2/1/2022).
- the failed record is rejected (e.g., removed) from the wide record dataset, though other actions can be taken in some examples.
- After applying the control, execution engine 212 reformats the wide record to produce a “Cust Contr Short Cleansed” dataset having the same record format as the original “Cust Contr Short” dataset (e.g., by dropping the “todt” field from the wide record). Execution engine 212 then stores “Cust Contr Short Cleansed” in the storage system 224. In this manner, a control defined at a logical level in the metadata model is automatically applied across multiple disparate datasets.
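The read, join, check, reject, and reformat steps can be sketched as below. This is an illustrative Python approximation rather than the generated code the patent describes. The sample records are hypothetical apart from the “cid” 2002 dates quoted in the text, and `apply_start_before_end` is an assumed name.

```python
from datetime import date

# Hypothetical sample data; only the 2002 dates come from the description.
cust_contr_short = [
    {"cid": 2001, "st_dt": date(2022, 1, 5)},
    {"cid": 2002, "st_dt": date(2022, 2, 2)},
    {"cid": 2003, "st_dt": date(2022, 3, 1)},
]
service_agrmt_short = [
    {"uid": 2002, "todt": date(2022, 2, 1)},
    {"uid": 2003, "todt": date(2022, 12, 31)},
    {"uid": 2004, "todt": date(2022, 6, 30)},
]


def apply_start_before_end(base, related):
    """Join on cid == uid, keep records where st_dt < todt, then drop todt."""
    by_uid = {r["uid"]: r for r in related}
    # Temporary wide record: base fields plus the correlated "todt" value.
    wide = [
        {**b, "todt": by_uid[b["cid"]]["todt"]}
        for b in base
        if b["cid"] in by_uid
    ]
    passed = [w for w in wide if w["st_dt"] < w["todt"]]
    rejected = [w for w in wide if not (w["st_dt"] < w["todt"])]
    # Reformat back to the base record format by dropping "todt".
    cleansed = [{k: v for k, v in w.items() if k != "todt"} for w in passed]
    return cleansed, rejected


cleansed, rejected = apply_start_before_end(cust_contr_short, service_agrmt_short)
print([r["cid"] for r in cleansed])
print([r["cid"] for r in rejected])
```

Here rejection simply removes the failing record from the output; as the text notes, other actions could be taken instead.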
- Execution engine 212 provides metadata 652 resulting from the execution to metadata repository 214 for storage.
- the metadata 652 can be stored in or otherwise associated with the corresponding control node 612 in the metadata model 600’.
- the metadata 652 specifies that three records passed the control while one record failed, and further specifies the reason for the failure. This information can be displayed to a user, trigger further actions (e.g., alerts), and/or used to derive statistical insights for further controls, such as anomaly detection controls.
- Execution engine 212 also provides metadata 654 for the new dataset “Cust Contr Short Cleansed” to metadata repository 214.
- metadata model 600’ is updated to include a node 604c representing the new dataset, nodes 606e, 606f representing its fields, and node 608c representing its characteristics, as well as edges linking the nodes, as shown by the bolded portions of metadata model 600’.
- the new node 606f representing the TDE “st_dt” of the new dataset is linked (e.g., via an edge) to existing node 610a of the BDE “Contract Start Date” that represents the semantic meaning of the TDE “st_dt.”
- new datasets (e.g., “Cust Contr Short Cleansed”) will continue to be governed in accordance with the specified controls without the need to define new controls for the new datasets.
- a metadata model 700 includes an application node 702 that specifies an application or data domain (e.g., Data Application) used to group BDEs (e.g., BDEs 310a, 310b) and other metadata.
- templated control node 704 is linked (via an edge) to data application node 702, meaning that the templated control 704 applies to items of metadata grouped within the data application node 702 in the metadata model 700.
- Metadata model 700 also includes a node 706a specifying that the BDE node 310a representing “Contract Start Date” is a required field, and a node 706b specifying that the TDE node 306a corresponding to “st dt” is a required field.
- the metadata specified by nodes 706a, 706b can be incorporated into nodes 310a, 306a, respectively.
- execution engine 212 transmits a request to control identifier 210 for applicable templated (and other) controls.
- metadata model 700 is traversed to identify which templated controls to apply to data to be processed in accordance with the specification (e.g., as represented by technical metadata in the specification).
- the dataset “Cust Contr” represented by node 304a is to be processed in accordance with the specification.
- a data processing system accesses dataset node 304a representing “Cust_Contr,” and executes instructions to identify one or more other nodes that are linked to node 304a by an edge.
- the data processing system determines that dataset node 304a is linked to characteristics node 308a, and is also linked to TDE node 306a, which in turn is linked to BDE node 310a, which in turn is linked to application node 702, which in turn is linked to templated control node 704. As a result of this traversal, the data processing system determines that the templated control associated with node 704 is applicable to data associated with each of nodes 304a, 306a, 308a, 310a, and 702, among others.
- the data processing system determines that TDE node 306a is a required field from traversal of the edge between node 306a and node 706b, and that BDE node 310a is a required field from traversal of the edge between node 310a and node 706a.
- Execution engine 212 receives templated control data from control identifier 210 and generates instructions for application of the templated control to the data represented by “st dt” and “Contract Start Date.” These instructions can include, for example, executable logic to check for the presence of data in the “st dt” and “Contract Start Date” fields. Upon execution of these instructions, execution engine 212 accesses the “Cust Contr” dataset and applies the control specified by the templated control node 704 to the “st dt” field to check for the presence of a value in each record.
- execution engine 212 outputs a version of the “Cust Contr” dataset in which the templated control has been applied, as well as other executable logic that execution engine 212 has been configured to apply. This dataset is stored in storage system 224.
- templated control is also applied to the data represented by BDE node 310a (e.g., to check that a name, description, and/or other attributes of the BDE are present).
- execution engine 212 is programmed with one or more computation graphs to apply to the “Cust Contr” dataset, is programmed with executable logic to apply to the “Cust Contr” dataset, and so forth.
- the instructions generated by execution engine 212 can include executable logic representing the control specified by templated control 704, thereby enabling execution engine 212 to apply the control specified by templated control 704.
- templated control 704 includes a template portion and a control portion.
- the control portion is or includes a second condition to check that the field is populated or to check the presence of data in that field.
- templated control includes two conditions, neither of which is defined with regard to any particular item of data, any particular item of technical metadata, any particular item of logical metadata, and so forth.
- templated control 704 allows for increased efficiency during data processing. This is because a templated control only needs to be defined once - for example, at a logical level. Once the templated control is defined, then it can be applied to all kinds of data represented in the metadata model.
- templated control 704 can be applied to data received across multiple different data streams or data sources. This is because the metadata model itself can represent data from across multiple data sources and/or multiple data streams. In this manner, templated controls ensure enhanced data quality and ingestion or storage of datasets. Enhanced data quality results in more efficient data processing because the system does not need to process data that is formatted incorrectly or that has poor quality.
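A templated control of this kind can be sketched as two generic conditions, neither tied to a named field. The sketch below is a hedged illustration: it assumes the template portion selects fields flagged as required (as nodes 706a, 706b suggest) and the control portion checks that each selected field is populated. The `fields_meta` layout and the function name are hypothetical.

```python
def templated_required_control(fields_meta, records):
    """Return (record_index, field_name) for every required field left empty.

    fields_meta maps a field name to metadata such as {"required": True}.
    """
    # Template portion (assumed): select every field marked as required.
    required = [name for name, meta in fields_meta.items() if meta.get("required")]
    failures = []
    for index, record in enumerate(records):
        for name in required:
            # Control portion: the selected field must be populated.
            if record.get(name) in (None, ""):
                failures.append((index, name))
    return failures


# Hypothetical metadata and records for a "Cust_Contr"-like dataset.
fields_meta = {"st_dt": {"required": True}, "cid": {"required": False}}
records = [
    {"cid": 1001, "st_dt": "1/1/2022"},
    {"cid": 1002, "st_dt": ""},
    {"cid": None, "st_dt": "3/1/2022"},
]

print(templated_required_control(fields_meta, records))
```

Marking another field as required in `fields_meta` is enough to bring it under the control; the control itself is not redefined.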
- Referring to FIG. 7B, an example of applying a templated control to a new dataset is shown.
- a new dataset “Sales Contr” 750 is added to storage system 222n, and metadata 752 for the new dataset (which can be discovered as described herein) is provided to the metadata repository 214.
- the metadata model 700’ is updated to incorporate the new dataset, as shown by the bolded portions of metadata model 700’.
- metadata model 700’ can be updated in a similar manner as described with reference to FIG.
- the metadata model 700’ is also updated with edges to link dataset node 304c and TDE nodes 306g, 306h, 306i, 306j to indicate that “cid,” “end,” “start,” and “ssn” are TDEs (e.g., fields) of the “Sales_Contr” dataset.
- Metadata model 700’ is also updated with node 308c specifying characteristics (e.g., data types, record formats, relationships, etc.) of the “Sales Contr” dataset and its TDEs.
- Dataset characteristics nodes 308a’, 308b’ are also updated to account for the new dataset.
- metadata model 700’ is also updated to include edges linking the TDE nodes 306h, 306i representing items of technical metadata for the new dataset 750 to BDE nodes 310a, 310b that specify a semantic meaning for the TDEs (e.g., as determined via semantic discovery processes).
- templated control 704 can automatically be inherited or propagated down to new dataset 750 (represented by node 304c) and its associated TDEs (represented by nodes 306g, 306h, 306i, 306j). As such, if node 754 is added to metadata model 700’ to specify that TDE node 306j is a required field, the templated control 704 will automatically be applied to node 306j.
- templated controls promote efficient application of data quality controls, without these data quality controls having to be re-defined for each newly ingested or stored dataset.
- semantic discovery ensures that new nodes are correctly linked into the metadata model 700’. Incorrect linkages result in processing inefficiencies, as data quality controls (or rules) would then be applied to incorrect datasets. As such, the correct linkages result in increased processing efficiency through application of the data quality controls to the correct datasets.
- a data quality control is a type of templated control.
- an anomaly detection control measures criteria of data over time to detect significant changes (e.g., anomalies) that may be indicative of a data quality issue.
- Anomaly detection controls can be defined once at a logical level and automatically propagated down to the multiple data items in accordance with the techniques described herein.
- a metadata model 850 includes a control node 852 representing an anomaly detection control 802 that performs a check to identify a change in day-over-day percent completeness and is defined with respect to a BDE “Loans” represented by node 854.
- control 802 can be automatically applied against all data associated with the BDE “Loans” in the metadata model 850 to measure the percent completeness of the data. If the day-over-day completeness in the data increases or decreases by more than a threshold, the control 802 can trigger an event (e.g., an alert, a message, etc.) in order to warn of the potential data quality issue.
- the anomaly detection control 802 is defined with respect to the “Loans” BDE node 854, which is represented in the metadata model 850 by anomaly detection control node 852 being linked by an edge to the “Loans” BDE node 854.
- the anomaly detection control 802 propagates down the metadata model 850 from node 852 to BDE node 856a (“US Loans”) and BDE node 856b (“European Loans”), and then down to TDEs 858a, . . ., 858f and datasets 860a, 860b. In this manner, the anomaly detection control 802 only needs to be defined a single time at a logical level and then is automatically applied to all items of underlying data.
- in some cases, it is desirable to apply the anomaly detection control 802 (or another control, rule, or other logic) to data in only a portion or segment of the metadata model 850.
- an anomaly detected by the control when applied at the “Loans” level may not indicate whether the anomaly is due to data underlying “US Loans” or data underlying “European Loans.” Therefore, in some cases, it can be beneficial to apply the anomaly detection control 802 (or another control) to a particular segment 862 of the metadata model 850. To do so, execution engine 212 can receive (e.g., from client device 220) instructions to apply the control to the segment 862 (e.g., “US Loans”) of the metadata model 850.
- Execution engine 212 can pass these instructions to control identifier 210, which can then identify applicable controls and characteristics for the segment, which can be returned to execution engine 212. Using this information, execution engine 212 can generate executable instructions to apply the anomaly detection control 802 to data corresponding to the segment 862 of the metadata model 850. Segmenting in this way enables top-down controls to be applied with greater granularity, thereby facilitating improved data governance in certain scenarios and aiding in the identification of the root cause of data quality issues.
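A day-over-day completeness check of the kind described for control 802 might look like the following sketch. The threshold semantics (absolute change in percentage points), the field name `amount`, and the function names are assumptions made for illustration; the source does not specify them.

```python
def percent_complete(records, field_name):
    """Percentage of records in which field_name is populated."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field_name) not in (None, ""))
    return 100.0 * filled / len(records)


def completeness_anomaly(yesterday, today, field_name, threshold_pct):
    """Return (is_anomaly, change), where change is in percentage points."""
    change = percent_complete(today, field_name) - percent_complete(yesterday, field_name)
    return abs(change) > threshold_pct, change


# Hypothetical "US Loans" segment snapshots on two consecutive days.
yesterday = [{"amount": 100}, {"amount": 250}, {"amount": 75}, {"amount": 300}]
today = [{"amount": 120}, {"amount": None}, {"amount": ""}, {"amount": 310}]

print(completeness_anomaly(yesterday, today, "amount", threshold_pct=10.0))
```

Running the same check separately on the records underlying each segment is one way to narrow down which segment caused an anomaly seen at the parent level.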
- a process 900 is shown for generating metadata controls for data processing. Operations of the process 900 include storing, in a data store, a metadata model including one or more first items of metadata and one or more second items of metadata (902).
- At least one of the one or more first items of metadata can specify a semantic meaning associated with at least one of the one or more second items of metadata.
- the metadata model can specify a relationship between the at least one of the one or more first items of metadata and the at least one of the one or more second items of metadata.
- a control defined based on the at least one of the one or more first items of metadata specifying the semantic meaning is received (904).
- the metadata model is updated to include a third item of metadata representing the control (906).
- a relationship between the third item of metadata representing the control and the at least one of the one or more first items of metadata is specified (908).
- the updated metadata model with the specified relationship for the control is stored in a data store to be applied to one or more data elements associated with the at least one of the one or more second items of metadata with the relationship in the metadata model to the at least one of the one or more first items of metadata (910).
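The steps of process 900 can be sketched with a small in-memory model. This is only an illustration under assumed names: the dictionary layout, `add_control`, and `governed_tdes` are hypothetical, and persistence to a data store is omitted.

```python
def add_control(model, control_id, rule, bde_id):
    """Add a control node (906) and relate it to a semantic-meaning node (908)."""
    if bde_id not in model["nodes"]:
        raise KeyError(f"unknown BDE node: {bde_id}")
    model["nodes"][control_id] = {"kind": "control", "rule": rule}
    model["edges"].append((control_id, bde_id))
    return model


def governed_tdes(model, control_id):
    """List technical items whose semantic meaning is tied to the control."""
    bdes = {dst for src, dst in model["edges"] if src == control_id}
    return sorted(
        src
        for src, dst in model["edges"]
        if dst in bdes and model["nodes"][src]["kind"] == "tde"
    )


# Stored model (902): one BDE describing two TDEs from different datasets.
model = {
    "nodes": {
        "Contract Start Date": {"kind": "bde"},
        "Cust_Contr.st_dt": {"kind": "tde"},
        "Sales_Contr.start": {"kind": "tde"},
    },
    "edges": [
        ("Cust_Contr.st_dt", "Contract Start Date"),
        ("Sales_Contr.start", "Contract Start Date"),
    ],
}

add_control(model, "required_start", "field must be populated", "Contract Start Date")
print(governed_tdes(model, "required_start"))
```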
- a process 1000 is shown for applying metadata controls for data processing.
- Operations of the process 1000 include receiving a specification to process at least a portion of a dataset (1002). Responsive to the specification, one or more characteristics of the dataset are accessed (1004), one or more controls received from a development environment that are to be applied to one or more values of a field of the dataset in accordance with a metadata model are identified (1006), by: accessing a first instance of a data structure storing an identifier that corresponds to the dataset (1008); based on a reference stored in the first instance of the data structure, accessing a second instance of a data structure associated with the field of the dataset (1010); based on a reference stored in the second instance of the data structure, accessing a third instance of a data structure associated with metadata that describes one or more values of the field of the dataset (1012); based on a reference stored in the third instance of the data structure, accessing a fourth instance of a data structure storing a control defined based on the metadata that describes one
- an example operating environment for implementing embodiments of the present invention is shown and designated generally as computing device 1100.
- Essential elements of a computing device 1100 or a computer or data processing system or client or server are one or more programmable processors 1102 for performing actions in accordance with instructions and one or more memory devices 1104 for storing instructions and data.
- a computer will also include, or be operatively coupled (via bus 1112, fabric, network, etc.) to, I/O components 1106, e.g., display devices, network/communication subsystems, etc. (not shown) and one or more mass storage devices 1108 for storing data and instructions, etc., and a network communication subsystem 1110, which are powered by a power supply (not shown).
- the computer program instructions and data may be stored in non-transitory form, such as being embodied in a hardware storage device, including, e.g., a volatile storage medium (e.g., random access memory (RAM)) or a non-volatile storage medium (e.g., disk), or any other non-transitory medium, using a physical property of the medium (e.g., magnetic domains, or electrical charge) for a period of time (e.g., the time between refresh periods of a dynamic memory device such as a dynamic RAM).
- the software may be provided on a tangible, non-transitory medium, such as a CD-ROM or other computer-readable medium (e.g., readable by a general or special purpose computing system or device), or may be delivered (e.g., encoded in a propagated signal) over a communication medium of a network to a tangible, non-transitory medium of a computing system where it is executed.
- a special-purpose computer or using special-purpose hardware, such as coprocessors or field-programmable gate arrays (FPGAs) or dedicated, application-specific integrated circuits (ASICs).
- the processing may be implemented in a distributed manner in which different parts of the computation specified by the software are performed by different computing elements.
- Each such computer program is stored on or downloaded (from a cloud computing infrastructure or other remote source) to a computer-readable storage medium (e.g., solid state memory or media, or magnetic or optical media) of a storage device accessible by a general or special purpose programmable computer, for configuring and operating the computer when the storage device medium is read by the computer to perform the processing described herein.
- Each such computer program may also be accessed as a service provided by cloud computing infrastructure.
- the embodiments described herein may also be implemented as a tangible, non-transitory medium configured with a computer program, where the medium so configured causes a computer to operate in a specific and predefined manner to perform one or more of the processing steps described herein.
- the computer program may include one or more modules of a larger program, for example, that provides services related to the design, configuration, and execution of dataflow graphs.
- the modules of the program e.g., elements of a dataflow graph
- a computer having a display device (monitor) for displaying information to the user, and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer.
- a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user (for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser).
- Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification), or any combination of one or more such back end, middleware, or front end components.
- the components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network).
- Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), an inter-network (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks).
- the computing system can include clients and servers.
- a client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other.
- a server transmits data (e.g., an HTML page) to a client device (e.g., for purposes of displaying data to and receiving user input from a user interacting with the user device).
- Data generated at the client device e.g., a result of the user interaction
- Any computation described herein can be expressed as a dataflow graph having dataflow graph components (e.g., data processing components and/or datasets).
- a dataflow graph can be represented by a directed graph that includes nodes or vertices, representing the dataflow graph components, connected by directed links or data flow connections, representing flows of work elements (e.g., data) between the dataflow graph components.
- the data processing components include code for processing data from at least one data input (e.g., a data source) and providing data to at least one data output (e.g., a data sink) of a system.
- the dataflow graph can thus implement a graph-based computation performed on data flowing from one or more input datasets through the graph components to one or more output datasets.
- the dataflow graph itself is executable, e.g., by compiling or otherwise processing the dataflow graph to generate executable computer code.
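As a rough illustration of a computation expressed as a dataflow graph, the sketch below runs components in dependency order so that each one receives the outputs of its upstream components. It uses Python's standard `graphlib.TopologicalSorter` and is not the graph execution system referenced in the patent; `run_dataflow` and the three component names are hypothetical.

```python
from graphlib import TopologicalSorter


def run_dataflow(components, links):
    """Execute components in dependency order.

    components: name -> function taking a list of upstream outputs.
    links: (upstream, downstream) pairs representing directed data flows.
    """
    predecessors = {name: set() for name in components}
    for upstream, downstream in links:
        predecessors[downstream].add(upstream)

    outputs = {}
    for name in TopologicalSorter(predecessors).static_order():
        inputs = [outputs[p] for p in sorted(predecessors[name])]
        outputs[name] = components[name](inputs)
    return outputs


# A source, a filtering component, and a sink that aggregates.
components = {
    "read": lambda inputs: [3, -1, 4],
    "filter_positive": lambda inputs: [x for x in inputs[0] if x > 0],
    "write_total": lambda inputs: sum(inputs[0]),
}
links = [("read", "filter_positive"), ("filter_positive", "write_total")]

results = run_dataflow(components, links)
print(results["write_total"])
```

This sketch runs components one at a time; the component and pipeline parallelism mentioned in the description would require a scheduler that dispatches ready components to separate processes.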
- a system for executing such graph-based computations is described in U.S. Pat. No. 5,966,072, titled “Executing Computations Expressed as Graphs,” incorporated herein by reference.
- a component may be an upstream component, a downstream component, or both.
- An upstream component includes a component that outputs data to another component.
- a downstream component includes a component that receives data from another component.
- components include input and output ports.
- the links are directed links that are coupled from an output port of an upstream component to an input port of a downstream component.
- the ports have indicators that represent characteristics of how data is written to and read from the links and/or how the components are controlled to process data. These ports may have various characteristics. For example, one characteristic of a port is its directionality as an input port or output port.
- the directed links represent data and/or control being conveyed from an output port of an upstream component to an input port of a downstream component.
- a subset of the components serves as sources and/or sinks of data from the overall computation, for example, to and/or from data files, database tables, and external data flows.
- Parallelism can be achieved at least by enabling different components to be executed in parallel by different processes (hosted on the same or different server computers or processor cores), where different components executing in parallel on different paths through a dataflow graph is referred to as component parallelism, and different components executing in parallel on different portions of the same path through a dataflow graph is referred to as pipeline parallelism.
- the executable dataflow graph implements a graph-based computation performed on data flowing from one or more input datasets of a data source through the data processing components to one or more output datasets, wherein the dataflow graph is specified by data structures in the data storage, the dataflow graph having the nodes that are specified by the data structures and representing the data processing components connected by the one or more links, the links being specified by the data structures and representing data flows between the data processing components.
- An execution environment or runtime environment is coupled to the data storage and is hosted on one or more computers, the runtime environment including a pre-processing module configured to read the stored data structures specifying the dataflow graph and to allocate and configure system resources (e.g.
- the runtime environment including the execution module to schedule and control execution of the computation of the data processing components.
- the runtime or execution environment hosted on one or more computers is configured to read data from the data source and to process the data using an executable computer program expressed in form of the dataflow graph.
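The structure described in the items above (components with ports, directed links from an output port to an input port, and sources and sinks of data) can be sketched as follows. This is a minimal illustration, not the patented implementation: the class names `Component` and `DataflowGraph`, the single `out` port per component, and the sequential scheduler are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Component:
    """Hypothetical node: a processing step with named input ports and one output port."""
    name: str
    func: Callable[..., list]   # receives one list of records per input port
    in_ports: List[str] = field(default_factory=list)
    out_port: str = "out"


@dataclass
class DataflowGraph:
    components: Dict[str, Component] = field(default_factory=dict)
    # Each link is ((upstream name, output port), (downstream name, input port)).
    links: List[Tuple[Tuple[str, str], Tuple[str, str]]] = field(default_factory=list)

    def add(self, component: Component) -> None:
        self.components[component.name] = component

    def connect(self, upstream: str, downstream: str, in_port: str) -> None:
        src, dst = self.components[upstream], self.components[downstream]
        if in_port not in dst.in_ports:
            raise ValueError(f"{downstream} has no input port {in_port!r}")
        self.links.append(((src.name, src.out_port), (dst.name, in_port)))

    def run(self) -> Dict[str, list]:
        # Kahn's algorithm: a component runs once every upstream link has delivered data.
        pending = {name: 0 for name in self.components}
        for _, (dst, _port) in self.links:
            pending[dst] += 1
        ready = deque(name for name, count in pending.items() if count == 0)
        outputs: Dict[str, list] = {}
        while ready:
            name = ready.popleft()
            comp = self.components[name]
            inputs = {port: outputs[src]
                      for (src, _), (dst, port) in self.links if dst == name}
            outputs[name] = comp.func(*(inputs[p] for p in comp.in_ports))
            for (src, _), (dst, _port) in self.links:
                if src == name:
                    pending[dst] -= 1
                    if pending[dst] == 0:
                        ready.append(dst)
        return outputs


graph = DataflowGraph()
graph.add(Component("source", lambda: [1, 2, 3, 4]))   # source: no input ports
graph.add(Component("double", lambda rows: [r * 2 for r in rows], in_ports=["in"]))
graph.add(Component("evens", lambda rows: [r for r in rows if r % 4 == 0], in_ports=["in"]))
graph.add(Component("sink", lambda a, b: sorted(a + b), in_ports=["left", "right"]))
graph.connect("source", "double", "in")
graph.connect("source", "evens", "in")
graph.connect("double", "sink", "left")
graph.connect("evens", "sink", "right")
results = graph.run()
```

In this sketch `double` and `evens` lie on different paths from `source` to `sink`, which is where component parallelism would apply; the scheduler here runs them one after another for simplicity.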
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Computer Security & Cryptography (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A method of using a development environment to automatically generate code from a multi-tiered metadata model includes: receiving a specification for processing a dataset and, in response, accessing dataset characteristics and identifying controls, received from a development environment, to be applied to a field of the dataset in accordance with a metadata model by: accessing a first instance of a data structure that corresponds to the dataset; based on a reference in the first instance, accessing a second instance of a data structure associated with the field; based on a reference in the second instance, accessing a third instance of a data structure associated with metadata describing the field; and, based on a reference in the third instance, accessing a fourth instance of a data structure storing a control defined based on the metadata. Based on the dataset characteristics, code is generated to apply the identified control to the field.
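The four-step reference chain in the abstract (dataset, then field, then metadata describing the field, then a control defined on that metadata) can be illustrated with a small sketch. All names here are hypothetical: the classes, the `columns` entry used as a "dataset characteristic", and the convention that a control is a Python expression over a variable called `value` are assumptions for the example, not details taken from the application.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DatasetEntry:       # first instance: corresponds to the dataset
    name: str
    field_refs: List[str]


@dataclass
class FieldEntry:         # second instance: associated with the field
    name: str
    metadata_ref: str


@dataclass
class MetadataEntry:      # third instance: metadata describing the field
    description: str
    control_refs: List[str]


@dataclass
class ControlEntry:       # fourth instance: a control defined on the metadata
    control_id: str
    expression: str       # assumed form: Python expression over `value`


@dataclass
class MetadataModel:
    datasets: Dict[str, DatasetEntry]
    fields: Dict[str, FieldEntry]
    metadata: Dict[str, MetadataEntry]
    controls: Dict[str, ControlEntry]

    def controls_for(self, dataset_name: str, field_name: str) -> List[ControlEntry]:
        dataset = self.datasets[dataset_name]                      # tier 1
        field_entry = next(self.fields[ref]                        # tier 2
                           for ref in dataset.field_refs
                           if self.fields[ref].name == field_name)
        meta = self.metadata[field_entry.metadata_ref]             # tier 3
        return [self.controls[ref] for ref in meta.control_refs]   # tier 4


def generate_code(model: MetadataModel, dataset_name: str,
                  field_name: str, characteristics: dict) -> str:
    """Emit source for a checker; the physical column name comes from the characteristics."""
    column = characteristics["columns"][field_name]
    lines = [f"def check_{field_name}(record):",
             f"    value = record[{column!r}]"]
    for control in model.controls_for(dataset_name, field_name):
        lines.append(f"    if not ({control.expression}):")
        lines.append(f"        return {control.control_id!r}")
    lines.append("    return None")
    return "\n".join(lines)


model = MetadataModel(
    datasets={"customers": DatasetEntry("customers", ["f1"])},
    fields={"f1": FieldEntry("email", "m1")},
    metadata={"m1": MetadataEntry("Customer e-mail address", ["c1", "c2"])},
    controls={"c1": ControlEntry("not_null", "value is not None"),
              "c2": ControlEntry("has_at_sign", "'@' in value")},
)
source = generate_code(model, "customers", "email",
                       {"columns": {"email": "EMAIL_ADDR"}})
namespace = {}
exec(source, namespace)   # load the generated checker
check_email = namespace["check_email"]
```

The generated function returns the identifier of the first control that fails, or `None` when every control passes. Keeping the physical column name in the characteristics rather than in the model is one possible way to let the same control be reused across datasets with different layouts.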
Applications Claiming Priority (6)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363613579P | 2023-12-21 | 2023-12-21 | |
| US63/613,579 | 2023-12-21 | | |
| US202363616206P | 2023-12-29 | 2023-12-29 | |
| US63/616,206 | 2023-12-29 | | |
| US18/987,691 US20250208838A1 (en) | 2023-12-21 | 2024-12-19 | Development environment for automatically generating code using a multi-tiered metadata model |
| US18/987,691 | 2024-12-19 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025137522A1 (fr) | 2025-06-26 |
Family
ID=94393720
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2024/061392 Pending WO2025137522A1 (fr) | 2023-12-21 | 2024-12-20 | Development environment for automatically generating code using a multi-tiered metadata model |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025137522A1 (fr) |
Citations (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5966072A (en) | 1996-07-02 | 1999-10-12 | Ab Initio Software Corporation | Executing computations expressed as graphs |
| CA2538568A1 (fr) * | 2003-09-15 | 2005-03-31 | Ab Initio Software Corporation | Data interconnection |
| US20180349134A1 (en) * | 2017-06-06 | 2018-12-06 | Ab Initio Technology Llc | User interface that integrates plural client portals in plural user interface portions through sharing of one or more log records |
| EP3594822A1 (fr) * | 2018-07-13 | 2020-01-15 | Accenture Global Solutions Limited | Intelligent data ingestion system and method for governance and security |
| US20200026711A1 (en) * | 2018-07-19 | 2020-01-23 | Ab Initio Technology Llc | Publishing to a data warehouse |
| US20210165664A1 (en) * | 2017-02-23 | 2021-06-03 | Ab Initio Technology Llc | Dynamic execution of parameterized applications for the processing of keyed network data streams |
| US20210279043A1 (en) * | 2020-03-06 | 2021-09-09 | Ab Initio Technology Llc | Generation of optimized logic from a schema |
| US11423083B2 (en) | 2017-10-27 | 2022-08-23 | Ab Initio Technology Llc | Transforming a specification into a persistent computer program |
| US20220374413A1 (en) * | 2016-11-09 | 2022-11-24 | Ab Initio Technology Llc | Systems and methods for determining relationships among data elements |
| US11704494B2 (en) | 2019-05-31 | 2023-07-18 | Ab Initio Technology Llc | Discovering a semantic meaning of data fields from profile data of the data fields |
| US20230359668A1 (en) * | 2022-05-05 | 2023-11-09 | Ab Initio Technology Llc | Dataflow graph datasets |
| US11921710B2 (en) | 2021-01-31 | 2024-03-05 | Ab Initio Technology Llc | Systems and methods for accessing data entities managed by a data processing system |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP7333374B2 (ja) | | System and method for automatic mapping of data types used in a dataflow environment |
| US10685030B2 (en) | | Graphic representations of data relationships |
| US20210232628A1 (en) | | Systems and methods for querying databases |
| US10558651B2 (en) | | Search point management |
| EP2668725B1 (fr) | | Generating data pattern information |
| US20050102325A1 (en) | | Functional dependency data profiling |
| JP2018528506A (ja) | | Selecting queries for execution on a real-time data stream |
| CN105051729A (zh) | | Selection of data records |
| US12141143B2 (en) | | Partially typed semantic based query execution optimization |
| CN111858608A (zh) | | Data management method, apparatus, server, and storage medium |
| JP2016100005A (ja) | | Reconciliation method, processor, and storage medium |
| US20250208838A1 (en) | | Development environment for automatically generating code using a multi-tiered metadata model |
| WO2025137522A1 (fr) | | Development environment for automatically generating code using a multi-tiered metadata model |
| US12450426B1 (en) | | Method and system for cellular computation and display |
| US20240320224A1 (en) | | Logical Access for Previewing Expanded View Datasets |
| WO2025217363A1 (fr) | | Guiding a machine learning model in generating rules for data processing |
| US20250335448A1 (en) | | Metadata Change Triggers |
| AU2024241329A1 (en) | | Logical access for previewing expanded view datasets |
| WO2025226952A1 (fr) | | Metadata change triggers |
| HK1190518B (en) | | Generating data pattern information |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 24847304; Country of ref document: EP; Kind code of ref document: A1 |