Integrate with OpenLineage

OpenLineage is an open platform for collecting and analyzing data lineage. Using an open standard for lineage data, OpenLineage captures lineage events from data pipeline components that use an OpenLineage API to report on runs, jobs, and datasets.

Through the Data Lineage API, you can import OpenLineage events to display in the Dataplex Universal Catalog web interface alongside lineage information from Google Cloud services, such as BigQuery, Cloud Composer, Cloud Data Fusion, and Dataproc.

To import OpenLineage events that use the OpenLineage specification, use the ProcessOpenLineageRunEvent REST API method, and map OpenLineage facets to Data Lineage API attributes.

Limitations

  • The Data Lineage API supports OpenLineage major version 1.

  • The Data Lineage API endpoint ProcessOpenLineageRunEvent acts only as a consumer of OpenLineage messages, not a producer. The API lets you send lineage information generated by any OpenLineage-compliant tool or system to Dataplex Universal Catalog. Some Google Cloud services, such as Dataproc and Cloud Composer, include built-in OpenLineage producers that can send events to this endpoint, automating lineage capture from those services.

  • The Data Lineage API doesn't support the following:

    • Any subsequent OpenLineage release with message format changes
    • DatasetEvent
    • JobEvent
  • The maximum size of a single message is 5 MB.

  • The length of each fully qualified name (FQN) in inputs and outputs is limited to 4000 characters.

  • Links are grouped into events of up to 100 links each. The maximum aggregate number of links is 1000.

  • Dataplex Universal Catalog displays a lineage graph for each job run, showing the inputs and outputs of lineage events. It doesn't support lower-level processes like Spark stages.
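The size and length limits above can be checked on the client before an event is sent. The following sketch illustrates such a pre-flight check; `validate_event` is an illustrative helper, not part of any Google library, and it covers only the message-size and FQN-length limits:

```python
import json

MAX_MESSAGE_BYTES = 5 * 1024 * 1024  # 5 MB limit for a single message
MAX_FQN_CHARS = 4000                 # limit per fully qualified name


def validate_event(event):
    """Return a list of limit violations for an OpenLineage run event."""
    problems = []
    size = len(json.dumps(event).encode("utf-8"))
    if size > MAX_MESSAGE_BYTES:
        problems.append(f"message is {size} bytes; the limit is {MAX_MESSAGE_BYTES}")
    for field in ("inputs", "outputs"):
        for dataset in event.get(field, []):
            # The FQN is derived from the dataset's namespace and name.
            fqn = dataset.get("namespace", "") + dataset.get("name", "")
            if len(fqn) > MAX_FQN_CHARS:
                problems.append(f"{field} FQN exceeds {MAX_FQN_CHARS} characters")
    return problems


event = {
    "eventTime": "2023-04-04T13:21:16.098Z",
    "eventType": "COMPLETE",
    "inputs": [{"name": "somename", "namespace": "somenamespace"}],
    "job": {"name": "somename", "namespace": "somenamespace"},
    "outputs": [{"name": "somename", "namespace": "somenamespace"}],
    "run": {"runId": "somerunid"},
}
print(validate_event(event))  # → []
```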

OpenLineage mapping

The REST API method ProcessOpenLineageRunEvent maps OpenLineage attributes to Data Lineage API attributes as follows:

Data Lineage API attribute     | OpenLineage attribute
-------------------------------|----------------------
Process.name                   | projects/PROJECT_NUMBER/locations/LOCATION/processes/HASH_OF_NAMESPACE_AND_NAME
Process.displayName            | Job.namespace + ":" + Job.name
Process.attributes             | Job.facets (see Stored data)
Run.name                       | projects/PROJECT_NUMBER/locations/LOCATION/processes/HASH_OF_NAMESPACE_AND_NAME/runs/HASH_OF_RUNID
Run.displayName                | Run.runId
Run.attributes                 | Run.facets (see Stored data)
Run.startTime                  | eventTime
Run.endTime                    | eventTime
Run.state                      | eventType
LineageEvent.name              | projects/PROJECT_NUMBER/locations/LOCATION/processes/HASH_OF_NAMESPACE_AND_NAME/runs/HASH_OF_RUNID/lineageEvents/HASH_OF_JOB_RUN_INPUT_OUTPUTS_OF_EVENT (for example, projects/11111111/locations/us/processes/1234/runs/4321/lineageEvents/111-222-333)
LineageEvent.EventLinks.source | inputs (the FQN is the concatenation of namespace and name)
LineageEvent.EventLinks.target | outputs (the FQN is the concatenation of namespace and name)
LineageEvent.startTime         | eventTime
LineageEvent.endTime           | eventTime
requestId                      | Defined by the method user
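Two of the mappings above are simple string derivations and can be illustrated directly; the helper names below are illustrative, not API functions, and the resource-name hashing is internal to the API:

```python
# Illustration of two mappings from the table above:
#   Process.displayName = Job.namespace + ":" + Job.name
#   Run.displayName     = Run.runId


def process_display_name(job):
    """Derive Process.displayName from an OpenLineage job object."""
    return f"{job['namespace']}:{job['name']}"


def run_display_name(run):
    """Derive Run.displayName from an OpenLineage run object."""
    return run["runId"]


job = {"namespace": "somenamespace", "name": "somename"}
print(process_display_name(job))  # → somenamespace:somename
print(run_display_name({"runId": "somerunid"}))  # → somerunid
```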

Import an OpenLineage event

If you haven't yet set up OpenLineage, see Getting started.

To import an OpenLineage event into Dataplex Universal Catalog, call the REST API method ProcessOpenLineageRunEvent:

POST https://datalineage.googleapis.com/v1/projects/{project}/locations/{location}:processOpenLineageRunEvent

{
  "eventTime": "2023-04-04T13:21:16.098Z",
  "eventType": "COMPLETE",
  "inputs": [{"name": "somename", "namespace": "somenamespace"}],
  "job": {"name": "somename", "namespace": "somenamespace"},
  "outputs": [{"name": "somename", "namespace": "somenamespace"}],
  "producer": "someproducer",
  "run": {"runId": "somerunid"},
  "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/$defs/RunEvent"
}
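For scripted imports, the same call can be made from Python with only the standard library. This is a sketch, not an official client; `build_lineage_request` is an illustrative helper name, and the access token is assumed to come from your own credential flow (for example, `gcloud auth print-access-token`):

```python
import json
import urllib.request


def build_lineage_request(project, location, event, access_token):
    """Build a POST request for the processOpenLineageRunEvent method."""
    url = (
        "https://datalineage.googleapis.com/v1/"
        f"projects/{project}/locations/{location}:processOpenLineageRunEvent"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


event = {
    "eventTime": "2023-04-04T13:21:16.098Z",
    "eventType": "COMPLETE",
    "inputs": [{"name": "somename", "namespace": "somenamespace"}],
    "job": {"name": "somename", "namespace": "somenamespace"},
    "outputs": [{"name": "somename", "namespace": "somenamespace"}],
    "producer": "someproducer",
    "run": {"runId": "somerunid"},
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/$defs/RunEvent",
}
req = build_lineage_request("my-project", "us", event, access_token="ACCESS_TOKEN")
# urllib.request.urlopen(req)  # uncomment to actually send the event
```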

Tools for sending OpenLineage messages

To simplify sending events to the Data Lineage API, you can use various tools and libraries:

  • Google Cloud Java Producer Library: Google provides an open-source Java library to help construct and send OpenLineage events to the Data Lineage API. For more information, see the blog post Producer java library for Data Lineage is now open source. The library is available on GitHub and Maven.
  • OpenLineage GCP Transport: For Java-based OpenLineage producers, a dedicated GcpLineageTransport is available. It simplifies integration with the Data Lineage API by minimizing the code needed to send events. The GcpLineageTransport can be configured as the event sink for any existing OpenLineage producer, such as Airflow, Spark, or Flink. For more information and examples, see GcpLineage.
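As an illustration of pointing an OpenLineage Spark producer at this transport, a Spark session might carry configuration like the following. The property names here are assumptions to verify against the GcpLineage transport documentation, not confirmed values:

```python
# Hypothetical Spark configuration routing OpenLineage events through the
# GcpLineageTransport. Verify each key against the OpenLineage GcpLineage
# transport documentation before use.
spark_conf = {
    # Standard OpenLineage Spark integration listener.
    "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
    # Select the GcpLineage transport as the event sink.
    "spark.openlineage.transport.type": "gcplineage",
    # Hypothetical target project and location for the Data Lineage API.
    "spark.openlineage.transport.projectId": "my-project",
    "spark.openlineage.transport.location": "us",
}
print(spark_conf["spark.openlineage.transport.type"])  # → gcplineage
```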

Analyze information from OpenLineage

To analyze the imported OpenLineage events, see View lineage graphs in Dataplex Universal Catalog UI.

Stored data

The Data Lineage API doesn't store all facet data from OpenLineage messages. It stores only the following facet fields:

  • spark_version
    • openlineage-spark-version
    • spark-version
  • all spark.logicalPlan.*
  • environment-properties (custom Google Cloud lineage facet)
    • origin.sourcetype and origin.name
    • spark.app.id
    • spark.app.name
    • spark.batch.id
    • spark.batch.uuid
    • spark.cluster.name
    • spark.cluster.region
    • spark.job.id
    • spark.job.uuid
    • spark.project.id
    • spark.query.node.name
    • spark.session.id
    • spark.session.uuid

The Data Lineage API stores the following information:

  • eventTime
  • run.runId
  • job.namespace
  • job.name
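Because facets outside the allowlist above are dropped by the API, a client could strip them before sending to keep messages small. This is a sketch based on the facet keys listed above; `retained_facets` is an illustrative helper, not an API function:

```python
# Facet keys the Data Lineage API stores, per the list above.
STORED_FACET_KEYS = {"spark_version", "environment-properties"}


def retained_facets(facets):
    """Keep only facet fields the Data Lineage API stores; drop the rest."""
    return {
        key: value
        for key, value in facets.items()
        if key in STORED_FACET_KEYS or key.startswith("spark.logicalPlan")
    }


facets = {"spark_version": {"spark-version": "3.5.0"}, "custom_facet": {"a": 1}}
print(retained_facets(facets))  # → {'spark_version': {'spark-version': '3.5.0'}}
```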

What's next?