End to End SF311 Data Pipeline

Project Overview: Enhancing Civic Insights with SF311 Data

This project applies data engineering methods to the analysis of SF311 data, which comes from San Francisco's civic service channel. Its core technical objectives are:

Data Extraction: Retrieving SF311 data through API integration.
Data Ingestion and Storage: Storing the raw data in a data lake on Google Cloud Storage (GCS), which serves as a centralized repository, and then loading it into BigQuery.
Data Transformation: Using dbt (data build tool) to structure, clean, and process the data, improving data quality and preparing it for analysis.
Data Visualization: Presenting the resulting insights in Google Looker Studio in an accessible and actionable form.
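
The extraction step can be pictured with a small sketch. It assumes the API pages its results with Socrata-style `$limit` and `$offset` query parameters, which the project description does not confirm, and the names `page_url`, `http_get_json`, and `fetch_all` are hypothetical helpers, not the project's code.

```python
import json
from typing import Callable, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen

Page = list[dict]


def page_url(base_url: str, offset: int, limit: int) -> str:
    """Build the URL for one page (assumes $limit/$offset style paging)."""
    query = urlencode({"$limit": limit, "$offset": offset, "$order": ":id"})
    return f"{base_url}?{query}"


def http_get_json(url: str) -> Page:
    """Fetch one page over HTTP and decode the JSON array it returns."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def fetch_all(
    base_url: str,
    limit: int = 1000,
    get_json: Callable[[str], Page] = http_get_json,
) -> Iterator[dict]:
    """Yield every record, requesting pages until a short page signals the end."""
    offset = 0
    while True:
        rows = get_json(page_url(base_url, offset, limit))
        yield from rows
        if len(rows) < limit:
            return
        offset += limit
```

Passing the page fetcher in as `get_json` keeps the paging logic testable without network access; in a real run the default `http_get_json` would be used with the actual dataset endpoint.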

The Data Warehouse architecture has three layers, with the middle Data Transformation Layer split into two sub-layers:

Raw Data Ingestion: BigQuery reads the raw data stored in GCS directly, keeping the raw data available to the downstream layers.
Data Preparation Sub-layer: Within the Data Transformation Layer, staging tables are created for a specific date, preparing that date's data for further processing.
Data Integration Sub-layer: After staging, data is loaded into destination tables using dbt's incremental strategy, with data cleaning and deduplication steps applied along the way. This sub-layer ensures data quality and prepares the data for aggregation.
Data Aggregation and Metrics Calculation: The core layer aggregates the cleaned data to calculate metrics and generate insights.
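
The Data Integration Sub-layer's behavior can be illustrated with a minimal sketch. In the project this logic lives in dbt models, so the Python below only mirrors the idea of an incremental load with cleaning and deduplication; the field names `case_id` and `updated_at` and the function `merge_incremental` are assumptions made for illustration.

```python
def merge_incremental(destination, staging, key="case_id", version="updated_at"):
    """Upsert staging rows into destination, keeping the newest row per key.

    Rows without a key are dropped (a simple cleaning step), and duplicate
    keys collapse to the row with the latest `version` value.
    """
    latest = {row[key]: row for row in destination}
    for row in staging:
        if row.get(key) is None:
            continue  # cleaning: skip rows that cannot be identified
        current = latest.get(row[key])
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


# Hypothetical rows: one existing case is updated, one new case arrives.
destination = [
    {"case_id": 1, "status": "Open", "updated_at": "2023-10-01"},
    {"case_id": 2, "status": "Open", "updated_at": "2023-10-01"},
]
staging = [
    {"case_id": 2, "status": "Closed", "updated_at": "2023-10-02"},
    {"case_id": 2, "status": "Closed", "updated_at": "2023-10-02"},
    {"case_id": 3, "status": "Open", "updated_at": "2023-10-02"},
    {"case_id": None, "status": "Open", "updated_at": "2023-10-02"},
]
merged = merge_incremental(destination, staging)
```

Because only the staged date's rows are compared against the destination, each run touches a small slice of data instead of rebuilding the whole table, which is the main reason to prefer an incremental load.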

Apache Airflow orchestrates the pipeline, automating each step from data extraction through transformation.
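
The ordering that an orchestrator enforces can be sketched without Airflow itself. The example below is not Airflow code: it uses Python's standard `graphlib` module to resolve task dependencies, and the task names are hypothetical labels for the steps described above.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract_from_api": set(),
    "upload_to_gcs": {"extract_from_api"},
    "load_to_bigquery": {"upload_to_gcs"},
    "dbt_staging": {"load_to_bigquery"},
    "dbt_incremental": {"dbt_staging"},
    "dbt_metrics": {"dbt_incremental"},
}


def run_order(graph):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


for task in run_order(PIPELINE):
    print(task)
```

In an Airflow DAG the same dependencies would be declared between operators, and the scheduler would additionally handle retries, scheduling, and logging.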

The project aims to give stakeholders an automated, reliable way to extract civic insights from SF311 data. These insights can help improve municipal services and contribute to the well-being of the community.

Dataset

The dataset for this project comes from SF311, a municipal service established by the City and County of San Francisco. SF311 is a communications channel that connects residents, businesses, and visitors with a team of Customer Service Representatives. It handles a wide range of government-related inquiries and services, from reporting issues with public spaces and infrastructure to requesting information on administrative matters.

Technologies and Tools

For this project I decided to use the following tools:

Infrastructure as Code: Terraform
Workflow orchestration: Airflow
VM instance to run the whole data pipeline: Google Compute Engine
Data Lake: Google Cloud Storage
Data Warehouse: Google BigQuery
Data Transformation: dbt
Data Visualization: Google Looker Studio

Repository contents:

Images
airflow
dbt_project
infra
.env
Dockerfile
README.md
docker-compose.yml
profiles.yml
requirements.txt

Repository: popolee0513/SF311-data-pipeline
