This project was tedious and lengthy, but it was a great example of what the ETL process looks like for most Data Scientist positions. I will never forget my first Data Analytics lesson from Professor Gour Saha: "Data is Dirty." He couldn't be more accurate, and while most analysts are chomping at the bit to perform all the sexy stuff like data visuals and machine learning, ETL is vital, and it is the task that will take up most of our time.
In completing this assignment, I truly came to appreciate the use of regular expressions (RegEx) and functions in creating an automated ETL pipeline. Unsurprisingly, the Pandas library was also quite useful in this process. Admittedly, I need more practice with some of the transformation techniques demonstrated in this project, but I have highlighted the techniques I believe will be most useful in the work environment.
The goals of the project:
- Create an automated ETL pipeline
- Extract movies data from multiple sources
- Clean and transform the data automatically using Pandas and RegEx
- Load new data into PostgreSQL
Data sources:
- Wikipedia data
- Kaggle metadata
- MovieLens rating data (from Kaggle)
Tools used:
- Python
- RegEx
- Pandas
- SQLAlchemy
- PostgreSQL
This is how the data was extracted.
Used a list comprehension to filter the data and Pandas to convert the movies file into a DataFrame, as sketched below.
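A minimal sketch of that extraction, assuming a Wikipedia JSON export (the file name and dictionary keys here are illustrative assumptions, not the project's actual ones):

```python
import json

import pandas as pd

# Load the raw Wikipedia JSON (the file name is an assumption for illustration).
with open('wikipedia_movies.json', mode='r') as file:
    wiki_movies_raw = json.load(file)

# List comprehension: keep only records that look like feature films,
# i.e. a director and an IMDb link are present and there is no TV-episode count.
wiki_movies = [movie for movie in wiki_movies_raw
               if ('Director' in movie or 'Directed by' in movie)
               and 'imdb_link' in movie
               and 'No. of episodes' not in movie]

# Convert the filtered list of dictionaries into a Pandas DataFrame.
wiki_movies_df = pd.DataFrame(wiki_movies)
```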
Created an empty dictionary and used loops to consolidate redundant column names, as in the sketch below.
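Continuing the sketch above, a hypothetical clean_movie() function shows the pattern: an empty dictionary collects alternate titles, and a loop merges duplicate column names under one label (all key names are assumptions):

```python
def clean_movie(movie):
    """Consolidate alternate titles and duplicate column names for one record."""
    movie = dict(movie)  # work on a non-destructive copy
    alt_titles = {}      # the empty dictionary that collects alternate titles

    # Loop over language-specific title keys (key names are illustrative)
    # and fold them into the alt_titles dictionary.
    for key in ['Also known as', 'French', 'Japanese', 'Polish']:
        if key in movie:
            alt_titles[key] = movie.pop(key)
    if alt_titles:
        movie['alt_titles'] = alt_titles

    # Consolidate columns that mean the same thing under a single name.
    def change_column_name(old_name, new_name):
        if old_name in movie:
            movie[new_name] = movie.pop(old_name)

    change_column_name('Directed by', 'Director')
    change_column_name('Written by', 'Writer(s)')
    return movie

# Apply the cleaner to every record before rebuilding the DataFrame.
clean_movies = [clean_movie(movie) for movie in wiki_movies]
```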
Performed cleaning techniques such as the following (a sketch comes after the list):
- Removal of duplicate rows
- Smart deletion of columns with a high share of missing values
- Dropping of NA values
- Using a lambda function
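A sketch of those four techniques, continuing from the DataFrame above (the column names and the 90% threshold are assumptions):

```python
# Remove duplicate rows, keyed on the IMDb link (column name is an assumption).
wiki_movies_df = wiki_movies_df.drop_duplicates(subset=['imdb_link'])

# "Smart" deletion: keep only columns that are less than 90% null
# (the 90% cutoff is a judgment call, not a fixed rule).
keep_columns = [col for col in wiki_movies_df.columns
                if wiki_movies_df[col].isnull().sum() < len(wiki_movies_df) * 0.9]
wiki_movies_df = wiki_movies_df[keep_columns]

# Drop rows that are missing a required value.
wiki_movies_df = wiki_movies_df.dropna(subset=['imdb_link'])

# Lambda example: flatten list-valued cells into a single string.
wiki_movies_df['Box office'] = wiki_movies_df['Box office'].apply(
    lambda x: ' '.join(x) if isinstance(x, list) else x)
```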
Used RegEx to clean the messy string values, as sketched below.
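For example, box-office strings come in several shapes; a sketch of matching them with RegEx (the patterns are simplified, not the project's exact ones):

```python
import re

# Two illustrative regex forms for box-office strings:
# form one catches "$123.4 million/billion", form two "$123,456,789".
form_one = r'\$\s*\d+\.?\d*\s*[mb]illion'
form_two = r'\$\s*\d{1,3}(?:,\d{3})+'

box_office = wiki_movies_df['Box office'].dropna().astype(str)

# Flag which raw values match either pattern (case-insensitive).
matches_form_one = box_office.str.contains(form_one, flags=re.IGNORECASE, regex=True)
matches_form_two = box_office.str.contains(form_two, flags=re.IGNORECASE, regex=True)

# Inspect the stragglers that match neither form.
print(box_office[~matches_form_one & ~matches_form_two])
```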
Other cleaning techniques demonstrated (sketched after this list):
- Parsing through data values and applying changes
- Dropping unnecessary columns
- Creating new variables
- Converting lists to strings
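A sketch tying those techniques together: a hypothetical parse_dollars() function walks through the raw values and applies changes, the result becomes a new variable, and the now-unnecessary raw column is dropped (patterns and column names are assumptions):

```python
import re

import numpy as np

def parse_dollars(s):
    """Parse a raw dollar string into a float (a simplified sketch)."""
    if not isinstance(s, str):
        return np.nan
    # "$123.4 million" -> 123400000.0
    match = re.match(r'\$\s*(\d+\.?\d*)\s+million', s, flags=re.IGNORECASE)
    if match:
        return float(match.group(1)) * 1_000_000
    # "$123,456,789" -> 123456789.0
    match = re.match(r'\$\s*([\d,]+)', s)
    if match:
        return float(match.group(1).replace(',', ''))
    return np.nan

# Create a new numeric variable from the parsed strings...
wiki_movies_df['box_office'] = wiki_movies_df['Box office'].apply(parse_dollars)

# ...then drop the raw column that is no longer needed.
wiki_movies_df = wiki_movies_df.drop('Box office', axis=1)
```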
Performed other cleaning techniques (see the sketch after this list), including:
- The to_numeric() method
- Pandas' built-in to_datetime() function
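A self-contained sketch of both conversions (the toy data is illustrative):

```python
import pandas as pd

# Toy Kaggle-style metadata (values are made up for illustration).
kaggle_metadata = pd.DataFrame({
    'budget': ['30000000', '65000000', 'not a number'],
    'release_date': ['1995-10-30', '1995-12-15', '1996-02-16'],
})

# to_numeric(): convert strings to numbers, coercing bad values to NaN.
kaggle_metadata['budget'] = pd.to_numeric(kaggle_metadata['budget'], errors='coerce')

# to_datetime(): parse the release dates into proper datetime objects.
kaggle_metadata['release_date'] = pd.to_datetime(kaggle_metadata['release_date'])

print(kaggle_metadata.dtypes)
```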
Connected Pandas to PostgreSQL through SQLAlchemy to load the DataFrame, as sketched below.
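A minimal sketch of the load step, assuming a local PostgreSQL database (the connection string, database name, and table name are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

# The connection string is an assumption; substitute real credentials.
db_string = 'postgresql://postgres:password@localhost:5432/movie_data'
engine = create_engine(db_string)

# A stand-in for the cleaned DataFrame produced by the steps above.
movies_df = pd.DataFrame({'imdb_link': ['tt0114709'], 'title': ['Toy Story']})

# Write the DataFrame to a "movies" table, replacing any existing copy.
movies_df.to_sql(name='movies', con=engine, if_exists='replace', index=False)
```

For a very large file such as the MovieLens ratings, pd.read_csv() also accepts a chunksize argument, so the data can be written to SQL in batches rather than all at once.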
View the full Python code: ETL-process