This project was tedious and lengthy, but it was a great example of what the ETL process looks like for most Data Scientist positions. I will never forget my first Data Analytics lesson from Professor Gour Saha: "Data is Dirty." He couldn't be more accurate, and while most analysts are chomping at the bit to perform all the sexy stuff like data visuals and machine learning, ETL is vital, and it is the task that will take up most of our time.
In completing this assignment, I truly came to appreciate the use of regular expressions (RegEx) and functions in creating an automated ETL pipeline. Unsurprisingly, the Pandas library was also quite useful in this process. Admittedly, I need more practice with some of the transformation techniques demonstrated in this project, but I have highlighted the techniques I believe will be most useful in the work environment.
The goals of the project:
- Create an automated ETL pipeline
- Extract movies data from multiple sources
- Clean and transform the data automatically using Pandas and RegEx
- Load new data into PostgreSQL
Data sources:
- Wikipedia data
- Kaggle metadata
- MovieLens rating data (from Kaggle)
Tools used:
- Python
- RegEx
- Pandas
- SQLAlchemy
- PostgreSQL
This is how the data was extracted.
Used a list comprehension to filter the data and Pandas to convert the movies file into a DataFrame, as sketched below.
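A minimal sketch of that extraction, assuming a Wikipedia JSON export (the file name and dictionary keys here are illustrative assumptions, not the project's actual ones):

```python
import json

import pandas as pd

# Load the raw Wikipedia JSON (the file name is an assumption for illustration).
with open('wikipedia_movies.json', mode='r') as file:
    wiki_movies_raw = json.load(file)

# List comprehension: keep only records that look like feature films,
# i.e. a director and an IMDb link are present and there is no TV-episode count.
wiki_movies = [movie for movie in wiki_movies_raw
               if ('Director' in movie or 'Directed by' in movie)
               and 'imdb_link' in movie
               and 'No. of episodes' not in movie]

# Convert the filtered list of dictionaries into a Pandas DataFrame.
wiki_movies_df = pd.DataFrame(wiki_movies)
```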
Created an empty dictionary and used loops to consolidate redundant column names, as in the sketch below.
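Continuing the sketch above, a hypothetical clean_movie() function shows the pattern: an empty dictionary collects alternate titles, and a loop merges duplicate column names under one label (all key names are assumptions):

```python
def clean_movie(movie):
    """Consolidate alternate titles and duplicate column names for one record."""
    movie = dict(movie)  # work on a non-destructive copy
    alt_titles = {}      # the empty dictionary that collects alternate titles

    # Loop over language-specific title keys (key names are illustrative)
    # and fold them into the alt_titles dictionary.
    for key in ['Also known as', 'French', 'Japanese', 'Polish']:
        if key in movie:
            alt_titles[key] = movie.pop(key)
    if alt_titles:
        movie['alt_titles'] = alt_titles

    # Consolidate columns that mean the same thing under a single name.
    def change_column_name(old_name, new_name):
        if old_name in movie:
            movie[new_name] = movie.pop(old_name)

    change_column_name('Directed by', 'Director')
    change_column_name('Written by', 'Writer(s)')
    return movie

# Apply the cleaner to every record before rebuilding the DataFrame.
clean_movies = [clean_movie(movie) for movie in wiki_movies]
```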
Performed cleaning techniques such as the following (a sketch comes after the list):
- Removal of duplicate rows
- Smart deletion of columns with a high share of missing values
- Dropping of NA values
- Using a lambda function
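A sketch of those four techniques, continuing from the DataFrame above (the column names and the 90% threshold are assumptions):

```python
# Remove duplicate rows, keyed on the IMDb link (column name is an assumption).
wiki_movies_df = wiki_movies_df.drop_duplicates(subset=['imdb_link'])

# "Smart" deletion: keep only columns that are less than 90% null
# (the 90% cutoff is a judgment call, not a fixed rule).
keep_columns = [col for col in wiki_movies_df.columns
                if wiki_movies_df[col].isnull().sum() < len(wiki_movies_df) * 0.9]
wiki_movies_df = wiki_movies_df[keep_columns]

# Drop rows that are missing a required value.
wiki_movies_df = wiki_movies_df.dropna(subset=['imdb_link'])

# Lambda example: flatten list-valued cells into a single string.
wiki_movies_df['Box office'] = wiki_movies_df['Box office'].apply(
    lambda x: ' '.join(x) if isinstance(x, list) else x)
```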
Used RegEx to clean the messy string values, as sketched below.
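For example, box-office strings come in several shapes; a sketch of matching them with RegEx (the patterns are simplified, not the project's exact ones):

```python
import re

# Two illustrative regex forms for box-office strings:
# form one catches "$123.4 million/billion", form two "$123,456,789".
form_one = r'\$\s*\d+\.?\d*\s*[mb]illion'
form_two = r'\$\s*\d{1,3}(?:,\d{3})+'

box_office = wiki_movies_df['Box office'].dropna().astype(str)

# Flag which raw values match either pattern (case-insensitive).
matches_form_one = box_office.str.contains(form_one, flags=re.IGNORECASE, regex=True)
matches_form_two = box_office.str.contains(form_two, flags=re.IGNORECASE, regex=True)

# Inspect the stragglers that match neither form.
print(box_office[~matches_form_one & ~matches_form_two])
```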
Other cleaning techniques demonstrated (sketched after this list):
- Parsing through data values and applying changes
- Dropping unnecessary columns
- Creating new variables
- Converting lists to strings
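A sketch tying those techniques together: a hypothetical parse_dollars() function walks through the raw values and applies changes, the result becomes a new variable, and the now-unnecessary raw column is dropped (patterns and column names are assumptions):

```python
import re

import numpy as np

def parse_dollars(s):
    """Parse a raw dollar string into a float (a simplified sketch)."""
    if not isinstance(s, str):
        return np.nan
    # "$123.4 million" -> 123400000.0
    match = re.match(r'\$\s*(\d+\.?\d*)\s+million', s, flags=re.IGNORECASE)
    if match:
        return float(match.group(1)) * 1_000_000
    # "$123,456,789" -> 123456789.0
    match = re.match(r'\$\s*([\d,]+)', s)
    if match:
        return float(match.group(1).replace(',', ''))
    return np.nan

# Create a new numeric variable from the parsed strings...
wiki_movies_df['box_office'] = wiki_movies_df['Box office'].apply(parse_dollars)

# ...then drop the raw column that is no longer needed.
wiki_movies_df = wiki_movies_df.drop('Box office', axis=1)
```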
Performed other cleaning techniques (see the sketch after this list), including:
- The to_numeric() method
- Pandas' built-in to_datetime() function
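A self-contained sketch of both conversions (the toy data is illustrative):

```python
import pandas as pd

# Toy Kaggle-style metadata (values are made up for illustration).
kaggle_metadata = pd.DataFrame({
    'budget': ['30000000', '65000000', 'not a number'],
    'release_date': ['1995-10-30', '1995-12-15', '1996-02-16'],
})

# to_numeric(): convert strings to numbers, coercing bad values to NaN.
kaggle_metadata['budget'] = pd.to_numeric(kaggle_metadata['budget'], errors='coerce')

# to_datetime(): parse the release dates into proper datetime objects.
kaggle_metadata['release_date'] = pd.to_datetime(kaggle_metadata['release_date'])

print(kaggle_metadata.dtypes)
```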
Connected Pandas to PostgreSQL through SQLAlchemy to load the DataFrame, as sketched below.
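A minimal sketch of the load step, assuming a local PostgreSQL database (the connection string, database name, and table name are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

# The connection string is an assumption; substitute real credentials.
db_string = 'postgresql://postgres:password@localhost:5432/movie_data'
engine = create_engine(db_string)

# A stand-in for the cleaned DataFrame produced by the steps above.
movies_df = pd.DataFrame({'imdb_link': ['tt0114709'], 'title': ['Toy Story']})

# Write the DataFrame to a "movies" table, replacing any existing copy.
movies_df.to_sql(name='movies', con=engine, if_exists='replace', index=False)
```

For a very large file such as the MovieLens ratings, pd.read_csv() also accepts a chunksize argument, so the data can be written to SQL in batches rather than all at once.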
View the full Python code: ETL-process