October 14, 2024

PySpark Project: Using PySpark for a basic ETL process


While doing a Tableau course, I came across an ETL task where I had to combine multiple CSV files and pivot them. Unfortunately, Tableau doesn't allow pivoting data that comes from different connections. The quickest workaround I could think of was Python, which is how I ended up creating this project. There are multiple ways to tackle this in Python; I chose PySpark since it scales easily and is something new that I am trying to learn.


Project Requirement

  • We have multiple CSV files with Date and stock values.
  • We are only interested in the Date and Adj Close values.
  • We need to merge all the files into one dataset with Date, Company, and Adj Close values.

Process Steps

  • Read all the files into a list of PySpark DataFrames.
  • Create a function that selects the desired columns and renames the price column depending on the file.
  • Join all the files into one DataFrame.
  • Unpivot the data.

Some additional notes:

  • I have used Python list comprehensions for creating lists of DataFrames.
  • I have used Python Type Hints within functions.
  • I have used the “reduce” function to iterate over a list, resulting in one DataFrame in the end.

Original Data

>>> apple_df.show(5, False)
+----------+----------+----------+----------+----------+----------+--------+
|Date      |Open      |High      |Low       |Close     |Adj Close |Volume  |
+----------+----------+----------+----------+----------+----------+--------+
|2018-08-27|217.149994|218.740005|216.330002|217.940002|214.609741|20525100|
|2018-08-28|219.009995|220.539993|218.919998|219.699997|216.34285 |22776800|
|2018-08-29|220.149994|223.490005|219.410004|222.979996|219.572723|27254800|
|2018-08-30|223.25    |228.259995|222.399994|225.029999|221.591385|48793800|
|2018-08-31|226.509995|228.869995|226.0     |227.630005|224.151657|43340100|
+----------+----------+----------+----------+----------+----------+--------+
only showing top 5 rows

Final Data

>>> stocks.show(10, False)
+----------+---------+-----------+
|Date      |Company  |Price      |
+----------+---------+-----------+
|2018-08-27|apple    |214.609741 |
|2018-08-27|bmw      |79.018585  |
|2018-08-27|tm       |124.692833 |
|2018-08-27|google   |1241.819946|
|2018-08-27|microsoft|107.904411 |
|2018-08-28|apple    |216.34285  |
|2018-08-28|bmw      |80.158409  |
|2018-08-28|tm       |124.298454 |
|2018-08-28|google   |1231.150024|
|2018-08-28|microsoft|108.554207 |
+----------+---------+-----------+
only showing top 10 rows

You can find the code on my GitHub profile here.
