October 14, 2024

PySpark Project: Using PySpark for a basic ETL process


While doing a Tableau course, I came across an ETL task where I had to combine multiple CSV files and pivot them. Unfortunately, Tableau doesn't allow pivoting data that comes from different connections. The quickest workaround I could think of was Python, which is how I ended up creating this project. There are multiple ways to tackle this in Python; I chose PySpark since it scales easily and is something new that I am trying to learn.


Project Requirement

  • We have multiple CSV files with Date and stock values.
  • We are only interested in the Date and Adj Close values.
  • We need to merge all the files into one dataset with Date, Company, and Adj Close values.

Process Steps

  • Read all the files into a list of PySpark DataFrames.
  • Create a function that selects the desired columns and renames the price column depending on the file.
  • Join all the files into one DataFrame.
  • Unpivot the data.

Some additional notes:

  • I have used Python list comprehensions for creating lists of DataFrames.
  • I have used Python Type Hints within functions.
  • I have used the “reduce” function to iterate over a list, resulting in one DataFrame in the end.

Original Data

>>> apple_df.show(5, False)
+----------+----------+----------+----------+----------+----------+--------+
|Date      |Open      |High      |Low       |Close     |Adj Close |Volume  |
+----------+----------+----------+----------+----------+----------+--------+
|2018-08-27|217.149994|218.740005|216.330002|217.940002|214.609741|20525100|
|2018-08-28|219.009995|220.539993|218.919998|219.699997|216.34285 |22776800|
|2018-08-29|220.149994|223.490005|219.410004|222.979996|219.572723|27254800|
|2018-08-30|223.25    |228.259995|222.399994|225.029999|221.591385|48793800|
|2018-08-31|226.509995|228.869995|226.0     |227.630005|224.151657|43340100|
+----------+----------+----------+----------+----------+----------+--------+
only showing top 5 rows

Final Data

>>> stocks.show(10, False)
+----------+---------+-----------+
|Date      |Company  |Price      |
+----------+---------+-----------+
|2018-08-27|apple    |214.609741 |
|2018-08-27|bmw      |79.018585  |
|2018-08-27|tm       |124.692833 |
|2018-08-27|google   |1241.819946|
|2018-08-27|microsoft|107.904411 |
|2018-08-28|apple    |216.34285  |
|2018-08-28|bmw      |80.158409  |
|2018-08-28|tm       |124.298454 |
|2018-08-28|google   |1231.150024|
|2018-08-28|microsoft|108.554207 |
+----------+---------+-----------+
only showing top 10 rows

You can find the code on my GitHub profile here.
