Pivot Tables are one of the most powerful tools provided in MS Excel and Pandas offers the same. We can create custom hierarchies, aggregate data, play around with rows and columns, and create custom views for our data to better represent our analysis. In this article, I will go over the fundamentals of creating Pivot Tables using Pandas and also explain some of its use-cases. So let’s get started!
Import our data
Before we proceed with our Pivot Tables, let’s quickly take a look at our data.
>>> sales.head()
Date Name Customer Revenue Expenses
0 2020-01-01 Oscar Logistics XYZ 5250 531
1 2020-01-01 Oscar Money Corp. 4406 661
2 2020-01-02 Oscar PaperMaven 8661 1401
3 2020-01-03 Oscar PaperGenius 7075 906
4 2020-01-04 Oscar Paper Pound 2524 1767
>>> sales.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 26 non-null datetime64[ns]
1 Name 26 non-null object
2 Customer 26 non-null object
3 Revenue 26 non-null int64
4 Expenses 26 non-null int64
dtypes: datetime64[ns](1), int64(2), object(2)
memory usage: 1.1+ KB
- We have 26 rows and 5 columns of sales data.
- The data is broken out by Date, Name and Customer and it records Revenue and Expenses.
- Looking at the heirarchy, we can see that an employee (Name) can generate multiple Revenue and Expenses on multiple days and on a single day multiple emplyees (Name) can also generate multiple Revenue and Expenses.
Pivot Table Syntax
Let’s take a look at the syntax before we actually start using our Pivot Tables.
The complete documentation can be found here.
Parameter | Description |
---|---|
data | DataFrame |
values | column to aggregate, optional |
index | column, Grouper, array, or list of the previous If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table index. If an array is passed, it is being used in the same manner as column values. |
columns | column, Grouper, array, or list of the previous If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table column. If an array is passed, it is being used in the same manner as column values. |
aggfunc | function, list of functions, dict, default numpy.mean If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves) If the dict is passed, the key is a column to aggregate and the value is function or list of functions. |
fill_value | scalar, default None Value to replace missing values with (in the resulting pivot table, after aggregation). |
margins | bool, default False Add all row / columns (e.g. for subtotal / grand totals). |
dropna | bool, default True Do not include columns whose entries are all NaN. |
margins_name | str, default ‘All’ Name of the row/column that will contain the totals when margins are True. |
We will go over all the use-cases for the different parameters and see what those mean in practice.
Pivoting our data
The general steps for creating a pivot table are:
- Selecting the columns we need group together.
- Selecting the columns we need to aggregate together.
- Define the aggregation we want to apply.
- Defining our rows and columns for the final view.
Example 1:
Let’s create a pivot table with Date and Name as indexes, summing both Revenue and Expenses.
>>> sales.pivot_table(
... index=['Date', 'Name'],
... aggfunc='sum',
... ).head()
Expenses Revenue
Date Name
2020-01-01 Creed 548 4430
Dwight 368 2639
Jim 1305 1864
Michael 412 7172
Oscar 1192 9656
Let’s walk through the code:
- We provided a list of column names as an argument to our index parameter, since we need both the columns as indexes.
- The sequence in which these are defined is the sequence they will be arranged in your output
- Since we want to aggregate all of our values we don’t need to specify an argument for values parameter. By default, it takes all the numeric columns.
- Instead of getting the mean of our values (which is the default aggregation), we want to get the sum of our values. That’s why we passed “sum” as an argument to the aggfunc parameter.
Example 2:
Let’s create a pivot table with Date as our index, Name as our column, summing up only Revenue.
>>> sales.pivot_table(
... index = 'Date',
... columns = 'Name',
... values = 'Revenue',
... aggfunc = 'sum',
... )
Name Creed Dwight Jim Michael Oscar
Date
2020-01-01 4430.0 2639.0 1864.0 7172.0 9656.0
2020-01-02 13214.0 NaN 8278.0 6362.0 8661.0
2020-01-03 NaN 11912.0 4226.0 5982.0 7075.0
2020-01-04 3144.0 NaN 6155.0 7917.0 2524.0
2020-01-05 938.0 7771.0 NaN 7837.0 2793.0
Let’s walk through the code:
- We added columns parameter and set it to Name since we needed Name in our columns.
- We specified that we only want Revenue as the output value by passing that to values parameter.
- Worth mentioning that Pandas added NaN values where there was no data available.
Let’s add a Total to our above output by setting margins to True.
>>> sales.pivot_table(
... index = 'Date',
... columns = 'Name',
... values = 'Revenue',
... aggfunc = 'sum',
... margins = True,
... )
Name Creed Dwight Jim Michael Oscar All
Date
2020-01-01 00:00:00 4430.0 2639.0 1864.0 7172.0 9656.0 25761
2020-01-02 00:00:00 13214.0 NaN 8278.0 6362.0 8661.0 36515
2020-01-03 00:00:00 NaN 11912.0 4226.0 5982.0 7075.0 29195
2020-01-04 00:00:00 3144.0 NaN 6155.0 7917.0 2524.0 19740
2020-01-05 00:00:00 938.0 7771.0 NaN 7837.0 2793.0 19339
All 21726.0 22322.0 20523.0 35270.0 30709.0 130550
Let’s walk through the code:
- A total column and row was added with a default label “All”.
- Note that the Date data type was now changed to include a string type All.
Let’s modify the total label by setting the margins_name parameter to the desired value.
>>> sales.pivot_table(
... index = 'Date',
... columns = 'Name',
... values = 'Revenue',
... aggfunc = 'sum',
... margins = True,
... margins_name = 'Total',
... )
Name Creed Dwight Jim Michael Oscar Total
Date
2020-01-01 00:00:00 4430.0 2639.0 1864.0 7172.0 9656.0 25761
2020-01-02 00:00:00 13214.0 NaN 8278.0 6362.0 8661.0 36515
2020-01-03 00:00:00 NaN 11912.0 4226.0 5982.0 7075.0 29195
2020-01-04 00:00:00 3144.0 NaN 6155.0 7917.0 2524.0 19740
2020-01-05 00:00:00 938.0 7771.0 NaN 7837.0 2793.0 19339
Total 21726.0 22322.0 20523.0 35270.0 30709.0 130550
Let’s fill up the Nan values with a defined value by setting the fill_value parameter to the desired value.
>>> sales.pivot_table(
... index = 'Date',
... columns = 'Name',
... values = 'Revenue',
... aggfunc = 'sum',
... margins = True,
... margins_name = 'Total',
... fill_value = 0,
... )
Name Creed Dwight Jim Michael Oscar Total
Date
2020-01-01 00:00:00 4430 2639 1864 7172 9656 25761
2020-01-02 00:00:00 13214 0 8278 6362 8661 36515
2020-01-03 00:00:00 0 11912 4226 5982 7075 29195
2020-01-04 00:00:00 3144 0 6155 7917 2524 19740
2020-01-05 00:00:00 938 7771 0 7837 2793 19339
Total 21726 22322 20523 35270 30709 130550
Example 3
Let’s create a similar structure as our example 1 but instead of summing both the values, we need to return the sum and mean of Expense
>>> sales.pivot_table(
... index = ['Date', 'Name'],
... values = 'Expenses',
... aggfunc = ['sum', 'mean'],
... ).head()
sum mean
Expenses Expenses
Date Name
2020-01-01 Creed 548 548.0
Dwight 368 368.0
Jim 1305 1305.0
Michael 412 412.0
Oscar 1192 596.0
Let’s walk through the code:
- In the aggfunc parameter, we passed in a list of the aggregation funcitons we want to apply as our argument.
- Pandas then applies these functions to the selected values.
Example 4
Let’s create a similar structure as above but now return both the values with different aggregation functions, the sum for the Expenses and mean for the Revenue
>>> sales.pivot_table(
... index = ['Date', 'Name'],
... aggfunc = {
... 'Expenses': 'sum',
... 'Revenue': 'mean'
... },
... ).head()
Expenses Revenue
Date Name
2020-01-01 Creed 548 4430.0
Dwight 368 2639.0
Jim 1305 1864.0
Michael 412 7172.0
Oscar 1192 4828.0
Let’s walk through the code:
- We can also provide a dictionary as an argument to our aggfunc parameter.
- The dictionary contains a key-value pair of the column name and the aggregation function we want to apply to that column.
The above examples cover the majority of the use cases you may find for pivot tables. I will update the list as I come across any worth mentioning use cases. Please note, you can unpivot the tables using the melt method as well. The idea behind unpivoting a table is pretty similar, just the opposite way. Feel free to look at the documentation here.