October 14, 2024

Pivot Tables: Let’s get serious with our tables in Pandas

Spread the love

Pivot Tables are one of the most powerful tools provided in MS Excel and Pandas offers the same. We can create custom hierarchies, aggregate data, play around with rows and columns, and create custom views for our data to better represent our analysis. In this article, I will go over the fundamentals of creating Pivot Tables using Pandas and also explain some of its use-cases. So let’s get started!


Import our data

Before we proceed with our Pivot Tables, let’s quickly take a look at our data.

>>> sales.head()
        Date   Name       Customer  Revenue  Expenses
0 2020-01-01  Oscar  Logistics XYZ     5250       531
1 2020-01-01  Oscar    Money Corp.     4406       661
2 2020-01-02  Oscar     PaperMaven     8661      1401
3 2020-01-03  Oscar    PaperGenius     7075       906
4 2020-01-04  Oscar    Paper Pound     2524      1767

>>> sales.info()
<class 'pandas.core.frame.DataFrame'>        
RangeIndex: 26 entries, 0 to 25
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Date      26 non-null     datetime64[ns]
 1   Name      26 non-null     object
 2   Customer  26 non-null     object
 3   Revenue   26 non-null     int64
 4   Expenses  26 non-null     int64
dtypes: datetime64[ns](1), int64(2), object(2)
memory usage: 1.1+ KB
  • We have 26 rows and 5 columns of sales data.
  • The data is broken out by Date, Name and Customer and it records Revenue and Expenses.
  • Looking at the heirarchy, we can see that an employee (Name) can generate multiple Revenue and Expenses on multiple days and on a single day multiple emplyees (Name) can also generate multiple Revenue and Expenses.

Pivot Table Syntax

Let’s take a look at the syntax before we actually start using our Pivot Tables.
The complete documentation can be found here.

ParameterDescription
dataDataFrame
valuescolumn to aggregate, optional
indexcolumn, Grouper, array, or list of the previous
If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table index. If an array is passed, it is being used in the same manner as column values.
columnscolumn, Grouper, array, or list of the previous
If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table column. If an array is passed, it is being used in the same manner as column values.
aggfuncfunction, list of functions, dict, default numpy.mean
If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves) If the dict is passed, the key is a column to aggregate and the value is function or list of functions.
fill_valuescalar, default None
Value to replace missing values with (in the resulting pivot table, after aggregation).
marginsbool, default False
Add all row / columns (e.g. for subtotal / grand totals).
dropnabool, default True
Do not include columns whose entries are all NaN.
margins_namestr, default ‘All’
Name of the row/column that will contain the totals when margins are True.

We will go over all the use-cases for the different parameters and see what those mean in practice.


Pivoting our data

The general steps for creating a pivot table are:

  • Selecting the columns we need group together.
  • Selecting the columns we need to aggregate together.
  • Define the aggregation we want to apply.
  • Defining our rows and columns for the final view.

Example 1:

Let’s create a pivot table with Date and Name as indexes, summing both Revenue and Expenses.

>>> sales.pivot_table(
...     index=['Date', 'Name'],
...     aggfunc='sum',
... ).head()
                    Expenses  Revenue
Date       Name
2020-01-01 Creed         548     4430
           Dwight        368     2639
           Jim          1305     1864
           Michael       412     7172
           Oscar        1192     9656

Let’s walk through the code:

  • We provided a list of column names as an argument to our index parameter, since we need both the columns as indexes.
    • The sequence in which these are defined is the sequence they will be arranged in your output
  • Since we want to aggregate all of our values we don’t need to specify an argument for values parameter. By default, it takes all the numeric columns.
  • Instead of getting the mean of our values (which is the default aggregation), we want to get the sum of our values. That’s why we passed “sum” as an argument to the aggfunc parameter.

Example 2:

Let’s create a pivot table with Date as our index, Name as our column, summing up only Revenue.

>>> sales.pivot_table(
...     index = 'Date',
...     columns = 'Name',
...     values = 'Revenue',
...     aggfunc = 'sum',
... )
Name          Creed   Dwight     Jim  Michael   Oscar
Date
2020-01-01   4430.0   2639.0  1864.0   7172.0  9656.0
2020-01-02  13214.0      NaN  8278.0   6362.0  8661.0
2020-01-03      NaN  11912.0  4226.0   5982.0  7075.0
2020-01-04   3144.0      NaN  6155.0   7917.0  2524.0
2020-01-05    938.0   7771.0     NaN   7837.0  2793.0

Let’s walk through the code:

  • We added columns parameter and set it to Name since we needed Name in our columns.
  • We specified that we only want Revenue as the output value by passing that to values parameter.
  • Worth mentioning that Pandas added NaN values where there was no data available.

Let’s add a Total to our above output by setting margins to True.

>>> sales.pivot_table(
...     index = 'Date',
...     columns = 'Name',
...     values = 'Revenue',
...     aggfunc = 'sum',
...     margins = True,
... )
Name                   Creed   Dwight      Jim  Michael    Oscar     All
Date
2020-01-01 00:00:00   4430.0   2639.0   1864.0   7172.0   9656.0   25761
2020-01-02 00:00:00  13214.0      NaN   8278.0   6362.0   8661.0   36515
2020-01-03 00:00:00      NaN  11912.0   4226.0   5982.0   7075.0   29195
2020-01-04 00:00:00   3144.0      NaN   6155.0   7917.0   2524.0   19740
2020-01-05 00:00:00    938.0   7771.0      NaN   7837.0   2793.0   19339
All                  21726.0  22322.0  20523.0  35270.0  30709.0  130550

Let’s walk through the code:

  • A total column and row was added with a default label “All”.
  • Note that the Date data type was now changed to include a string type All.

Let’s modify the total label by setting the margins_name parameter to the desired value.

>>> sales.pivot_table(
...     index = 'Date',
...     columns = 'Name',
...     values = 'Revenue',
...     aggfunc = 'sum',
...     margins = True,
...     margins_name = 'Total',
... )
Name                   Creed   Dwight      Jim  Michael    Oscar   Total
Date
2020-01-01 00:00:00   4430.0   2639.0   1864.0   7172.0   9656.0   25761
2020-01-02 00:00:00  13214.0      NaN   8278.0   6362.0   8661.0   36515
2020-01-03 00:00:00      NaN  11912.0   4226.0   5982.0   7075.0   29195
2020-01-04 00:00:00   3144.0      NaN   6155.0   7917.0   2524.0   19740
2020-01-05 00:00:00    938.0   7771.0      NaN   7837.0   2793.0   19339
Total                21726.0  22322.0  20523.0  35270.0  30709.0  130550

Let’s fill up the Nan values with a defined value by setting the fill_value parameter to the desired value.

>>> sales.pivot_table(
...     index = 'Date',
...     columns = 'Name',
...     values = 'Revenue',
...     aggfunc = 'sum',
...     margins = True,
...     margins_name = 'Total',
...     fill_value = 0,
... )
Name                 Creed  Dwight    Jim  Michael  Oscar   Total
Date
2020-01-01 00:00:00   4430    2639   1864     7172   9656   25761
2020-01-02 00:00:00  13214       0   8278     6362   8661   36515
2020-01-03 00:00:00      0   11912   4226     5982   7075   29195
2020-01-04 00:00:00   3144       0   6155     7917   2524   19740
2020-01-05 00:00:00    938    7771      0     7837   2793   19339
Total                21726   22322  20523    35270  30709  130550

Example 3

Let’s create a similar structure as our example 1 but instead of summing both the values, we need to return the sum and mean of Expense

>>> sales.pivot_table(
...     index = ['Date', 'Name'],
...     values = 'Expenses',
...     aggfunc = ['sum', 'mean'],
... ).head()
                        sum     mean
                   Expenses Expenses
Date       Name
2020-01-01 Creed        548    548.0
           Dwight       368    368.0
           Jim         1305   1305.0
           Michael      412    412.0
           Oscar       1192    596.0

Let’s walk through the code:

  • In the aggfunc parameter, we passed in a list of the aggregation funcitons we want to apply as our argument.
  • Pandas then applies these functions to the selected values.

Example 4

Let’s create a similar structure as above but now return both the values with different aggregation functions, the sum for the Expenses and mean for the Revenue

>>> sales.pivot_table(
...     index = ['Date', 'Name'],
...     aggfunc = {
...         'Expenses': 'sum',
...         'Revenue': 'mean'
...     },
... ).head()
                    Expenses  Revenue
Date       Name
2020-01-01 Creed         548   4430.0
           Dwight        368   2639.0
           Jim          1305   1864.0
           Michael       412   7172.0
           Oscar        1192   4828.0

Let’s walk through the code:

  • We can also provide a dictionary as an argument to our aggfunc parameter.
  • The dictionary contains a key-value pair of the column name and the aggregation function we want to apply to that column.

The above examples cover the majority of the use cases you may find for pivot tables. I will update the list as I come across any worth mentioning use cases. Please note, you can unpivot the tables using the melt method as well. The idea behind unpivoting a table is pretty similar, just the opposite way. Feel free to look at the documentation here.


Spread the love

Leave a Reply

Your email address will not be published. Required fields are marked *