October 14, 2024

GroupBy: One of the most powerful methods available in Pandas

Spread the love

The data usually contains multiple levels of hierarchies, which can be grouped in multiple ways for our analysis. Pandas offer a method for tackling such situations.


Let’s look at our data

Before we proceed with the article, let’s quickly take a look at our data

>>> fortune.head()
              Company  Revenues  Profits  Employees       Sector                                  Industry
0             Walmart  500343.0   9862.0    2300000    Retailing                     General Merchandisers
1         Exxon Mobil  244363.0  19710.0      71200       Energy                        Petroleum Refining
2  Berkshire Hathaway  242137.0  44940.0     377000   Financials  Insurance: Property and Casualty (Stock)
3               Apple  229234.0  48351.0     123000   Technology               Computers, Office Equipment
4  UnitedHealth Group  201159.0  10558.0     260000  Health Care   Health Care: Insurance and Managed Care

>>> fortune.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Company    1000 non-null   object
 1   Revenues   1000 non-null   float64
 2   Profits    998 non-null    float64
 3   Employees  1000 non-null   int64
 4   Sector     1000 non-null   object
 5   Industry   1000 non-null   object
dtypes: float64(2), int64(1), object(3)
memory usage: 47.0+ KB
  • The data contains 1000 rows and 6 columns.
  • We have Company Data broken out by Company, Sector and Industry. The available measures are Revenues, Profits and Employees.
  • Looking at the data we can understand that a Sector can have multiple Industry and an Industry can have multiple Company.
  • So the hierarchy is Sector -> Industry -> Company.

General Syntax for GroupBy

Let’s look at the parameters for GroupBy
You can find the full documentation here.

ParameterDescription
bymapping, function, label, or list of labels
Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups. If a ndarray is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to the group by the columns in self. Notice that a tuple is interpreted as a (single) key.
axis{0 or ‘index’, 1 or ‘columns’}, default 0
Split along rows (0) or columns (1).
levelint, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
as_indexbool, default True
For aggregated output, return an object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
sortbool, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.
group_keysbool, default True
When calling apply, add group keys to the index to identify pieces.
observedbool, default False
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
dropnabool, default True
If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups

Grouping data by a column

In our data, since Sector is that main header we can think of grouping the data by Sector and then try to do some analysis

sectors = fortune.groupby('Sector')
>>> sectors
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000017215A736D0>

Grouping a dataset using groupby returns a DataFrameGroupBy object, which provides different methods and properties which we can utilize for our analysis.


Methods and Properties of DataFrameGroupBy Object

Let’s look at some of the properties and methods for our DataFrameGroupBy object

Methods / PropertiesDescription
.size()returns a Series object with grouped elements as index and count of unique occurrences as labels
.groupsreturns a dictionary with keys as the grouped elements and values as a list of indexes where those elements are present in our dataset
.get_group()name: object
The name of the group to get as a DataFrame
obj: DataFrame, default None
The DataFrame to take the DataFrame out of. If it is None, the object groupby was called on will be used
returns:
A group object with the group name
.first()numeric_only: bool, default False
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.
min_count: int, default -1
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
returns:
a Series or DataFrame with first of values within each group
.last() numeric_only: bool, default False
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.
min_count: int, default -1
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
returns:
a Series or DataFrame with last of values within each group
.nth()n: int or list of ints
A single nth value for the row or a list of nth values
dropna: {‘any’, ‘all’, None}, default None
Apply the specified dropna operation before counting which row is the nth row
returns:
a Series or DataFrame with nth value within each group

Let’s look at some of the methods and properties:

>>> sectors.size()
Sector
Aerospace & Defense               25
Apparel                           14
Business Services                 53
Chemicals                         33
Energy                           107
Engineering & Construction        27
Financials                       155
Food &  Drug Stores               12
Food, Beverages & Tobacco         37
Health Care                       71
Hotels, Restaurants & Leisure     26
Household Products                28
Industrials                       49
Materials                         45
Media                             25
Motor Vehicles & Parts            19
Retailing                         77
Technology                       103
Telecommunications                10
Transportation                    40
Wholesalers                       44
dtype: int64

>>> sectors.get_group('Apparel').head()
          Company  Revenues  Profits  Employees   Sector Industry
88           Nike   34350.0   4240.0      74400  Apparel  Apparel
241            VF   12400.0    614.9      69000  Apparel  Apparel
331           PVH    8915.0    537.8      28050  Apparel  Apparel
420  Ralph Lauren    6653.0    -99.3      18250  Apparel  Apparel
432   Hanesbrands    6478.0     61.9      67200  Apparel  Apparel

>>> sectors.first().head()
                           Company  Revenues  Profits  Employees               Industry
Sector
Aerospace & Defense         Boeing   93392.0   8197.0     140800  Aerospace and Defense
Apparel                       Nike   34350.0   4240.0      74400                Apparel
Business Services    ManpowerGroup   21034.0    545.4      29000         Temporary Help
Chemicals                DowDuPont   62683.0   1460.0      98000              Chemicals
Energy                 Exxon Mobil  244363.0  19710.0      71200     Petroleum Refining


Aggregate Operations

GroupBy object provides various aggregation methods, some of them are mentioned below:

Please Note: We can also use apply method similar to a DataFrame object. For a GroupBy object, the function is applied to individual groups.

MethodDescription
.sum() numeric_only: bool, default True
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.
min_count: int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
returns:
a Series or DataFrame with the sum of values within each group
.mean()numeric_only: bool, default True
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.
returns:
a Series or DataFrame with the mean of values within each group
.min()numeric_only: bool, default False
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.
min_count: int, default -1
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
returns:
a Series or DataFrame with the min of values within each group
.max()numeric_only: bool, default False
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.
min_count: int, default -1
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
returns:
a Series or DataFrame with the max of values within each group
.agg()func: function, str, list or dict
Function to use for aggregating the data. If a function, must either work when passed a Series or when passed to Series.apply.
Accepted combinations are:
– function
– string function name
– list of functions and/or function names, e.g. [np.sum, ‘mean’]
– dict of axis labels -> functions, function names or list of such. eg. {“column_name”: “sum”}

Let’s look at some examples:

>>> sectors.sum().head()
                      Revenues  Profits  Employees
Sector
Aerospace & Defense   383835.0  26733.5    1010124
Apparel               101157.3   6350.7     355699
Business Services     316090.0  37179.2    1593999
Chemicals             251151.0  20475.0     474020
Energy               1543507.2  85369.6     981207

>>> sectors.mean().head()
                         Revenues      Profits     Employees
Sector
Aerospace & Defense  15353.400000  1069.340000  40404.960000
Apparel               7225.521429   453.621429  25407.071429
Business Services     5963.962264   701.494340  30075.452830
Chemicals             7610.636364   620.454545  14364.242424
Energy               14425.300935   805.373585   9170.158879

aggregation_function = {
    'Revenues': 'sum',
    'Profits': 'max',
    'Employees': 'mean',
}
>>> sectors.agg(aggregation_function).head()
                      Revenues  Profits     Employees
Sector
Aerospace & Defense   383835.0   8197.0  40404.960000
Apparel               101157.3   4240.0  25407.071429
Business Services     316090.0   6699.0  30075.452830
Chemicals             251151.0   3000.4  14364.242424
Energy               1543507.2  19710.0   9170.158879

Grouping by multiple columns

It’s also possible to group the DataFrame by multiple columns. Pandas will create a multi-index Series/DataFrame. Let’s group Sector and Industry together and see the result:

sector_and_industry = fortune.groupby(by=['Sector', 'Industry'])
>>> sector_and_industry.size()
Sector               Industry
Aerospace & Defense  Aerospace and Defense                            25
Apparel              Apparel                                          14
Business Services    Advertising, marketing                            2
                     Diversified Outsourcing Services                 14
                     Education                                         2
                                                                      ..
Transportation       Trucking, Truck Leasing                          11
Wholesalers          Wholesalers: Diversified                         24
                     Wholesalers: Electronics and Office Equipment     8
                     Wholesalers: Food and Grocery                     6
                     Wholesalers: Health Care                          6
Length: 82, dtype: int64
  • Here we can see that the Sector is further divided into Industry.
  • The sequence by which the DataFrame is grouped is very important, as that is the heirarchy in which our output is displayed.

Since it’s a multi-index object, we need to provide a tuple with multiple grouping keys to get_group.

>>> sector_and_industry.get_group(('Wholesalers', 'Wholesalers: Health Care'))
               Company  Revenues  Profits  Employees       Sector                  Industry
5             McKesson  198533.0   5070.0      64500  Wholesalers  Wholesalers: Health Care
11   AmerisourceBergen  153144.0    364.5      19500  Wholesalers  Wholesalers: Health Care
13     Cardinal Health  129976.0   1288.0      40400  Wholesalers  Wholesalers: Health Care
237       Henry Schein   12462.0    406.3      22000  Wholesalers  Wholesalers: Health Care
317      Owens & Minor    9318.0     72.8       8600  Wholesalers  Wholesalers: Health Care
489          Patterson    5593.0    170.9       7500  Wholesalers  Wholesalers: Health Care

The aggregation of a multi-index object is similar to a single index object.


GroupBy is one of the best methods to slice and dice your data, and to get to a particular aggregation level with the least amount of effort. Hopefully, this article was able to show you that.


Spread the love

One thought on “GroupBy: One of the most powerful methods available in Pandas

  1. Hi, just required you to know I he added your site to my Google bookmarks due to your layout. But seriously, I believe your internet site has 1 in the freshest theme I??ve came across. It extremely helps make reading your blog significantly easier.

Leave a Reply

Your email address will not be published. Required fields are marked *