The data usually contains multiple levels of hierarchies, which can be grouped in multiple ways for our analysis. Pandas offer a method for tackling such situations.
Let’s look at our data
Before we proceed with the article, let’s quickly take a look at our data
>>> fortune.head()
Company Revenues Profits Employees Sector Industry
0 Walmart 500343.0 9862.0 2300000 Retailing General Merchandisers
1 Exxon Mobil 244363.0 19710.0 71200 Energy Petroleum Refining
2 Berkshire Hathaway 242137.0 44940.0 377000 Financials Insurance: Property and Casualty (Stock)
3 Apple 229234.0 48351.0 123000 Technology Computers, Office Equipment
4 UnitedHealth Group 201159.0 10558.0 260000 Health Care Health Care: Insurance and Managed Care
>>> fortune.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Company 1000 non-null object
1 Revenues 1000 non-null float64
2 Profits 998 non-null float64
3 Employees 1000 non-null int64
4 Sector 1000 non-null object
5 Industry 1000 non-null object
dtypes: float64(2), int64(1), object(3)
memory usage: 47.0+ KB
- The data contains 1000 rows and 6 columns.
- We have Company Data broken out by Company, Sector and Industry. The available measures are Revenues, Profits and Employees.
- Looking at the data we can understand that a Sector can have multiple Industry and an Industry can have multiple Company.
- So the hierarchy is Sector -> Industry -> Company.
General Syntax for GroupBy
Let’s look at the parameters for GroupBy
You can find the full documentation here.
Parameter | Description |
---|---|
by | mapping, function, label, or list of labels Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups. If a ndarray is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to the group by the columns in self. Notice that a tuple is interpreted as a (single) key. |
axis | {0 or ‘index’, 1 or ‘columns’}, default 0 Split along rows (0) or columns (1). |
level | int, level name, or sequence of such, default None If the axis is a MultiIndex (hierarchical), group by a particular level or levels. |
as_index | bool, default True For aggregated output, return an object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output. |
sort | bool, default True Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group. |
group_keys | bool, default True When calling apply, add group keys to the index to identify pieces. |
observed | bool, default False This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers. |
dropna | bool, default True If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups |
Grouping data by a column
In our data, since Sector is that main header we can think of grouping the data by Sector and then try to do some analysis
sectors = fortune.groupby('Sector')
>>> sectors
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000017215A736D0>
Grouping a dataset using groupby returns a DataFrameGroupBy object, which provides different methods and properties which we can utilize for our analysis.
Methods and Properties of DataFrameGroupBy Object
Let’s look at some of the properties and methods for our DataFrameGroupBy object
Methods / Properties | Description |
---|---|
.size() | returns a Series object with grouped elements as index and count of unique occurrences as labels |
.groups | returns a dictionary with keys as the grouped elements and values as a list of indexes where those elements are present in our dataset |
.get_group() | name: object The name of the group to get as a DataFrame obj: DataFrame, default None The DataFrame to take the DataFrame out of. If it is None, the object groupby was called on will be used returns: A group object with the group name |
.first() | numeric_only: bool, default False Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. min_count: int, default -1 The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA. returns: a Series or DataFrame with first of values within each group |
.last() | numeric_only: bool, default False Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. min_count: int, default -1 The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA. returns: a Series or DataFrame with last of values within each group |
.nth() | n: int or list of ints A single nth value for the row or a list of nth values dropna: {‘any’, ‘all’, None}, default None Apply the specified dropna operation before counting which row is the nth row returns: a Series or DataFrame with nth value within each group |
Let’s look at some of the methods and properties:
>>> sectors.size()
Sector
Aerospace & Defense 25
Apparel 14
Business Services 53
Chemicals 33
Energy 107
Engineering & Construction 27
Financials 155
Food & Drug Stores 12
Food, Beverages & Tobacco 37
Health Care 71
Hotels, Restaurants & Leisure 26
Household Products 28
Industrials 49
Materials 45
Media 25
Motor Vehicles & Parts 19
Retailing 77
Technology 103
Telecommunications 10
Transportation 40
Wholesalers 44
dtype: int64
>>> sectors.get_group('Apparel').head()
Company Revenues Profits Employees Sector Industry
88 Nike 34350.0 4240.0 74400 Apparel Apparel
241 VF 12400.0 614.9 69000 Apparel Apparel
331 PVH 8915.0 537.8 28050 Apparel Apparel
420 Ralph Lauren 6653.0 -99.3 18250 Apparel Apparel
432 Hanesbrands 6478.0 61.9 67200 Apparel Apparel
>>> sectors.first().head()
Company Revenues Profits Employees Industry
Sector
Aerospace & Defense Boeing 93392.0 8197.0 140800 Aerospace and Defense
Apparel Nike 34350.0 4240.0 74400 Apparel
Business Services ManpowerGroup 21034.0 545.4 29000 Temporary Help
Chemicals DowDuPont 62683.0 1460.0 98000 Chemicals
Energy Exxon Mobil 244363.0 19710.0 71200 Petroleum Refining
Aggregate Operations
GroupBy object provides various aggregation methods, some of them are mentioned below:
Please Note: We can also use apply method similar to a DataFrame object. For a GroupBy object, the function is applied to individual groups.
Method | Description |
---|---|
.sum() | numeric_only: bool, default True Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. min_count: int, default 0 The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA. returns: a Series or DataFrame with the sum of values within each group |
.mean() | numeric_only: bool, default True Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. returns: a Series or DataFrame with the mean of values within each group |
.min() | numeric_only: bool, default False Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. min_count: int, default -1 The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA. returns: a Series or DataFrame with the min of values within each group |
.max() | numeric_only: bool, default False Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. min_count: int, default -1 The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA. returns: a Series or DataFrame with the max of values within each group |
.agg() | func: function, str, list or dict Function to use for aggregating the data. If a function, must either work when passed a Series or when passed to Series.apply. Accepted combinations are: – function – string function name – list of functions and/or function names, e.g. [np.sum, ‘mean’] – dict of axis labels -> functions, function names or list of such. eg. {“column_name”: “sum”} |
Let’s look at some examples:
>>> sectors.sum().head()
Revenues Profits Employees
Sector
Aerospace & Defense 383835.0 26733.5 1010124
Apparel 101157.3 6350.7 355699
Business Services 316090.0 37179.2 1593999
Chemicals 251151.0 20475.0 474020
Energy 1543507.2 85369.6 981207
>>> sectors.mean().head()
Revenues Profits Employees
Sector
Aerospace & Defense 15353.400000 1069.340000 40404.960000
Apparel 7225.521429 453.621429 25407.071429
Business Services 5963.962264 701.494340 30075.452830
Chemicals 7610.636364 620.454545 14364.242424
Energy 14425.300935 805.373585 9170.158879
aggregation_function = {
'Revenues': 'sum',
'Profits': 'max',
'Employees': 'mean',
}
>>> sectors.agg(aggregation_function).head()
Revenues Profits Employees
Sector
Aerospace & Defense 383835.0 8197.0 40404.960000
Apparel 101157.3 4240.0 25407.071429
Business Services 316090.0 6699.0 30075.452830
Chemicals 251151.0 3000.4 14364.242424
Energy 1543507.2 19710.0 9170.158879
Grouping by multiple columns
It’s also possible to group the DataFrame by multiple columns. Pandas will create a multi-index Series/DataFrame. Let’s group Sector and Industry together and see the result:
sector_and_industry = fortune.groupby(by=['Sector', 'Industry'])
>>> sector_and_industry.size()
Sector Industry
Aerospace & Defense Aerospace and Defense 25
Apparel Apparel 14
Business Services Advertising, marketing 2
Diversified Outsourcing Services 14
Education 2
..
Transportation Trucking, Truck Leasing 11
Wholesalers Wholesalers: Diversified 24
Wholesalers: Electronics and Office Equipment 8
Wholesalers: Food and Grocery 6
Wholesalers: Health Care 6
Length: 82, dtype: int64
- Here we can see that the Sector is further divided into Industry.
- The sequence by which the DataFrame is grouped is very important, as that is the heirarchy in which our output is displayed.
Since it’s a multi-index object, we need to provide a tuple with multiple grouping keys to get_group.
>>> sector_and_industry.get_group(('Wholesalers', 'Wholesalers: Health Care'))
Company Revenues Profits Employees Sector Industry
5 McKesson 198533.0 5070.0 64500 Wholesalers Wholesalers: Health Care
11 AmerisourceBergen 153144.0 364.5 19500 Wholesalers Wholesalers: Health Care
13 Cardinal Health 129976.0 1288.0 40400 Wholesalers Wholesalers: Health Care
237 Henry Schein 12462.0 406.3 22000 Wholesalers Wholesalers: Health Care
317 Owens & Minor 9318.0 72.8 8600 Wholesalers Wholesalers: Health Care
489 Patterson 5593.0 170.9 7500 Wholesalers Wholesalers: Health Care
The aggregation of a multi-index object is similar to a single index object.
GroupBy is one of the best methods to slice and dice your data, and to get to a particular aggregation level with the least amount of effort. Hopefully, this article was able to show you that.
Hi, just required you to know I he added your site to my Google bookmarks due to your layout. But seriously, I believe your internet site has 1 in the freshest theme I??ve came across. It extremely helps make reading your blog significantly easier.