October 15, 2024

DataFrame: The workhorse of Pandas

Spread the love

A DataFrame is a two-dimensional table of data that can store different data types in different columns. Pandas assign an index label and an index position to each row as well as columns within the DataFrame. It’s easier to imagine DataFrames as a collection of Series.


Properties and Methods for DataFrame

DataFrame share lots of properties and methods with Series, so I will not be going over those in this article. Feel free to refer to my article of Series. But let’s take a look at some general properties of DataFrame

Properties / MethodsDescription
.dtypesreturns a series of column names as index and data types as values
.columnsreturns an index object with the column as values
.ndimreturns number of dimensions, a DataFrame has 2 dimensions
.shapereturns a tuple with several rows and columns
.sizereturns the total number of values in the data set, it includes NaN values as well
.info()returns a DataFrame with columns, non-null counts and dtype
.astype()for converting dtypes of columns within DataFrame, takes the new data type as an argument
.fillna()replaces NaN values with the provided argument

Here’s a list of the shared methods between Series and DataFrames:

  • head()
  • tail()
  • sample()
  • nunique()
  • max()
  • min()
  • nlargest()
    • requires you to pass two arguments, n and columns
  • nsmallest()
    • requires you to pass two arguments, n and columns
  • sum()
    • can provide an argument, numeric_only=True, for outputing only numeric values
  • mean()
  • median()
  • mode()
  • std()

Sorting a DataFrame

  • sort_values
    • We can use sort_values for sorting our DataFrame by one or more columns. Its implementation is pretty similar to Series.
    • For sorting by multiple columns we can provide a list of all the columns and the DataFrame is sorted by those columns sequentially.
    • In case where we want to specify a specific sort order for multiple columns, we can pass a list of boolean values for ascending parameter.
  • sort_index
    • Similar to Series we can use sort_index to sort the DataFrame by index
    • Since there is an additional column index in DataFrame, sort_index takes in an additional parameter, axis, and pass it an argument of “columns” or 1. By default, it’s set to “rows” or 0.

Selecting Columns in DataFrames

Selecting single columns
  • We can either use a dot notation or pass the column name inside of square brackets.
  • Dot notation does not work when a column name has a space in between, so I would recommend using the square bracket notation always.
  • When selecting single column with single pair for square brackets, a Series object is returned.
  • If you need to return a DataFrame instead, we can include the column name inside two pairs of square brackets.
# Returns a Series Object
>>> nba["Name"].head()
0      Shake Milton
1    Christian Wood
2     PJ Washington
3      Derrick Rose
4     Marial Shayok
Name: Name, dtype: object

>>> type(nba["Name"].head())
<class 'pandas.core.series.Series'>

# Returns a DataFrame Object
>>> nba[["Name"]].head()
             Name
0    Shake Milton
1  Christian Wood
2   PJ Washington
3    Derrick Rose
4   Marial Shayok

>>> type(nba[["Name"]].head())
<class 'pandas.core.frame.DataFrame'>

Selecting multiple columns
  • For selecting multiple columns, we simply provide the list of columns we want inside two pairs of square brackets.
  • It always returns a DataFrame object, since we are selecting multiple columns and a Series object can not have multiple columns
  • Pandas will extract the columns based on their order in the list.
>>> nba[["Name", "Birthday"]].head()
             Name   Birthday
0    Shake Milton 1996-09-26
1  Christian Wood 1995-09-27
2   PJ Washington 1998-08-23
3    Derrick Rose 1988-10-04
4   Marial Shayok 1995-07-26

Using select_dtypes for selecting columns
  • We can also use select_dtypes method for selecting columns.
  • It accepts two parameters, include and exclude.
  • The arguments to these parameters can either be a string or a list of strings represeting the column type(s) that pandas should keep or discard.
# Excluded columns with int64 data type
nba.select_dtypes(exclude=['object'])

Selecting Rows in DataFrame

  • Before we proceed with selecting rows by index labels, let’s quickly update the index with the Name column
>>> nba.head()
                              Team Position   Birthday   Salary
Name
Shake Milton    Philadelphia 76ers       SG 1996-09-26  1445697
Christian Wood     Detroit Pistons       PF 1995-09-27  1645357
PJ Washington    Charlotte Hornets       PF 1998-08-23  3831840
Derrick Rose       Detroit Pistons       PG 1988-10-04  7317074
Marial Shayok   Philadelphia 76ers        G 1995-07-26    79568

Selecting rows by index labels
  • We can use ” .loc ” method to extract rows from a DataFrame.
  • It takes in the index label or a list of index labels as an argument.
  • While selecting single row, we can either have it return a Series object or a DataFrame object depending on the pairs of square brackets.
  • We can also extract a sequence of index labels. The syntax mirros Python’s list slicing syntax. We provide the starting value, a colon, and the ending value. For such extractions, it’s recommended to sort the index first, as it accelerates the speed.
    • The lower and upper bounds are inclusive.
    • We can keep the start value empty to extract data from the start
    • We can keep the ending value emply to extract data till the end
>>> nba.loc['Austin Rivers']
Team            Houston Rockets
Position                     PG
Birthday    1992-08-01 00:00:00
Salary                  2174310
Name: Austin Rivers, dtype: object

>>> type(nba.loc['Austin Rivers'])
<class 'pandas.core.series.Series'>


>>> nba.loc[['Austin Rivers']]
                          Team Position   Birthday   Salary
Name
Austin Rivers  Houston Rockets       PG 1992-08-01  2174310

>>> type(nba.loc[['Austin Rivers']])
<class 'pandas.core.frame.DataFrame'>

>>> nba.loc[['Austin Rivers', 'Shake Milton']]
                             Team Position   Birthday   Salary
Name
Austin Rivers     Houston Rockets       PG 1992-08-01  2174310
Shake Milton   Philadelphia 76ers       SG 1996-09-26  1445697


>>> nba.sort_index().loc['Marial Shayok' : 'Shake Milton']
                                           Team Position   Birthday    Salary
Name
Marial Shayok                Philadelphia 76ers        G 1995-07-26     79568
Mario Hezonja            Portland Trail Blazers       PF 1995-02-25   1737145
Markelle Fultz                    Orlando Magic       PG 1998-05-29   9745200
Markieff Morris                 Detroit Pistons       PF 1989-09-02   3200000
Marko Guduric                 Memphis Grizzlies       SG 1995-03-08   2625000
...                                         ...      ...        ...       ...
Serge Ibaka                     Toronto Raptors        C 1989-09-18  23271604
Seth Curry                     Dallas Mavericks       PG 1990-08-23   7461380
Shabazz Napier           Minnesota Timberwolves       PG 1991-07-14   1845301
Shai Gilgeous-Alexander   Oklahoma City Thunder       PG 1998-07-12   3952920
Shake Milton                 Philadelphia 76ers       SG 1996-09-26   1445697

[88 rows x 4 columns]

Selecting rows by index position
  • We can use ” .iloc ” method for extracting rows by index position.
  • The syntax is similar to ” .loc “, instead of actual labels we pass the index location (integer values)
  • When specifying the range using the colon sytax, but the upper bound is exclusive.
    • You can also specify negative index values similar to python slicing.
    • You can also give a third argument inside the colon sytax which represents the step of the range.
    • For example, [0 : 10 : 2] selects 0, 2, 4, 6, 8 indexes

Selecting values from specific rows and columns

Using loc and iloc
  • Both ” loc ” and ” iloc ” attributes can accept a second argument representing the column(s) to extract.
  • If we are using ” loc “, we have to provide the column name.
  • If we are using ” iloc “, we have to provide the colomn positions.
  • We can use all the above variations when using loc and iloc.

Some examples:

>>> nba.loc["Austin Rivers", "Team"]
'Houston Rockets'

>>> nba.loc['Marial Shayok', ['Position', 'Birthday']]
Position                      G
Birthday    1995-07-26 00:00:00
Name: Marial Shayok, dtype: object

>>> nba.loc[['Austin Rivers', 'Marial Shayok'], ['Team', 'Position']]
                             Team Position
Name
Austin Rivers     Houston Rockets       PG
Marial Shayok  Philadelphia 76ers        G

>>> nba.loc['Austin Rivers', 'Position' : 'Salary']
Position                     PG
Birthday    1992-08-01 00:00:00
Salary                  2174310

>>> nba.iloc[73, 3]
79568

>>> nba.iloc[10:15, :2]
                                  Team Position
Name
Jordan Clarkson    Cleveland Cavaliers       PG
Alex Caruso         Los Angeles Lakers       PG
Norvel Pelle        Philadelphia 76ers       FC
Tyler Johnson             Phoenix Suns       PG
Alec Burks       Golden State Warriors       SG

Using at and iat
  • When selecting a value using a combination of rows and columns, pandas provide us with two additional attributes: ” at ” and ” iat “
  • at -> can be compared to loc
  • iat -> can be compared to iloc
  • We can only provide one value for the row and column parameters. We can’t provide lists or sequence like we can for loc and iloc attributes.
  • Extracting value using at and iat is much faster compared to the same using loc and iloc respectively.

Renaming columns or rows

  • We can rename the columns / rows by giving a new list of names.
  • Or we can use the method provided by pandas, rename

Renaming by giving a new list
  • Both index and columns are not mutable, hence we need to provide a new list instead of changing one value within that list.
>>> nba.columns
Index(['Team', 'Position', 'Birthday', 'Salary'], dtype='object')

>>> nba.columns = ['Team', 'Position', 'Date of Birth', 'Pay']

>>> nba.columns
Index(['Team', 'Position', 'Date of Birth', 'Pay'], dtype='object')

Using rename method
  • Another approach for changing columns or indexes is by using the rename method.
  • It takes two parameters, columns and index, in which you can pass a dictionary as its argument.
  • The dictionary is a key value pair of the orginal value and the new value.
  • This method returns a new DataFrame which can then be saved over the original DataFrame.
  • This way we can update singular column or indexes as well.
>>> nba = nba.rename(columns={'Pay': 'Salary'})
>>> nba.head()
                              Team Position Date of Birth   Salary
Name
Shake Milton    Philadelphia 76ers       SG    1996-09-26  1445697
Christian Wood     Detroit Pistons       PF    1995-09-27  1645357
PJ Washington    Charlotte Hornets       PF    1998-08-23  3831840
Derrick Rose       Detroit Pistons       PG    1988-10-04  7317074
Marial Shayok   Philadelphia 76ers        G    1995-07-26    79568

>>> nba = nba.rename(
...     index={'PJ Washington': 'Paul Jamaine Washington'},
...     columns={'Date of Birth': 'Birthday'}
...     )
>>>
>>> nba.head()
                                       Team Position   Birthday   Salary
Name
Shake Milton             Philadelphia 76ers       SG 1996-09-26  1445697
Christian Wood              Detroit Pistons       PF 1995-09-27  1645357
Paul Jamaine Washington   Charlotte Hornets       PF 1998-08-23  3831840
Derrick Rose                Detroit Pistons       PG 1988-10-04  7317074
Marial Shayok            Philadelphia 76ers        G 1995-07-26    79568

Resetting the index

  • Earlier we used set_index to set one of the columns as the index. This causes our column to be set as an index and is taken out of the DataFrame.
  • If we then again set an index to another column, we would lose the initial column.
  • In order to update the index, we first need to reset the index using reset_index method and then we can use set_index to update our index.
>>> nba.head()
                                       Team Position   Birthday   Salary
Name
Shake Milton             Philadelphia 76ers       SG 1996-09-26  1445697
Christian Wood              Detroit Pistons       PF 1995-09-27  1645357
Paul Jamaine Washington   Charlotte Hornets       PF 1998-08-23  3831840
Derrick Rose                Detroit Pistons       PG 1988-10-04  7317074
Marial Shayok            Philadelphia 76ers        G 1995-07-26    79568

>>> nba.reset_index('Name').head()
                      Name                Team Position   Birthday   Salary
0             Shake Milton  Philadelphia 76ers       SG 1996-09-26  1445697
1           Christian Wood     Detroit Pistons       PF 1995-09-27  1645357
2  Paul Jamaine Washington   Charlotte Hornets       PF 1998-08-23  3831840
3             Derrick Rose     Detroit Pistons       PG 1988-10-04  7317074
4            Marial Shayok  Philadelphia 76ers        G 1995-07-26    79568

>>> nba.set_index('Team').head()
                   Position   Birthday   Salary
Team
Philadelphia 76ers       SG 1996-09-26  1445697
Detroit Pistons          PF 1995-09-27  1645357
Charlotte Hornets        PF 1998-08-23  3831840
Detroit Pistons          PG 1988-10-04  7317074
Philadelphia 76ers        G 1995-07-26    79568

Memory Optimization in DataFrame

  • DataFrame can grow pretty quickly as it stores our data, so saving tiny of memory can result in huge performance improvements.
  • One way to reduce space is by converting the data types in appropriate data types. By default, pandas assign a data type to capture multiple scenarios. For example, boolean and date time values are stored as string values.
    • We can use .astype() method to change the data types for various columns.
  • Pandas also include a special data type called a category. It is ideal for a column consisting a small number of unique values relative to its total size.
    • Some candidates for category are Days, Months, Income Groups, Gender, etc.
    • Behind the scene, pandas store only one copy of each category rather than storing the duplicate accross rows and then simply points to that category when needed.

Filtering a DataFrame

  • The general idea is to create a condition logic which returns a boolean value, which is then used to filter the True conditions within a DataFrame.
  • We can also store the logic in a variable which then can be used for the filtering of your DataFrame.
  • We can have multiple conditions for filtering our DataFrame as well.

Logical Operations

When we have multiple conditions we can use the logical operator to combine them. Let’s look at some examples:

>>> detroit_pistons = nba['Team'] == 'Detroit Pistons'
>>> pf_position = nba['Position'] == 'PF'

# AND Operation
>>> nba[detroit_pistons & pf_position]
                Name             Team Position   Birthday    Salary
1     Christian Wood  Detroit Pistons       PF 1995-09-27   1645357
249  Markieff Morris  Detroit Pistons       PF 1989-09-02   3200000
317    Blake Griffin  Detroit Pistons       PF 1989-03-16  34449964

# OR Operation
>>> nba[detroit_pistons | pf_position].head()
              Name                   Team Position   Birthday    Salary
1   Christian Wood        Detroit Pistons       PF 1995-09-27   1645357
2    PJ Washington      Charlotte Hornets       PF 1998-08-23   3831840
3     Derrick Rose        Detroit Pistons       PG 1988-10-04   7317074
5   Draymond Green  Golden State Warriors       PF 1990-03-04  18539130
15  JaMychal Green   Los Angeles Clippers       PF 1990-06-21   4767000

# NOT Operation
>>> nba[~pf_position].head()
            Name                 Team Position   Birthday   Salary
0   Shake Milton   Philadelphia 76ers       SG 1996-09-26  1445697
3   Derrick Rose      Detroit Pistons       PG 1988-10-04  7317074
4  Marial Shayok   Philadelphia 76ers        G 1995-07-26    79568
6  Kendrick Nunn           Miami Heat       SG 1995-08-03  1416852
7     Cedi Osman  Cleveland Cavaliers       SF 1995-04-08  2907143

# GREATER THAN Operation
>>> salary_gt_100000 = nba['Salary'] > 100000
>>> nba[salary_gt_100000].head()
             Name                   Team Position   Birthday    Salary
0    Shake Milton     Philadelphia 76ers       SG 1996-09-26   1445697
1  Christian Wood        Detroit Pistons       PF 1995-09-27   1645357
2   PJ Washington      Charlotte Hornets       PF 1998-08-23   3831840
3    Derrick Rose        Detroit Pistons       PG 1988-10-04   7317074
5  Draymond Green  Golden State Warriors       PF 1990-03-04  18539130


Special Methods

Alongside the logical operators, pandas also provide some special methods for filtering the DataFrame

  • isin -> can be used to check if the given value is inside the list
    • We can achieve the same result by combining multiple OR conditions
  • between -> can be used to check if the given value is inside the given range
    • The lower bound is inclusive whereas the upper bound is exclusive
  • isnull -> used to check if rows have missing value or not
  • notnull -> returns True for all the non null values

Let’s look at some examples:

>>> position_list = ['SG', 'PF', 'GF']
>>> desired_positions = nba['Position'].isin(position_list)
>>> nba[desired_positions].head()
             Name                   Team Position   Birthday    Salary
0    Shake Milton     Philadelphia 76ers       SG 1996-09-26   1445697
1  Christian Wood        Detroit Pistons       PF 1995-09-27   1645357
2   PJ Washington      Charlotte Hornets       PF 1998-08-23   3831840
5  Draymond Green  Golden State Warriors       PF 1990-03-04  18539130
6   Kendrick Nunn             Miami Heat       SG 1995-08-03   1416852

>>> medium_salary_range = nba['Salary'].between(100000, 1500000)
>>> nba[medium_salary_range].head()
               Name                   Team Position   Birthday   Salary
0      Shake Milton     Philadelphia 76ers       SG 1996-09-26  1445697
6     Kendrick Nunn             Miami Heat       SG 1995-08-03  1416852
30  Nicolas Claxton          Brooklyn Nets       PF 1999-04-17   898310
41  DaQuan Jeffries       Sacramento Kings       SG 1997-08-30   898310
42   Hamidou Diallo  Oklahoma City Thunder       SF 1998-07-31  1416852

Dealing with null values

Pandas provide multiple ways to handle null values. We could either replace the null values with some specific value or remove the rows containing null values


Replacing Null Values
  • Pandas provide a “.fillna” method for replacing the null values with a constant
  • The parameters for this method are:
ParameterDescription
valuescalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
method{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
Method to use for filling holes in reindexed Series pad / ffill: propagate last valid observation forward to next valid backfill / bfill: use next valid observation to fill the gap.
axis{0 or ‘index’, 1 or ‘columns’}
Axis along which to fill missing values.
inplacebool, default False
If True, fill in-place. Note: this will modify any other views on this object
limitint, default None
If the method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If the method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

Removing rows containing Null values
  • For this purpose Pandas has provided “dropna” method.
  • The parameters fro this method are:
ParameterDescription
axis{0 or ‘index’, 1 or ‘columns’}, default 0
Determine if rows or columns which contain missing values are removed.
– 0, or ‘index’: Drop rows that contain missing values.
– 1, or ‘columns’: Drop columns that contain missing values.
how{‘any’, ‘all’}, default ‘any’
Determine if a row or column is removed from DataFrame when we have at least one NA or all NA.
– ‘any’: If any NA values are present, drop that row or column.
– ‘all’: If all values are NA, drop that row or column.
threshint, optional
Require that many non-NA values.
subsetarray-like, optional
Labels along another axis to consider, e.g. if you are dropping rows these would be a list of columns where pandas will look for Null values.
inplacebool, default False
If True, do the operation inplace and return None.

Dealing with duplicate values

  • Similar to null values, we can either modify the duplicate values or remove them entrirely

Modifying or Identifying duplicate values
  • The duplicated method returns a Boolean series that identifies duplicates in a column.
  • True is returned whenever a duplicate value is encountered.
  • By default, the first occurence is not considered as duplicate. We can change this behavior by passing last as an argument to its keep parameter. The default argument is first.
  • The parameters are as follows:
ParameterDescription
subsetcolumn label or sequence of labels, optional
Only consider certain columns for identifying duplicates, by default use all of the columns.
keep{‘first’, ‘last’, False}, default ‘first’
Determines which duplicates to mark.
– first: Mark duplicates as True except for the first occurrence.
– last: Mark duplicates as True except for the last occurrence.
– False: Mark all duplicates as True.

Removing duplicate values
  • The drop_duplicates method provides a way for removing the duplicates.
  • It’s pretty similar to duplicated method, the only difference is that drop_duplicates deletes the duplicate row instead of marking them as True or False.
  • The parameters are as follows:
ParameterDescription
subsetcolumn label or sequence of labels, optional
Only consider certain columns for identifying duplicates, by default use all of the columns.
keep{‘first’, ‘last’, False}, default ‘first’
Determines which duplicates (if any) to keep.
– first: Drop duplicates except for the first occurrence
– last: Drop duplicates except for the last occurrence
– False: Drop all duplicates.
inplacebool, default False
Whether to drop duplicates in place or to return a copy.
ignore_indexbool, default False
If True, the resulting axis will be labeled 0, 1, …, n – 1.

This concludes our DataFrame. Hopefully, this article covers the important topics and helps you in understanding DataFrames in a much clearer way. I will update the article or may write a new one for particular use cases and general FAQs for DataFrames, so keep an eye out for that.


Spread the love

Leave a Reply

Your email address will not be published. Required fields are marked *