October 15, 2024

A tour of Pandas: one of the most commonly used Python libraries for data

Spread the love

This is the first article in my series about Pandas. In this article, I will quickly go over the basic functionalities related to Pandas. This should provide you with the most basic understanding of this library.


Importing data

There are multiple file types supported by pandas, but for the sake of this article let’s import a CSV file stored locally.

movies = pd.read_csv('resources/movies.csv')
Output:
>>> movies
     Rank                         Title           Studio       Gross  Year
0       1             Avengers: Endgame      Buena Vista  $2,796.30   2019
1       2                        Avatar              Fox  $2,789.70   2009
2       3                       Titanic        Paramount  $2,187.50   1997
3       4  Star Wars: The Force Awakens      Buena Vista  $2,068.20   2015
4       5        Avengers: Infinity War      Buena Vista  $2,048.40   2018
..    ...                           ...              ...         ...   ...
777   778                     Yogi Bear  Warner Brothers    $201.60   2010
778   779           Garfield: The Movie              Fox    $200.80   2004
779   780                   Cats & Dogs  Warner Brothers    $200.70   2001
780   781      The Hunt for Red October        Paramount    $200.50   1990
781   782                      Valkyrie              MGM    $200.30   2008
  • The first row in the file contains the column headers for the data.
  • By default, a preview is show for the top 5 and bottom 5 rows.
  • Pandas import the data into an object called DataFrame, which along side Series is the bread and butter of this library.
  • DataFrame consists of multiple column data sets, each column is of type Series.
  • When we do not explicityly tell which column to use as index, then by default the library generates a numeric index column starting from 0.
  • We can modify the code to explicitly tell which column to use as index.
  • Index should not contain duplicate, as this column is used for sorting and filtering internally by the library.

Let’s update the code to set the Title as our index column:

movies = pd.read_csv(
    'resources/movies.csv',
    index_col='Title'
    )

Output:
>>> movies
                              Rank           Studio       Gross  Year
Title
Avengers: Endgame                1      Buena Vista  $2,796.30   2019
Avatar                           2              Fox  $2,789.70   2009
Titanic                          3        Paramount  $2,187.50   1997
Star Wars: The Force Awakens     4      Buena Vista  $2,068.20   2015
Avengers: Infinity War           5      Buena Vista  $2,048.40   2018
...                            ...              ...         ...   ...
Yogi Bear                      778  Warner Brothers    $201.60   2010
Garfield: The Movie            779              Fox    $200.80   2004
Cats & Dogs                    780  Warner Brothers    $200.70   2001
The Hunt for Red October       781        Paramount    $200.50   1990
Valkyrie                       782              MGM    $200.30   2008
  • Setting one of the columns as index removes that column from the table.

Common Properties of DataFrame

  • Check out the size and shape of our data
    • .shape -> returns a tuple of rows and cols
    • .size -> returns total number of cells, ie. rows * cols
    • Note that since we explicitly set one of the columns, Title, as index, that column won’t be considered while calculating the size and shape of the DataFrame
shape = movies.shape
size = movies.size
Output:
>>> shape
(782, 4)
>>> size
3128

  • Data types of the columns
movies.dtypes
Output:
>>> movies.dtypes
Rank       int64
Studio    object
Gross     object
Year       int64
dtype: object

Viewing data within your DataFrame

View top x or bottom x rows of your DataFrame

movies.head(3)
movies.tail(4)
Output:
>>> movies.head(3)
                   Rank       Studio       Gross  Year
Title
Avengers: Endgame     1  Buena Vista  $2,796.30   2019
Avatar                2          Fox  $2,789.70   2009
Titanic               3    Paramount  $2,187.50   1997
>>> movies.tail(4)
                          Rank           Studio     Gross  Year
Title
Garfield: The Movie        779              Fox  $200.80   2004
Cats & Dogs                780  Warner Brothers  $200.70   2001
The Hunt for Red October   781        Paramount  $200.50   1990
Valkyrie   

By default, it shows 5 rows. We can change that by passing an integer which represents the number of rows you want to view


Viewing a particular row of your DataFrame

  • .iloc -> index location, takes in integer value
    • index in a DataFrame starts at 0
  • .loc -> location name, takes in a text value
    • Kind of obvious but the text value is case sensitive
movies.iloc[499]
movies.loc['Maze Runner: The Death Cure']
Output:
>>> movies.iloc[499]
Rank           500
Studio         Fox
Gross     $288.30
Year          2018
Name: Maze Runner: The Death Cure, dtype: object
>>> movies.loc['Maze Runner: The Death Cure']
Rank           500
Studio         Fox
Gross     $288.30
Year          2018
Name: Maze Runner: The Death Cure, dtype: object
  • When selecting only one row, Series object is returned.

  • If duplicate index values are present then a DataFrame is returned.
    • Although pandas allow duplicate index but it’s recommended to keep index labels uniquie. It accelerates the speed at which pandas can locate and extract a specific row.
movies.loc['101 Dalmatians']
Output:
>>> movies.loc['101 Dalmatians']
                Rank       Studio     Gross  Year
Title
101 Dalmatians   425  Buena Vista  $320.70   1996
101 Dalmatians   708  Buena Vista  $215.90   1961

Sorting the DataFrame

  • .sort_values -> sorts the DataFrame by the given value, you can add multiple values as list
  • .sort_index -> sorts the DataFrame by the index value
  • Both methods take additional argument: ascending which is of boolean value and sets the order of the sort, the default value is True
movies.sort_values('Studio').head()
movies.sort_values(['Studio', 'Year'], ascending=False).head(3)
movies.sort_index(ascending=False).head()

Output:
>>> movies.sort_index(ascending=False).head()
                                Rank           Studio       Gross  Year
Title
xXx: The Return of Xander Cage   385        Paramount    $346.10   2017
Zootopia                          37      Buena Vista  $1,023.80   2016
Your Name.                       356              FUN    $358.00   2017
You've Got Mail                  582  Warner Brothers    $250.80   1998
Yogi Bear                        778  Warner Brothers    $201.60   2010

>>> movies.sort_values(['Studio', 'Year'], ascending=False).head(3)
                    Rank     Studio     Gross  Year
Title
The Imitation Game   642  Weinstein  $233.60   2014
The Intouchables     265  Weinstein  $426.60   2012
Django Unchained     266  Weinstein  $425.40   2012

>>> movies.sort_values('Studio').head()
                                                    Rank       Studio     Gross  Year
Title
The Blair Witch Project                              588      Artisan  $248.60   1999
The Sixth Sense                                      122  Buena Vista  $672.80   1999
Brother Bear                                         584  Buena Vista  $250.40   2003
Big Hero 6                                           127  Buena Vista  $657.80   2014
Pirates of the Caribbean: The Curse of the Blac...   128  Buena Vista  $654.30   2003
  • I have also filtered the sorted DataFrame to show the top rows.

Counting values in a Series

  • Selecting a particluar column from our DataFrame
movies['Studio']
Output:
>>> movies['Studio']
Title
Avengers: Endgame                   Buena Vista
Avatar                                      Fox
Titanic                               Paramount
Star Wars: The Force Awakens        Buena Vista
Avengers: Infinity War              Buena Vista
                                     ...
Yogi Bear                       Warner Brothers
Garfield: The Movie                         Fox
Cats & Dogs                     Warner Brothers
The Hunt for Red October              Paramount
Valkyrie                                    MGM
Name: Studio, Length: 782, dtype: object

  • Let’s count the number of times Studio value occurs in the Series
    • .value_counts -> method that counts the count of the values inside the series
    • It also takes ascending as an argument
movies['Studio'].value_counts().head(5)

Output:
>>> movies['Studio'].value_counts().head(5)
Warner Brothers    132
Buena Vista        125
Fox                117
Universal          109
Sony                86
Name: Studio, dtype: int64

Filtering Columns in your DataFrame

  • The general idea is to create a condition logic which returns a boolean value, which is then used to filter the True conditions within a DataFrame
  • We can also store the logic in a variable which then can be used for the filtering of your DataFrame
  • Single Condition
released_by_universal = movies['Studio'] == 'Universal'
movies[released_by_universal]
Output:
>>> released_by_universal = movies['Studio'] == 'Universal'
>>> movies[released_by_universal]
                                Rank     Studio       Gross  Year
Title
Jurassic World                     6  Universal  $1,671.70   2015
Furious 7                          8  Universal  $1,516.00   2015
Jurassic World: Fallen Kingdom    13  Universal  $1,309.50   2018
The Fate of the Furious           17  Universal  $1,236.00   2017
Minions                           19  Universal  $1,159.40   2015
...                              ...        ...         ...   ...
The Break-Up                     763  Universal    $205.00   2006
Everest                          766  Universal    $203.40   2015
Patch Adams                      772  Universal    $202.30   1998
Kindergarten Cop                 775  Universal    $202.00   1990
Straight Outta Compton           776  Universal    $201.60   2015
[109 rows x 4 columns]
  • If we look at the output for the condition, then we can see it’s a Series object with values True or False

  • Multiple Conditions:
    • We can also use multiple conditions to filter our DataFrame
    • We can use logical operators, & (and) | (or), to combine multiple conditions
released_by_universal = movies['Studio'] == 'Universal'
released_in_2015 = movies['Year'] == 2015
# Movies released by Universal AND in 2015
movies[released_by_universal & released_in_2015]
# Movies released by Universal OR in 2015
movies[released_by_universal | released_in_2015]

Additional methods for filtering the DataFrame

  • Less than or Greater than a specific value (only works with numeric values)
    • movies[“Year”] < 1900
    • movies[“Year”] > 1900
  • In between some range of values (only works with numeric values)
    • movies[“Year”].between(1900, 2000)
    • the arguments of the between method are inclusive
  • Text value containing a specific string
    • movies.index.str.lower().str.contains(“dark”)

Grouping data

Let’s try to figure out which is the highest-grossing studio. To do this, we first need to update the Gross column to a float value. To do this, we first need to remove “$” and “,” from that column.
Make sure to edit the column, not only output the result by setting the column to the new value.

movies['Gross'] = movies['Gross'].str.replace('$', '').str.replace(',', '').astype(float)

Now we have our Gross column data as float values, we can group our data by Studio. Please note, since we are grouping our data by a column we need to provide an aggregation method for the numeric columns. We can only include the columns which we need, but we need to provide aggregation for one of the columns.

movies.groupby('Studio').agg({'Gross': 'sum'}).sort_values('Gross', ascending=False).head()
Output:
>>> movies.groupby('Studio').agg({'Gross': 'sum'}).sort_values('Gross', ascending=False).head()
                   Gross
Studio
Buena Vista      73585.0
Warner Brothers  58643.8
Fox              50420.8
Universal        44302.3
Sony             32822.5
  • We first group our DataFrame by “Studio”
  • Then we apply sum aggregation to our “Gross” column
  • Then we sort the values by “Gross” column in a descending way (ascending=False)
  • Then we filter our data to show only the top 5 rows

We barely scratched the surface here, but at least this gives you an understanding of what exactly we can do with Pandas in Python. It is one of the most important and powerful tools to have in your tool kit. I will be writing more articles that go deeper into the topics. So keep a lookout for that.


Disclaimer: I have referred to the “Pandas in Action” book by Borris Pashkaver. It is one of the best books you can find out on this topic.


Spread the love

Leave a Reply

Your email address will not be published. Required fields are marked *