This is the first article in my series about Pandas. In this article, I will quickly go over the basic functionalities related to Pandas. This should provide you with the most basic understanding of this library.
Importing data
There are multiple file types supported by pandas, but for the sake of this article let’s import a CSV file stored locally.
movies = pd.read_csv('resources/movies.csv')
Output:
>>> movies
Rank Title Studio Gross Year
0 1 Avengers: Endgame Buena Vista $2,796.30 2019
1 2 Avatar Fox $2,789.70 2009
2 3 Titanic Paramount $2,187.50 1997
3 4 Star Wars: The Force Awakens Buena Vista $2,068.20 2015
4 5 Avengers: Infinity War Buena Vista $2,048.40 2018
.. ... ... ... ... ...
777 778 Yogi Bear Warner Brothers $201.60 2010
778 779 Garfield: The Movie Fox $200.80 2004
779 780 Cats & Dogs Warner Brothers $200.70 2001
780 781 The Hunt for Red October Paramount $200.50 1990
781 782 Valkyrie MGM $200.30 2008
- The first row in the file contains the column headers for the data.
- By default, a preview is show for the top 5 and bottom 5 rows.
- Pandas import the data into an object called DataFrame, which along side Series is the bread and butter of this library.
- DataFrame consists of multiple column data sets, each column is of type Series.
- When we do not explicityly tell which column to use as index, then by default the library generates a numeric index column starting from 0.
- We can modify the code to explicitly tell which column to use as index.
- Index should not contain duplicate, as this column is used for sorting and filtering internally by the library.
Let’s update the code to set the Title as our index column:
movies = pd.read_csv(
'resources/movies.csv',
index_col='Title'
)
Output:
>>> movies
Rank Studio Gross Year
Title
Avengers: Endgame 1 Buena Vista $2,796.30 2019
Avatar 2 Fox $2,789.70 2009
Titanic 3 Paramount $2,187.50 1997
Star Wars: The Force Awakens 4 Buena Vista $2,068.20 2015
Avengers: Infinity War 5 Buena Vista $2,048.40 2018
... ... ... ... ...
Yogi Bear 778 Warner Brothers $201.60 2010
Garfield: The Movie 779 Fox $200.80 2004
Cats & Dogs 780 Warner Brothers $200.70 2001
The Hunt for Red October 781 Paramount $200.50 1990
Valkyrie 782 MGM $200.30 2008
- Setting one of the columns as index removes that column from the table.
Common Properties of DataFrame
- Check out the size and shape of our data
- .shape -> returns a tuple of rows and cols
- .size -> returns total number of cells, ie. rows * cols
- Note that since we explicitly set one of the columns, Title, as index, that column won’t be considered while calculating the size and shape of the DataFrame
shape = movies.shape
size = movies.size
Output:
>>> shape
(782, 4)
>>> size
3128
- Data types of the columns
movies.dtypes
Output:
>>> movies.dtypes
Rank int64
Studio object
Gross object
Year int64
dtype: object
Viewing data within your DataFrame
View top x or bottom x rows of your DataFrame
movies.head(3)
movies.tail(4)
Output:
>>> movies.head(3)
Rank Studio Gross Year
Title
Avengers: Endgame 1 Buena Vista $2,796.30 2019
Avatar 2 Fox $2,789.70 2009
Titanic 3 Paramount $2,187.50 1997
>>> movies.tail(4)
Rank Studio Gross Year
Title
Garfield: The Movie 779 Fox $200.80 2004
Cats & Dogs 780 Warner Brothers $200.70 2001
The Hunt for Red October 781 Paramount $200.50 1990
Valkyrie
By default, it shows 5 rows. We can change that by passing an integer which represents the number of rows you want to view
Viewing a particular row of your DataFrame
- .iloc -> index location, takes in integer value
- index in a DataFrame starts at 0
- .loc -> location name, takes in a text value
- Kind of obvious but the text value is case sensitive
movies.iloc[499]
movies.loc['Maze Runner: The Death Cure']
Output:
>>> movies.iloc[499]
Rank 500
Studio Fox
Gross $288.30
Year 2018
Name: Maze Runner: The Death Cure, dtype: object
>>> movies.loc['Maze Runner: The Death Cure']
Rank 500
Studio Fox
Gross $288.30
Year 2018
Name: Maze Runner: The Death Cure, dtype: object
- When selecting only one row, Series object is returned.
- If duplicate index values are present then a DataFrame is returned.
- Although pandas allow duplicate index but it’s recommended to keep index labels uniquie. It accelerates the speed at which pandas can locate and extract a specific row.
movies.loc['101 Dalmatians']
Output:
>>> movies.loc['101 Dalmatians']
Rank Studio Gross Year
Title
101 Dalmatians 425 Buena Vista $320.70 1996
101 Dalmatians 708 Buena Vista $215.90 1961
Sorting the DataFrame
- .sort_values -> sorts the DataFrame by the given value, you can add multiple values as list
- .sort_index -> sorts the DataFrame by the index value
- Both methods take additional argument: ascending which is of boolean value and sets the order of the sort, the default value is True
movies.sort_values('Studio').head()
movies.sort_values(['Studio', 'Year'], ascending=False).head(3)
movies.sort_index(ascending=False).head()
Output:
>>> movies.sort_index(ascending=False).head()
Rank Studio Gross Year
Title
xXx: The Return of Xander Cage 385 Paramount $346.10 2017
Zootopia 37 Buena Vista $1,023.80 2016
Your Name. 356 FUN $358.00 2017
You've Got Mail 582 Warner Brothers $250.80 1998
Yogi Bear 778 Warner Brothers $201.60 2010
>>> movies.sort_values(['Studio', 'Year'], ascending=False).head(3)
Rank Studio Gross Year
Title
The Imitation Game 642 Weinstein $233.60 2014
The Intouchables 265 Weinstein $426.60 2012
Django Unchained 266 Weinstein $425.40 2012
>>> movies.sort_values('Studio').head()
Rank Studio Gross Year
Title
The Blair Witch Project 588 Artisan $248.60 1999
The Sixth Sense 122 Buena Vista $672.80 1999
Brother Bear 584 Buena Vista $250.40 2003
Big Hero 6 127 Buena Vista $657.80 2014
Pirates of the Caribbean: The Curse of the Blac... 128 Buena Vista $654.30 2003
- I have also filtered the sorted DataFrame to show the top rows.
Counting values in a Series
- Selecting a particluar column from our DataFrame
movies['Studio']
Output:
>>> movies['Studio']
Title
Avengers: Endgame Buena Vista
Avatar Fox
Titanic Paramount
Star Wars: The Force Awakens Buena Vista
Avengers: Infinity War Buena Vista
...
Yogi Bear Warner Brothers
Garfield: The Movie Fox
Cats & Dogs Warner Brothers
The Hunt for Red October Paramount
Valkyrie MGM
Name: Studio, Length: 782, dtype: object
- Let’s count the number of times Studio value occurs in the Series
- .value_counts -> method that counts the count of the values inside the series
- It also takes ascending as an argument
movies['Studio'].value_counts().head(5)
Output:
>>> movies['Studio'].value_counts().head(5)
Warner Brothers 132
Buena Vista 125
Fox 117
Universal 109
Sony 86
Name: Studio, dtype: int64
Filtering Columns in your DataFrame
- The general idea is to create a condition logic which returns a boolean value, which is then used to filter the True conditions within a DataFrame
- We can also store the logic in a variable which then can be used for the filtering of your DataFrame
- Single Condition
released_by_universal = movies['Studio'] == 'Universal'
movies[released_by_universal]
Output:
>>> released_by_universal = movies['Studio'] == 'Universal'
>>> movies[released_by_universal]
Rank Studio Gross Year
Title
Jurassic World 6 Universal $1,671.70 2015
Furious 7 8 Universal $1,516.00 2015
Jurassic World: Fallen Kingdom 13 Universal $1,309.50 2018
The Fate of the Furious 17 Universal $1,236.00 2017
Minions 19 Universal $1,159.40 2015
... ... ... ... ...
The Break-Up 763 Universal $205.00 2006
Everest 766 Universal $203.40 2015
Patch Adams 772 Universal $202.30 1998
Kindergarten Cop 775 Universal $202.00 1990
Straight Outta Compton 776 Universal $201.60 2015
[109 rows x 4 columns]
- If we look at the output for the condition, then we can see it’s a Series object with values True or False
- Multiple Conditions:
- We can also use multiple conditions to filter our DataFrame
- We can use logical operators, & (and) | (or), to combine multiple conditions
released_by_universal = movies['Studio'] == 'Universal'
released_in_2015 = movies['Year'] == 2015
# Movies released by Universal AND in 2015
movies[released_by_universal & released_in_2015]
# Movies released by Universal OR in 2015
movies[released_by_universal | released_in_2015]
Additional methods for filtering the DataFrame
- Less than or Greater than a specific value (only works with numeric values)
- movies[“Year”] < 1900
- movies[“Year”] > 1900
- In between some range of values (only works with numeric values)
- movies[“Year”].between(1900, 2000)
- the arguments of the between method are inclusive
- Text value containing a specific string
- movies.index.str.lower().str.contains(“dark”)
Grouping data
Let’s try to figure out which is the highest-grossing studio. To do this, we first need to update the Gross column to a float value. To do this, we first need to remove “$” and “,” from that column.
Make sure to edit the column, not only output the result by setting the column to the new value.
movies['Gross'] = movies['Gross'].str.replace('$', '').str.replace(',', '').astype(float)
Now we have our Gross column data as float values, we can group our data by Studio. Please note, since we are grouping our data by a column we need to provide an aggregation method for the numeric columns. We can only include the columns which we need, but we need to provide aggregation for one of the columns.
movies.groupby('Studio').agg({'Gross': 'sum'}).sort_values('Gross', ascending=False).head()
Output:
>>> movies.groupby('Studio').agg({'Gross': 'sum'}).sort_values('Gross', ascending=False).head()
Gross
Studio
Buena Vista 73585.0
Warner Brothers 58643.8
Fox 50420.8
Universal 44302.3
Sony 32822.5
- We first group our DataFrame by “Studio”
- Then we apply sum aggregation to our “Gross” column
- Then we sort the values by “Gross” column in a descending way (ascending=False)
- Then we filter our data to show only the top 5 rows
We barely scratched the surface here, but at least this gives you an understanding of what exactly we can do with Pandas in Python. It is one of the most important and powerful tools to have in your tool kit. I will be writing more articles that go deeper into the topics. So keep a lookout for that.
Disclaimer: I have referred to the “Pandas in Action” book by Borris Pashkaver. It is one of the best books you can find out on this topic.