In this article, I will go over the common string manipulations we may need while using Pandas. Handling text data is one of the most confusing and memory-hungry parts of data cleaning, so finding a quick, efficient solution is a must. We will go over some of the use cases I have come across and talk about how to tackle them.
Take a look at the data
Before we proceed with our use cases, let's quickly take a look at our data.
# Importing Libraries
import pandas as pd
# Importing Data
inspections = pd.read_csv('resources/chicago_food_inspections.csv')
>>> inspections
Name Risk
0 MARRIOT MARQUIS CHICAGO Risk 1 (High)
1 JETS PIZZA Risk 2 (Medium)
2 ROOM 1520 Risk 3 (Low)
3 MARRIOT MARQUIS CHICAGO Risk 1 (High)
4 CHARTWELLS Risk 1 (High)
... ... ...
153805 WOLCOTT'S Risk 1 (High)
153806 DUNKIN DONUTS/BASKIN-ROBBINS Risk 2 (Medium)
153807 Cafe 608 Risk 1 (High)
153808 mr.daniel's Risk 1 (High)
153809 TEMPO CAFE Risk 1 (High)
[153810 rows x 2 columns]
>>> inspections.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153810 entries, 0 to 153809
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 153810 non-null object
1 Risk 153744 non-null object
dtypes: object(2)
memory usage: 2.3+ MB
>>> inspections['Risk'].value_counts()
Risk 1 (High) 107351
Risk 2 (Medium) 31845
Risk 3 (Low) 14529
All 19
Name: Risk, dtype: int64
- We have data related to food inspections conducted across the city of Chicago.
- There are 153,810 rows and two columns.
- Both columns are of object dtype, which is how pandas stores strings by default.
- We can see that the Risk column has some null values.
- The Risk column contains 4 main categories (and some null values).
Some of the issues we can see in our data:
- The letter casing differs from row to row.
- There are whitespaces around some of the text values.
Letter Casing
Having different casing for your text data can be very annoying. The casing of the text plays an important role when we need to compare, filter, or select strings. Having consistently cased strings speeds up the process and makes your life easier. So let's see how we can update the casing of our text.
inspections['Name'] = inspections['Name'].str.lower()
>>> inspections['Name']
0 marriot marquis chicago
1 jets pizza
2 room 1520
3 marriot marquis chicago
4 chartwells
...
153805 wolcott's
153806 dunkin donuts/baskin-robbins
153807 cafe 608
153808 mr.daniel's
153809 tempo cafe
Name: Name, Length: 153810, dtype: object
Let’s walk through the code:
- We focus on the Name column and change the casing of its text values to lower case using str.lower().
- We first isolate the Name column from our DataFrame, which gives us a Series object.
- We can then use the Series' string methods through the .str accessor.
- We use the lower method from the str accessor and replace the column with the new column.
There are other transformations provided by str:
- lower() -> This IS a Test => this is a test
- upper() -> this is a test => THIS IS A TEST
- capitalize() -> this is a test => This is a test
- title() -> this is a test => This Is A Test
Removing White Spaces
Oftentimes our string data has trailing or leading whitespace, which can cause annoying issues when trying to filter or search for specific text. Let's devise a solution for that.
inspections['Name'] = inspections['Name'].str.strip()
>>> inspections['Name']
0 marriot marquis chicago
1 jets pizza
2 room 1520
3 marriot marquis chicago
4 chartwells
...
153805 wolcott's
153806 dunkin donuts/baskin-robbins
153807 cafe 608
153808 mr.daniel's
153809 tempo cafe
Name: Name, Length: 153810, dtype: object
There are three string methods available for removing the whitespace:
- lstrip() -> removes the white space from left side of the string
- rstrip() -> removes the white space from right side of the string
- strip() -> removes the white space from both sides of the string
All three methods take an optional to_strip parameter, to which we can pass a set of characters to be removed. Any combination of those characters will be stripped from the ends. If None, whitespace is removed.
The implementation is similar to the casing example above.
Applying the above transformation to multiple columns
Applying the above transformation to a single column is pretty straightforward. But when we have datasets with multiple columns and various data types, it can be very time-consuming to apply these transformations to individual columns. Let's take a look at how we can handle such a scenario.
object_columns = inspections.select_dtypes(include=['object']).columns
for _col in object_columns:
    inspections[_col] = inspections[_col].str.lower().str.strip()
>>> inspections
Name Risk
0 marriot marquis chicago risk 1 (high)
1 jets pizza risk 2 (medium)
2 room 1520 risk 3 (low)
3 marriot marquis chicago risk 1 (high)
4 chartwells risk 1 (high)
... ... ...
153805 wolcott's risk 1 (high)
153806 dunkin donuts/baskin-robbins risk 2 (medium)
153807 cafe 608 risk 1 (high)
153808 mr.daniel's risk 1 (high)
153809 tempo cafe risk 1 (high)
Let’s walk through the code:
- We first create a list of all the columns with object data type. In our example we only have 2 columns, both with object data type, but a real dataset may have many columns with different data types.
- Then we iterate over our object column list and apply the transformations.
Replacing Strings
Before we start with the replacement, remember we had some null values in the Risk column. Let’s get rid of those null values.
inspections = inspections.dropna(subset=['Risk'])
>>> inspections.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 153744 entries, 0 to 153809
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 153744 non-null object
1 Risk 153744 non-null object
dtypes: object(2)
memory usage: 3.5+ MB
Let’s walk through the code:
- We drop all the rows that have a null value in the Risk column using the dropna method.
- We pass a list of columns to check for null values as the argument to the subset parameter.
- Upon inspecting, we can see that we no longer have any null values in the Risk column.
Let's get on with replacing the string text now. There are 4 unique values in the Risk column, one of which, All, doesn't line up with the rest. So let's find a way to replace it with "risk 4 (extreme)".
inspections['Risk'] = inspections['Risk'].replace({'all': 'risk 4 (extreme)'})
>>> inspections['Risk'].unique()
array(['risk 1 (high)', 'risk 2 (medium)', 'risk 3 (low)',
'risk 4 (extreme)'], dtype=object)
Let’s walk through the code:
- We use the replace method for replacing the values within our Risk column.
- We provide a dictionary of key-value pairs, with the key as the text we want to replace and the value as the replacement text.
- Note that we replace "all" instead of "All" because we already converted our strings to lower case in the examples above.
- Now when we look at the unique values, we can see there are only 4 unique values in our Risk column.
Character Extraction
Extracting character(s) can serve multiple use cases. We may need to pull out one character or several, at the beginning, at the end, or in the middle; the position could be dynamic or static. There can also be multiple ways to get to the final result. I will try to cover these scenarios here.
Extracting a single character at a given location
Let's look at our Risk column: the values are risk 1 (high), risk 2 (medium), and so on. Let's say we want to extract the number from the string. We can see that the number always sits at the static index position 5, i.e. its position doesn't change from value to value. Let's look at some ways we can extract it.
>>> inspections['Risk'].str.slice(5, 6).head()
0 1
1 2
2 3
3 1
4 1
Name: Risk, dtype: object
>>> inspections['Risk'].str[5:6].head()
0 1
1 2
2 3
3 1
4 1
Name: Risk, dtype: object
Let’s walk through the code:
- Our desired character is at index position 5. Note that indexes start at 0.
- We can either use the slice method provided by the str accessor, or slice the characters directly with bracket notation.
- Note that the upper bound is exclusive, i.e. it is not included in the output.
Extracting multiple characters from the string
We can modify the above code to extract multiple characters. The method above involves providing a lower bound and an upper bound, which translate to a start index and a stop index.
If we only provide one index, the upper bound defaults to the end of the string. This way we get all the characters from the provided index through the last one. Let's say we want to extract "(low)" or "(medium)" from our string:
>>> inspections['Risk'].str.slice(7).head()
0 (high)
1 (medium)
2 (low)
3 (high)
4 (high)
Name: Risk, dtype: object
>>> inspections['Risk'].str[7:].head()
0 (high)
1 (medium)
2 (low)
3 (high)
4 (high)
Name: Risk, dtype: object
Let’s walk through the code:
- Since we only provide one index position (7), we will get all the characters starting from that position index till the end.
Knowing that the upper bound is exclusive, we can use it to get a particular run of characters. Let's say we only want the category from our string: "low", "medium", and so on.
>>> inspections['Risk'].str.slice(8, -1).head()
0 high
1 medium
2 low
3 high
4 high
Name: Risk, dtype: object
>>> inspections['Risk'].str[8:-1].head()
0 high
1 medium
2 low
3 high
4 high
Name: Risk, dtype: object
Let’s walk through the code:
- We can use the notation " -1 " to refer to the last index position.
- Since the upper bound isn't included, our output is the string of characters starting at index position 8 up to, but excluding, the last index.
Filtering using string methods
We can also use string methods to filter our DataFrame. The general idea is to define a logic that returns a series of Boolean values, which then can be used to filter our DataFrame.
String containing a set of characters
We can use contains method for filtering the rows which contain a specific set of characters. In our dataset, let’s say we want to filter the names which contain “pizza”.
Please note that string comparisons are case sensitive; that's why, at the beginning of this article, we transformed all of our string values to lower case, making logical operations on them easier.
contains_pizza = inspections['Name'].str.contains('pizza')
>>> contains_pizza.head()
0 False
1 True
2 False
3 False
4 False
Name: Name, dtype: bool
>>> inspections[contains_pizza].head()
Name Risk
1 jets pizza risk 2 (medium)
19 nancy's home of stuffed pizza risk 1 (high)
27 nary's grill & pizza ,inc. risk 1 (high)
29 narys grill & pizza risk 1 (high)
68 colutas pizza risk 1 (high)
String starting with or ending with a set of characters
We can also filter our dataset depending on the position of our string. Let’s say we want to filter our dataset which has names ending with “tacos” and starting with “tacos”. We can also combine multiple conditions if needed.
starts_with_tacos = inspections['Name'].str.startswith('tacos')
>>> inspections[starts_with_tacos].head()
Name Risk
69 tacos nietos risk 1 (high)
556 tacos el tio 2 inc. risk 1 (high)
675 tacos don gabino risk 1 (high)
958 tacos el tio 2 inc. risk 1 (high)
1036 tacos el tio 2 inc. risk 1 (high)
ends_with_tacos = inspections['Name'].str.endswith('tacos')
>>> inspections[ends_with_tacos].head()
Name Risk
382 lazo's tacos risk 1 (high)
569 lazo's tacos risk 1 (high)
2652 flying tacos risk 3 (low)
3250 jony's tacos risk 1 (high)
3812 paco's tacos risk 1 (high)
There are multiple ways you can filter strings by using regex functions. I will write a separate article on it, so keep an eye out for that.
Splitting strings
Pandas provides two methods for splitting strings: split and rsplit. The only difference is that split starts splitting from the beginning of the string, whereas rsplit starts from the end (the right, hence the name rsplit).
Aside from the first argument, the delimiter, we can also provide the parameter n, the number of splits we want to make. By default it's -1, which means all possible splits are made.
If we want the output in a DataFrame then we can set another parameter, expand, to True. By default, it’s set to False and our output is a Series object containing the list of strings.
Let’s try splitting our Risk column
>>> inspections['Risk'].str.split().head()
0 [risk, 1, (high)]
1 [risk, 2, (medium)]
2 [risk, 3, (low)]
3 [risk, 1, (high)]
4 [risk, 1, (high)]
Name: Risk, dtype: object
>>> inspections['Risk'].str.split("(").head()
0 [risk 1 , high)]
1 [risk 2 , medium)]
2 [risk 3 , low)]
3 [risk 1 , high)]
4 [risk 1 , high)]
Name: Risk, dtype: object
>>> inspections['Risk'].str.rsplit(n=1, expand=True).head()
0 1
0 risk 1 (high)
1 risk 2 (medium)
2 risk 3 (low)
3 risk 1 (high)
4 risk 1 (high)
Let’s walk through our code:
- split()
- Since we didn't provide any pattern, it splits the string on whitespace.
- There are two whitespaces, so we end up with three values in the list.
- split("(")
- Here we defined the delimiter "(", so it splits the string using that delimiter.
- rsplit(n=1, expand=True)
- The string is split from the right using whitespace as the delimiter.
- We specified that we need only 1 split, so after the first split the rest of the string is returned as one piece.
- We also specified that we want a DataFrame as output by setting the expand parameter to True.
Converting String to Category
Pandas provides a special data type called category for storing string values. A category is useful when a column contains many duplicate strings. Some examples are blood groups, gender, department, marital status, etc.
In our dataset, we can see that the Risk column has 4 unique values which are repeated across all the rows. Let’s convert this column to a category datatype.
inspections['Risk'] = inspections['Risk'].astype('category')
>>> inspections['Risk'].head()
0 risk 1 (high)
1 risk 2 (medium)
2 risk 3 (low)
3 risk 1 (high)
4 risk 1 (high)
Name: Risk, dtype: category
Categories (4, object): ['risk 1 (high)', 'risk 2 (medium)', 'risk 3 (low)', 'risk 4 (extreme)']
We can see that the data type is category, which holds 4 object values. Internally, pandas stores each unique value just once and refers to it with a small integer code. This speeds up operations and saves a lot of memory.
We can now simply replace the categories with our desired result by using the .cat.rename_categories method.
inspections['Risk'] = inspections['Risk'].cat.rename_categories({
'risk 1 (high)': 'high',
'risk 2 (medium)': 'medium',
'risk 3 (low)': 'low',
'risk 4 (extreme)': 'extreme',
})
>>> inspections['Risk'].head()
0 high
1 medium
2 low
3 high
4 high
Name: Risk, dtype: category
Categories (4, object): ['high', 'medium', 'low', 'extreme']
Let’s walk through the code:
- We used the .cat.rename_categories method for renaming the categories.
- We provided a dictionary of key-value pairs, where the key is the text you want to replace and the value is the replacement text.
- We could also rename these to integer values and convert the column to int if needed.
Hopefully I was able to cover the majority of text-related use cases in this article. Feel free to reach out if I missed anything. Again, this isn't an in-depth tutorial on every text function Pandas provides; it's just meant to get you started and answer some frequently asked questions.