December 11, 2024

Pandas – Working with Text: Everything to do with strings and pandas


In this article, I will go over the string manipulations we commonly need while using Pandas. Handling text data can be confusing and memory-hungry, so finding a fast, memory-efficient solution is a must. We will go over some of the use-cases I have come across and talk about how to tackle them.


Take a look at the data

Before we proceed with our use-cases, let’s quickly take a look at the data.

# Importing Libraries
import pandas as pd

# Importing Data
inspections = pd.read_csv('resources/chicago_food_inspections.csv')

>>> inspections
                                      Name             Risk
0               MARRIOT MARQUIS CHICAGO       Risk 1 (High)
1                              JETS PIZZA   Risk 2 (Medium)
2                               ROOM 1520      Risk 3 (Low)
3                MARRIOT MARQUIS CHICAGO      Risk 1 (High)
4                            CHARTWELLS       Risk 1 (High)
...                                    ...              ...
153805                         WOLCOTT'S      Risk 1 (High)
153806     DUNKIN DONUTS/BASKIN-ROBBINS     Risk 2 (Medium)
153807                           Cafe 608     Risk 1 (High)
153808                        mr.daniel's     Risk 1 (High)
153809                         TEMPO CAFE     Risk 1 (High)

[153810 rows x 2 columns]

>>> inspections.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153810 entries, 0 to 153809
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   Name    153810 non-null  object
 1   Risk    153744 non-null  object
dtypes: object(2)
memory usage: 2.3+ MB

>>> inspections['Risk'].value_counts()
Risk 1 (High)      107351
Risk 2 (Medium)     31845
Risk 3 (Low)        14529
All                    19
Name: Risk, dtype: int64

  • We have data related to food inspections conducted across the city of Chicago.
  • There are 153,810 rows and two columns.
  • Both columns are of object type, which is essentially the string type.
  • We can see that the Risk column has some null values.
  • The Risk column contains 4 main categories (plus some null values).

Some of the issues we can see in our data:

  • The letter casing differs between rows.
  • There is stray leading and trailing whitespace in the text values.

Letter Casing

Having inconsistent casing in your text data can be very annoying. Casing plays an important role when we need to compare, filter, or select strings. Having all strings in the same case speeds up the process and makes your life easier. So let’s see how we can update the casing of our text.

inspections['Name'] = inspections['Name'].str.lower()

>>> inspections['Name']
0                 marriot marquis chicago   
1                                jets pizza
2                                 room 1520
3                  marriot marquis chicago
4                              chartwells
                         ...
153805                           wolcott's
153806       dunkin donuts/baskin-robbins
153807                             cafe 608
153808                          mr.daniel's
153809                           tempo cafe
Name: Name, Length: 153810, dtype: object

Let’s walk through the code:

  • We focus on the Name column and try to change the casing of its text values to lower by using str.lower().
  • We first isolate the Name column from our DataFrame, which provides us with a Series object.
  • We can then access the string methods of the Series object through the .str accessor.
  • We use the lower method and replace the column with the new, lowercased column.

There are other transformations provided by str:

  • lower() -> This IS a Test => this is a test
  • upper() -> this is a test => THIS IS A TEST
  • capitalize() -> this is a test => This is a test
  • title() -> this is a test => This Is A Test
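
As a quick sketch, here is what those methods look like on our Name column (applied to a throwaway copy, so the lowercased data above stays untouched):

# a minimal sketch of the remaining casing methods on a copy of the column
names = inspections['Name'].copy()

names.str.upper()       # e.g. 'jets pizza' -> 'JETS PIZZA'
names.str.capitalize()  # e.g. 'jets pizza' -> 'Jets pizza'
names.str.title()       # e.g. 'jets pizza' -> 'Jets Pizza'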

Removing White Spaces

Oftentimes our string data has trailing or leading whitespace, which can cause annoying issues when trying to filter or search for specific text. Let’s devise a solution for that.

inspections['Name'] = inspections['Name'].str.strip()
>>> inspections['Name']
0              marriot marquis chicago
1                           jets pizza
2                            room 1520
3              marriot marquis chicago
4                           chartwells
                      ...
153805                       wolcott's
153806    dunkin donuts/baskin-robbins
153807                        cafe 608
153808                     mr.daniel's
153809                      tempo cafe
Name: Name, Length: 153810, dtype: object

There are three string methods available for removing the whitespace:

  • lstrip() -> removes the white space from left side of the string
  • rstrip() -> removes the white space from right side of the string
  • strip() -> removes the white space from both sides of the string

All three methods take an optional to_strip parameter, to which we can pass a set of characters to remove. All combinations of this set of characters will be stripped from the ends of the string. If None (the default), whitespace is removed.

The implementation is similar to the casing example above.
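
Here is a minimal sketch of the to_strip parameter; the sample values below are made up purely for illustration and are not from our dataset:

# stripping a custom character set; spaces, '(' and ')' are removed,
# but only from the two ends of each string
sample = pd.Series([' (high) ', '(low)', 'medium'])
sample.str.strip(' ()')   # -> 'high', 'low', 'medium'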


Applying the above transformation to multiple columns

Applying the above transformations to a single column is pretty straightforward. But when we have datasets with many columns of various data types, applying the transformations column by column becomes time-consuming. Let’s take a look at how we can handle such a scenario.

object_columns = inspections.select_dtypes(include=['object']).columns
for _col in object_columns:
    inspections[_col] = inspections[_col].str.lower().str.strip()
>>> inspections
                                Name             Risk
0            marriot marquis chicago    risk 1 (high)
1                         jets pizza  risk 2 (medium)
2                          room 1520     risk 3 (low)
3            marriot marquis chicago    risk 1 (high)
4                         chartwells    risk 1 (high)
...                              ...              ...
153805                     wolcott's    risk 1 (high)
153806  dunkin donuts/baskin-robbins  risk 2 (medium)
153807                      cafe 608    risk 1 (high)
153808                   mr.daniel's    risk 1 (high)
153809                    tempo cafe    risk 1 (high)

Let’s walk through the code:

  • We first create a list of all the columns with the object data type. In our example we only have 2 columns, both of object type, but a real dataset may have many columns with different data types.
  • Then we iterate over our object column list and apply the transformations.
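
For reference, here is an equivalent, loop-free sketch of the same transformation; DataFrame.apply calls the lambda once per selected column:

# the same lower() + strip() transformation without an explicit loop
object_columns = inspections.select_dtypes(include=['object']).columns
inspections[object_columns] = inspections[object_columns].apply(
    lambda col: col.str.lower().str.strip()
)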

Replacing Strings

Before we start with the replacement, remember we had some null values in the Risk column. Let’s get rid of those null values.

inspections = inspections.dropna(subset=['Risk'])
>>> inspections.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 153744 entries, 0 to 153809
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   Name    153744 non-null  object
 1   Risk    153744 non-null  object
dtypes: object(2)
memory usage: 3.5+ MB

Let’s walk through the code:

  • We drop all the rows that have a null value in the Risk column using the dropna method.
  • We pass the list of columns to check for null values to the subset parameter.
  • Upon inspection, we can see that there are no longer any null values in the Risk column.

Let’s get on with replacing the string text now. There are 4 unique values in the Risk column, out of which one, All, doesn’t line up with the rest. So let’s find a way to replace it with “risk 4 (extreme)”.

inspections['Risk'] = inspections['Risk'].replace({'all': 'risk 4 (extreme)'})
>>> inspections['Risk'].unique()
array(['risk 1 (high)', 'risk 2 (medium)', 'risk 3 (low)',
       'risk 4 (extreme)'], dtype=object)

Let’s walk through the code:

  • We use the replace method for replacing the values within our Risk column.
  • We provide a dictionary of key-value pairs, with the key being the text we want to replace and the value being the replacement text.
  • Note that we are replacing “all” instead of “All” because we already converted our strings to lower case in the examples above.
  • Now when we look at the unique values, we can see that there are only 4 unique values in our Risk column.

Character Replacement

Replacing character(s) can have multiple use-cases. We may need to replace one character or multiple characters, at the beginning or at the end, or in the middle, it could be dynamic or static. Also, there could be multiple ways to get to the final result. I will try to cover all these scenarios here.


Extracting a single character at a given location

Let’s look at our Risk column; the values are risk 1 (high), risk 2 (medium), and so on. Say we want to extract the number from the string. We can see that the number always sits at index position 5, i.e. its position does not change from value to value. Let’s look at some ways we can extract it.

>>> inspections['Risk'].str.slice(5, 6).head()
0    1
1    2
2    3
3    1
4    1
Name: Risk, dtype: object

>>> inspections['Risk'].str[5:6].head()
0    1
1    2
2    3
3    1
4    1
Name: Risk, dtype: object

Let’s walk through the code:

  • Our desired character is at index position 5. Note that indexes start at 0.
  • We can either use the slice method provided by the str accessor or slice the characters directly.
  • Note that the upper bound is exclusive, i.e. it is not included in the output.

Extracting multiple characters from the string

We can modify the above code to extract multiple characters. The method above involved providing a lower bound and an upper bound, which translate to the first index and the second index.

If we only provide one index, the upper bound defaults to the end of the string. This way we get all the characters from the provided index to the end. Let’s say we want to extract “(low)” or “(medium)” from our string:

>>> inspections['Risk'].str.slice(7).head()
0      (high)
1    (medium)
2       (low)
3      (high)
4      (high)
Name: Risk, dtype: object

>>> inspections['Risk'].str[7:].head()
0      (high)
1    (medium)
2       (low)
3      (high)
4      (high)
Name: Risk, dtype: object

Let’s walk through the code:

  • Since we only provide one index position (7), we will get all the characters starting from that position index till the end.

Knowing that the upper bound is exclusive, we can provide one to extract a particular substring. Let’s say we only want the category label from our string: “low”, “medium”, and so on.

>>> inspections['Risk'].str.slice(8, -1).head()
0      high
1    medium
2       low
3      high
4      high
Name: Risk, dtype: object

>>> inspections['Risk'].str[8:-1].head()
0      high
1    medium
2       low
3      high
4      high
Name: Risk, dtype: object

Let’s walk through the code:

  • There is a notation we can use to refer to the last index position: -1.
  • Since the upper bound isn’t included, our output is the string of characters starting at index position 8 and ending just before the last index.

Filtering using string methods

We can also use string methods to filter our DataFrame. The general idea is to define a logic that returns a series of Boolean values, which then can be used to filter our DataFrame.


String containing a set of characters

We can use the contains method for filtering the rows that contain a specific set of characters. In our dataset, let’s say we want to filter the names which contain “pizza”.
Please note that string comparisons are case sensitive; that’s why, at the beginning of this article, we converted all of our string values to lower case, which makes these logical operations easier.

contains_pizza = inspections['Name'].str.contains('pizza')
>>> contains_pizza.head()
0    False
1     True
2    False
3    False
4    False
Name: Name, dtype: bool

>>> inspections[contains_pizza].head()
                             Name             Risk
1                      jets pizza  risk 2 (medium)
19  nancy's home of stuffed pizza    risk 1 (high)
27     nary's grill & pizza ,inc.    risk 1 (high)
29            narys grill & pizza    risk 1 (high)
68                  colutas pizza    risk 1 (high)
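
If you ever need a match that ignores casing without lowercasing the data first, contains also accepts a case parameter; a minimal sketch:

# case-insensitive match; na=False treats missing names as "no match"
contains_pizza_any_case = inspections['Name'].str.contains('pizza', case=False, na=False)
inspections[contains_pizza_any_case].head()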

String starting with or ending with a set of characters

We can also filter our dataset depending on where the text appears in the string. Let’s say we want to filter for names starting with “tacos” or ending with “tacos”. We can also combine multiple conditions if needed, as sketched after the examples below.

starts_with_tacos = inspections['Name'].str.startswith('tacos')
>>> inspections[starts_with_tacos].head()
                     Name           Risk
69           tacos nietos  risk 1 (high)
556   tacos el tio 2 inc.  risk 1 (high)
675      tacos don gabino  risk 1 (high)
958   tacos el tio 2 inc.  risk 1 (high)
1036  tacos el tio 2 inc.  risk 1 (high)

ends_with_tacos = inspections['Name'].str.endswith('tacos')
>>> inspections[ends_with_tacos].head()
              Name           Risk
382   lazo's tacos  risk 1 (high)
569   lazo's tacos  risk 1 (high)
2652  flying tacos   risk 3 (low)
3250  jony's tacos  risk 1 (high)
3812  paco's tacos  risk 1 (high)
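
Combining the conditions is just a matter of combining the Boolean Series with & (both must hold) or | (either may hold); a quick sketch:

# names that both start and end with "tacos" (use | for "either one")
starts_and_ends_with_tacos = starts_with_tacos & ends_with_tacos
inspections[starts_and_ends_with_tacos].head()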

There are multiple ways you can filter strings by using regex functions. I will write a separate article on it, so keep an eye out for that.


Splitting strings

Pandas provides two methods for splitting strings: split and rsplit. The only difference is that split works from the start of the string, whereas rsplit works from the end (the right, hence the name rsplit).
Aside from the delimiter (the first argument), we can also pass the parameter n, the number of splits we want to make. By default it is set to -1, which means all possible splits are made.
If we want the output as a DataFrame, we can set another parameter, expand, to True. By default it is set to False and the output is a Series object containing lists of strings.

Let’s try splitting our Risk column:

>>> inspections['Risk'].str.split().head()
0      [risk, 1, (high)]
1    [risk, 2, (medium)]
2       [risk, 3, (low)]
3      [risk, 1, (high)]
4      [risk, 1, (high)]
Name: Risk, dtype: object

>>> inspections['Risk'].str.split("(").head()
0      [risk 1 , high)]
1    [risk 2 , medium)]
2       [risk 3 , low)]
3      [risk 1 , high)]
4      [risk 1 , high)]
Name: Risk, dtype: object

>>> inspections['Risk'].str.rsplit(n=1, expand=True).head()
        0         1
0  risk 1    (high)
1  risk 2  (medium)
2  risk 3     (low)
3  risk 1    (high)
4  risk 1    (high)

Let’s walk through our code:

  • split()
    • Since we didn’t provide any pattern, it splits the string on whitespace.
    • There were 2 whitespaces in each value, so we end up with 3 items in each list.
  • split(“(“)
    • Here we defined the delimiter “(“, so it splits the string on that delimiter.
  • rsplit(n=1, expand=True)
    • The string is split from the right, using whitespace as the delimiter.
    • We specified that we need only 1 split, so after that first split the remainder of the string is kept as a single piece.
    • We also specified that we need a DataFrame as an output using the expand parameter and setting it to True.
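
A common follow-up to expand=True is keeping the resulting pieces as their own columns. Here is a minimal sketch; the column names risk_level and risk_label are made up purely for illustration:

# split once from the right and keep the two pieces as new columns
parts = inspections['Risk'].str.rsplit(n=1, expand=True)
parts.columns = ['risk_level', 'risk_label']   # hypothetical names
inspections = inspections.join(parts)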

Converting String to Category

Pandas provides a special data type called category for storing string values. A category is useful when a column contains many repeated strings. Some examples are blood groups, gender, department, marital status, etc.

In our dataset, we can see that the Risk column has 4 unique values which are repeated across all the rows. Let’s convert this column to a category datatype.

inspections['Risk'] = inspections['Risk'].astype('category')
>>> inspections['Risk'].head()
0      risk 1 (high)
1    risk 2 (medium)
2       risk 3 (low)
3      risk 1 (high)
4      risk 1 (high)
Name: Risk, dtype: category
Categories (4, object): ['risk 1 (high)', 'risk 2 (medium)', 'risk 3 (low)', 'risk 4 (extreme)']

We can see that the data type is now category, with 4 object categories. Internally, pandas stores each unique value only once and refers to it. This speeds up operations and saves a lot of memory.
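
If you want to see the savings on your own machine, one quick way is to compare the footprint of the same column as object and as category; the exact numbers will depend on your data and pandas version:

# memory footprint of the same column as plain object vs. category
inspections['Risk'].astype('object').memory_usage(deep=True)
inspections['Risk'].memory_usage(deep=True)   # noticeably smaller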


We can now simply replace the categories with our desired result by using the .cat.rename_categories method.

inspections['Risk'] = inspections['Risk'].cat.rename_categories({
    'risk 1 (high)': 'high',
    'risk 2 (medium)': 'medium',
    'risk 3 (low)': 'low',
    'risk 4 (extreme)': 'extreme',
})
>>> inspections['Risk'].head()
0      high
1    medium
2       low
3      high
4      high
Name: Risk, dtype: category
Categories (4, object): ['high', 'medium', 'low', 'extreme']

Let’s walk through the code:

  • We used the .cat.rename_categories method for renaming the categories.
  • We provided a dictionary of key-value pairs, where the key is the existing category name and the value is the new name.
  • We could also rename these to integer values and convert the column to int if needed.
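
As a sketch of that last point, assuming an arbitrary 1–4 numbering chosen purely for illustration:

# map each category to an integer code of our choosing, then cast
inspections['Risk'] = inspections['Risk'].cat.rename_categories({
    'high': 1,
    'medium': 2,
    'low': 3,
    'extreme': 4,
}).astype(int)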

Hopefully, I was able to cover the majority of the text-related use-cases in this article. Feel free to reach out if I missed anything. Again, this isn’t an in-depth tutorial for all the text-related functions provided by Pandas. This is just to get you started and answer some frequently asked questions.

