Series: Building Block for Pandas – An Overview

A Series is a one-dimensional labeled array for homogeneous data, which means all of its values share a single data type (dtype). Pandas assigns each Series value a label (an identifier used to locate the value) and an order (its position in the line). Positions in pandas start at 0, which means the first value occupies position 0.

A Series combines the best of lists and dictionaries. Like a list, it holds values in a sequenced order, and like a dictionary, it provides a key/label for locating each value.
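
To make the list/dictionary comparison concrete, here is a minimal sketch that reads the same value once by label and once by position. It assumes the calorie data used later in this post; the variable names by_label and by_position are only for illustration.

```python
import pandas as pd

# Calorie data borrowed from the example used later in this post.
diet = pd.Series({"Cereal": 125, "Chocolate Bar": 406, "Ice Cream Sundae": 342})

# Dictionary-style lookup: locate the value by its label.
by_label = diet.loc["Chocolate Bar"]

# List-style lookup: locate the value by its 0-based position.
by_position = diet.iloc[1]

print(by_label, by_position)
```

Both lookups reach the same element, because "Chocolate Bar" is the label of the value sitting at position 1.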

Overview

Let’s look at the arguments accepted when creating a Series in pandas:

Parameter	Type	Description
data	array-like, Iterable, dict, or scalar value	Contains data stored in Series. If data is a dict, argument order is maintained.
index	array-like or Index (1d)	Values must be hashable and have the same length as data. Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, …, n - 1) if not provided. If data is dict-like and the index is None, then the keys in the data are used as the index. If the index is not None, the resulting Series is reindexed with the index values.
dtype	str, numpy.dtype, or ExtensionDtype, optional	The data type for the output Series. If not specified, this will be inferred from data.
name	str, optional	The name to give to the Series.
copy	bool, default False	Copy input data. Only affects Series or 1d ndarray input.
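
The parameters above can be sketched in a few lines. This is an illustrative example rather than an exhaustive tour: the "Salad" label and the names prices, plain, and subset are made up for the demonstration.

```python
import pandas as pd

calorie_info = {"Cereal": 125, "Chocolate Bar": 406, "Ice Cream Sundae": 342}

# From a dict: the keys become the index and insertion order is kept.
diet = pd.Series(calorie_info, name="calories")

# From a list with an explicit index and dtype.
prices = pd.Series([1.5, 2.0, 3.25], index=["a", "b", "c"], dtype="float64")

# From a list without an index: pandas falls back to a RangeIndex.
plain = pd.Series([10, 20, 30])

# A dict plus an index reindexes the data; "Salad" (a made-up label)
# is not a key in the dict, so its value becomes NaN.
subset = pd.Series(calorie_info, index=["Cereal", "Salad"])

print(diet.index.tolist())
print(plain.index)
print(subset)
```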

Numpy dependency

A pandas Series object depends on NumPy for storing its values. Let’s take a look.


import pandas as pd

# Calories per food item; the dict keys become the Series index.
calorie_info = {
    "Cereal": 125,
    "Chocolate Bar": 406,
    "Ice Cream Sundae": 342,
}
diet = pd.Series(calorie_info)

>>> type(diet.values)
<class 'numpy.ndarray'>

>>> type(diet.index)
<class 'pandas.core.indexes.base.Index'>

Here we can see that the values inside the Series are stored as a NumPy array.
- The ndarray object optimizes for speed and efficiency by relying on the lower-level C programming language for many of its calculations.
The index values are stored in a pandas Index object.

General Attributes of Series

Attribute	Description
.values	returns an ndarray of the values
.index	returns the Index object holding the labels
.dtype	type of the values inside the Series – note that a Series has a single dtype
.size	returns the total number of elements in the Series
.shape	returns a tuple of dimensions – a Series has only one dimension, so it returns (no_rows, )
.is_unique	returns True if values are unique, else returns False
.is_monotonic_increasing	returns True if each value is greater than or equal to the previous one (increments do not have to be equal), else returns False – older code uses .is_monotonic, which was removed in pandas 2.0
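
As a quick illustration, the attributes can be read directly off the diet Series from earlier. The sketch uses .is_monotonic_increasing because the shorter .is_monotonic spelling is not available in pandas 2.0 and later.

```python
import pandas as pd

diet = pd.Series({"Cereal": 125, "Chocolate Bar": 406, "Ice Cream Sundae": 342})

print(diet.values)      # the underlying NumPy array
print(diet.index)       # the Index object holding the labels
print(diet.dtype)       # the single dtype shared by all values
print(diet.size)        # 3
print(diet.shape)       # (3,)
print(diet.is_unique)   # True, no calorie value repeats

# 125 -> 406 -> 342 goes down at the end, so this prints False.
print(diet.is_monotonic_increasing)
```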

Statistical Operations

Let’s briefly look at some of the statistical operations available for series

Operation	Description
.count()	returns the count of non-null values
.sum()	returns the sum of the values – null values are ignored by default; with skipna=False the result is NaN if any value is null – min_count=x sets the minimum number of valid values required: the sum is returned only if at least x non-null values are present, otherwise the result is NaN
.product()	returns the product of all the non-null values – the skipna and min_count parameters are available
.cumsum()	returns a Series with the cumulative sum as values – by default, NaN is returned only for the rows that are NaN; if skipna=False, NaN is returned for every row from the first NaN onward
.pct_change()	returns the fractional change from one value to the next (multiply by 100 for a percentage) – fill_method = “pad” / “ffill” / “bfill” / “backfill” controls how NaN values are filled before the calculation – pad / ffill -> forward fill -> historical default -> replaces NaN with the last valid value – bfill / backfill -> backward fill -> replaces NaN with the next valid value – note that pandas 2.1 and later deprecate every fill_method option except None
.mean()	returns mean of all the values
.median()	returns median of all the values
.std()	returns standard deviation of all the values
.max()	returns the maximum value from all the values
.min()	returns the minimum value from all the values
.describe()	returns a series with count, mean, std, min, 25%, 50%, 75%, max values
.sample()	returns a random sample of values from the Series – one value by default (n=1)
.unique()	returns an array with unique values from the series
.nunique()	returns the count of unique values from the series
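
The null-handling options in the table are easier to see on a small Series that contains one NaN. The sketch below uses made-up numbers; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0])

valid_count = s.count()                  # 3 non-null values
total = s.sum()                          # 7.0, the NaN is skipped
strict_total = s.sum(skipna=False)       # NaN, because one value is null
guarded_total = s.sum(min_count=4)       # NaN, only 3 valid values exist
product = s.product()                    # 8.0

running = s.cumsum()                     # 1.0, 3.0, NaN, 7.0
running_strict = s.cumsum(skipna=False)  # 1.0, 3.0, NaN, NaN

# pct_change returns a fraction, so 0.10 means +10%.
prices = pd.Series([100.0, 110.0, 99.0])
change = prices.pct_change()             # NaN, then roughly 0.10 and -0.10

print(running)
print(running_strict)
print(change)
```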

Arithmetic Operations

Arithmetic operations on a Series are element-wise, which means the operation is applied to each element individually.

Operation	Methods
Addition	• series + X • series.add(X)
Subtraction	• series – X • series.sub(X) • series.subtract(X)
Multiplication	• series * X • series.mul(X) • series.multiply(X)
Division	• series / X • series.div(X) • series.divide(X)
Floor Division	• series // X • series.floordiv(X)
Modulo	• series % X • series.mod(X)

Here, X can be a single value or another Series.
When X is a single value, the operation is applied to every value in the Series.
When X is another Series, the operation is applied to values with matching index labels.
- If a label is present in only one of the two Series (for example, when one Series is longer than the other), the result for that label is NaN.
- If either operand is NaN at a label, the result for that label is NaN as well.


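import numpy as np
import pandas as pd

# np.nan is the portable spelling; the np.NAN alias was removed in NumPy 2.0.
s1 = pd.Series([1, 2, np.nan, 4])
s2 = pd.Series([5, np.nan, 6, 7, 8])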

>>> s1 + s2
0     6.0
1     NaN
2     NaN
3    11.0
4     NaN
dtype: float64

>>> s1 - s2
0   -4.0
1    NaN
2    NaN
3   -3.0
4    NaN
dtype: float64

>>> s1 * s2
0     5.0
1     NaN
2     NaN
3    28.0
4     NaN
dtype: float64

>>> s1 / s2
0    0.200000
1         NaN
2         NaN
3    0.571429
4         NaN
dtype: float64

>>> s1 // s2
0    0.0
1    NaN
2    NaN
3    0.0
4    NaN
dtype: float64

>>> s1 % s2
0    1.0
1    NaN
2    NaN
3    4.0
4    NaN
dtype: float64
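
The examples above pair two Series that share a default integer index. The following sketch covers the two remaining cases: a scalar X, and two Series whose labels only partly overlap. The labels "x", "y", and "z" are made up for the illustration.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, np.nan, 4])

# Scalar X: the operation is applied to every element.
doubled = s1 * 2          # 2.0, 4.0, NaN, 8.0
same = s1.mul(2)          # method form of the same operation

# Series X: values are paired by label, not by position.
a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([10, 20], index=["y", "z"])
total = a + b             # x: NaN, y: 12.0, z: NaN

print(doubled)
print(total)
```

Only "y" exists in both a and b, so it is the only label with a numeric result.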

Sorting Series

We can use the sort_values or sort_index method to sort by values or by index, respectively. Let’s look at the parameters that can be passed to these methods.
Other parameters are also available; you can find them in the pandas documentation.

Parameter	Type	Description
ascending	bool or list of bools, default True	If True, sort values in ascending order, otherwise descending.
na_position	{‘first’ or ‘last’}, default ‘last’	Argument ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end.
inplace	bool, default False	If True, perform operation in-place.

We can overwrite the Series by setting the inplace parameter to True. Behind the scenes, however, pandas creates a copy of the Series and then reassigns the variable to the new object. Thus, contrary to popular belief, inplace doesn’t offer any performance benefits.
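
Here is a small sketch of both sorting methods and of the ascending and na_position parameters. The data and labels are made up; note that without inplace=True each call returns a new Series and leaves the original untouched.

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0, 2.0], index=["d", "a", "c", "b"])

# Ascending by value; NaN goes last by default.
by_value = s.sort_values()

# Descending by value, with NaN moved to the front.
desc_nan_first = s.sort_values(ascending=False, na_position="first")

# Ascending by index label.
by_index = s.sort_index()

print(by_value)
print(desc_nan_first)
print(by_index)
```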

Getting the smallest and largest values

We can simply sort our Series in ascending or descending order and then limit the result to the desired number of rows. But since this is a very common query, pandas offers dedicated methods for this purpose: nlargest and nsmallest. By default, 5 rows are returned, but you can change this by passing an integer as an argument. Here is a small example:

>>> diet.nlargest(2)
Chocolate Bar       406
Ice Cream Sundae    342
dtype: int64

>>> diet.nsmallest(1)
Cereal    125
dtype: int64

Counting values in a Series

We can use the value_counts method, which counts the number of occurrences of each series value. The parameters for this method are:

Parameter	Type	Description
normalize	bool, default False	If True then the object returned will contain the relative frequencies of the unique values.
sort	bool, default True	Sort by frequencies.
ascending	bool, default False	Sort in ascending order.
bins	int, optional	Rather than count values, group them into half-open bins, a convenience for pd.cut, only works with numeric data.
dropna	bool, default True	Don’t include counts of NaN.
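
A short sketch of these parameters in action, using a made-up list of food names with one missing entry and a second made-up numeric Series for the bins option:

```python
import numpy as np
import pandas as pd

foods = pd.Series(["Cereal", "Ice Cream", "Cereal", np.nan, "Cereal", "Ice Cream"])

# Plain counts, most frequent first; NaN is left out by default.
counts = foods.value_counts()                 # Cereal 3, Ice Cream 2

# Relative frequencies among the non-null values.
shares = foods.value_counts(normalize=True)   # Cereal 0.6, Ice Cream 0.4

# Keep NaN as its own category.
with_nan = foods.value_counts(dropna=False)

# bins only works with numeric data: two bins, three values in each.
numbers = pd.Series([1, 2, 3, 10, 11, 12])
binned = numbers.value_counts(bins=2)

print(counts)
print(shares)
print(with_nan)
print(binned)
```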

Apply method

By using the apply method, we can invoke a function on every series value.

Let’s say that, for our diet Series, we need to find out whether each food is high-calorie or low-calorie. We can simply define a function that takes care of the logic and then apply that function to each value within the Series.

def high_or_low_calories(calory):
    return 'High' if calory > 150 else 'Low'

>>> diet
Cereal              125
Chocolate Bar       406
Ice Cream Sundae    342
dtype: int64

>>> diet.apply(high_or_low_calories)
Cereal               Low
Chocolate Bar       High
Ice Cream Sundae    High
dtype: object

This concludes the basic foundation of Series within Pandas. I will add additional posts for any advanced use cases as I come across them, so keep an eye out for that.

Preet Parmar
