October 16, 2024

Series: Building Block for Pandas – An Overview


A Series is a one-dimensional labeled array for homogeneous data: all of its values share a single data type (dtype). Pandas assigns each value a label (an identifier used to locate the value) and an order (its position in the sequence). Positions in pandas start at 0, which means the first value occupies position 0.

A Series combines the best of lists and dictionaries. Like a list, it holds its values in a fixed order, and like a dictionary, it gives each value a key (label) that can be used to look it up.
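To make the list/dictionary comparison concrete, here is a minimal sketch. The `prices` data is made up for illustration; `.loc` and `.iloc` are the standard pandas accessors for label-based and position-based lookup.

```python
import pandas as pd

# Hypothetical example data: labels are fruit names, values are prices.
prices = pd.Series([1.5, 0.25, 3.0], index=["apple", "banana", "cherry"])

# Dictionary-like access: look a value up by its label.
by_label = prices.loc["banana"]

# List-like access: look a value up by its 0-based position.
by_position = prices.iloc[0]

print(by_label)     # 0.25
print(by_position)  # 1.5
```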


Overview

Let’s look at the arguments accepted when creating a Series in pandas:

| Parameter | Description |
| --- | --- |
| data | array-like, Iterable, dict, or scalar value. Contains data stored in Series. If data is a dict, argument order is maintained. |
| index | array-like or Index (1d). Values must be hashable and have the same length as data. Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, …, n) if not provided. If data is dict-like and the index is None, then the keys in the data are used as the index. If the index is not None, the resulting Series is reindexed with the index values. |
| dtype | str, numpy.dtype, or ExtensionDtype, optional. The data type for the output Series. If not specified, this will be inferred from data. |
| name | str, optional. The name to give to the Series. |
| copy | bool, default False. Copy input data. Only affects Series or 1d ndarray input. |
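The following sketch exercises the constructor parameters described above. The variable names and sample values are made up for illustration.

```python
import pandas as pd

# No index given: pandas falls back to a RangeIndex starting at 0.
default_index = pd.Series([10, 20, 30])

# Explicit index, dtype, and name.
named = pd.Series([10, 20, 30], index=["a", "b", "c"], dtype="float64", name="score")

# dict input: the keys become the index. Passing index= as well reindexes
# the result, so a label missing from the dict ("z") gets NaN.
from_dict = pd.Series({"a": 1, "b": 2}, index=["b", "a", "z"])

print(list(default_index.index))  # [0, 1, 2]
print(named.dtype, named.name)    # float64 score
print(from_dict)
```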

Numpy dependency

A pandas Series relies on NumPy to store its values. Let’s take a look.


import pandas as pd

calorie_info = {
    "Cereal": 125,
    "Chocolate Bar": 406,
    "Ice Cream Sundae": 342,
}
diet = pd.Series(calorie_info)

>>> type(diet.values)
<class 'numpy.ndarray'>

>>> type(diet.index)
<class 'pandas.core.indexes.base.Index'>
  • Here we can see that the values inside the Series are stored as a NumPy array.
    • The ndarray object is optimized for speed and efficiency: many of its calculations are carried out in lower-level C code.
  • The index labels are stored in a pandas Index object.

General Attributes of Series

| Attribute | Description |
| --- | --- |
| .values | returns an ndarray of the values |
| .index | returns the Index object holding the labels |
| .dtype | data type of the values inside the Series (a Series has a single dtype) |
| .size | returns the number of elements in the Series |
| .shape | returns a tuple of dimensions; a Series is one-dimensional, so this is (no_rows,) |
| .is_unique | returns True if all values are unique, else returns False |
| .is_monotonic_increasing | returns True if each value is greater than or equal to the previous one (increments do not have to be equal), else returns False; older pandas releases also offered this as .is_monotonic, which was removed in pandas 2.0 |
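As a quick illustration, the attributes above can be read off the `diet` Series from the earlier example (rebuilt here so the snippet runs on its own). It uses `is_monotonic_increasing`, the name under which the monotonicity check is available in current pandas releases.

```python
import pandas as pd

diet = pd.Series({"Cereal": 125, "Chocolate Bar": 406, "Ice Cream Sundae": 342})

print(diet.values)     # the underlying NumPy array of values
print(diet.index)      # the Index object holding the labels
print(diet.dtype)      # a single dtype for the whole Series
print(diet.size)       # 3
print(diet.shape)      # (3,)
print(diet.is_unique)  # True
print(diet.is_monotonic_increasing)  # False, because 406 is followed by 342
```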

Statistical Operations

Let’s briefly look at some of the statistical operations available for series

| Operation | Description |
| --- | --- |
| .count() | returns the count of non-null values |
| .sum() | returns the sum of the non-null values. Null values are ignored by default; with skipna=False the result is NaN if any value is null. min_count=x sets a minimum number of valid values: the sum is returned only if at least x valid values are present, otherwise the result is NaN |
| .product() | returns the product of all the non-null values; the skipna and min_count parameters are available here as well |
| .cumsum() | returns a Series with the cumulative sum as values. By default a NaN input yields NaN for that row only; with skipna=False every row from the first NaN onward is NaN |
| .pct_change() | returns the fractional change from one value to the next (multiply by 100 for a percentage). fill_method = “pad” / “ffill” (forward fill, the default: replaces NaN with the last valid value) or “bfill” / “backfill” (backward fill: replaces NaN with the next valid value); note that recent pandas releases deprecate fill_method |
| .mean() | returns the mean of all the values |
| .median() | returns the median of all the values |
| .std() | returns the standard deviation of all the values |
| .max() | returns the maximum value |
| .min() | returns the minimum value |
| .describe() | returns a Series with count, mean, std, min, 25%, 50%, 75%, max values |
| .sample() | returns a random sample from the Series; one value by default (n=1) |
| .unique() | returns an array with the unique values from the Series |
| .nunique() | returns the count of unique values in the Series |
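The effect of `skipna` and `min_count` is easiest to see on a small Series containing one missing value. This is a minimal sketch with made-up data.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0])

print(s.count())             # 3 non-null values
print(s.sum())               # 7.0, the NaN is skipped
print(s.sum(skipna=False))   # nan
print(s.sum(min_count=4))    # nan, because only 3 valid values are present

# By default only the NaN row stays NaN; with skipna=False every row
# from the first NaN onward is NaN.
print(s.cumsum().tolist())              # [1.0, 3.0, nan, 7.0]
print(s.cumsum(skipna=False).tolist())  # [1.0, 3.0, nan, nan]

print(s.nunique())           # 3, NaN is not counted
```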

Arithmetic Operations

Arithmetic operations on a Series are element-wise: the operation is applied to each value individually.

| Operation | Methods |
| --- | --- |
| Addition | series + X, series.add(X) |
| Subtraction | series - X, series.sub(X), series.subtract(X) |
| Multiplication | series * X, series.mul(X), series.multiply(X) |
| Division | series / X, series.div(X), series.divide(X) |
| Floor Division | series // X, series.floordiv(X) |
| Modulo | series % X, series.mod(X) |
  • Here, X can be a single value or another Series.
  • When X is a single value, the operation is applied to every value in the Series.
  • When X is another Series, the operation is applied to values with matching index labels.
    • If a label exists in only one of the two Series (for example, because one Series is longer), the result for that label is NaN. A NaN on either side of a matching label also produces NaN.

import numpy as np

# np.nan is used instead of np.NAN, which NumPy 2.0 removed
s1 = pd.Series([1, 2, np.nan, 4])
s2 = pd.Series([5, np.nan, 6, 7, 8])

>>> s1 + s2
0     6.0
1     NaN
2     NaN
3    11.0
4     NaN
dtype: float64

>>> s1 - s2
0   -4.0
1    NaN
2    NaN
3   -3.0
4    NaN
dtype: float64

>>> s1 * s2
0     5.0
1     NaN
2     NaN
3    28.0
4     NaN
dtype: float64

>>> s1 / s2
0    0.200000
1         NaN
2         NaN
3    0.571429
4         NaN
dtype: float64

>>> s1 // s2
0    0.0
1    NaN
2    NaN
3    0.0
4    NaN
dtype: float64

>>> s1 % s2
0    1.0
1    NaN
2    NaN
3    4.0
4    NaN
dtype: float64
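One reason to prefer the method forms over the operators is that they accept extra arguments. As a hedged sketch, the `fill_value` parameter of `Series.add` substitutes a default for a value that is missing on one side before the addition runs. The same `s1` and `s2` as above are rebuilt here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, np.nan, 4])
s2 = pd.Series([5, np.nan, 6, 7, 8])

# A scalar operand is applied to every value.
doubled = s1 * 2
print(doubled.tolist())  # [2.0, 4.0, nan, 8.0]

# fill_value=0 stands in for a value that is missing on one side only;
# a label missing on both sides would still give NaN.
total = s1.add(s2, fill_value=0)
print(total.tolist())    # [6.0, 2.0, 6.0, 11.0, 8.0]
```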

Sorting Series

We can use the sort_values or sort_index method to sort by values or by index, respectively. Let’s look at the parameters that can be passed to these methods.
Other parameters are available as well; you can find them in the pandas documentation.

| Parameter | Description |
| --- | --- |
| ascending | bool or list of bools, default True. If True, sort values in ascending order, otherwise descending. |
| na_position | {‘first’ or ‘last’}, default ‘last’. Argument ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end. |
| inplace | bool, default False. If True, perform operation in-place. |
  • We can overwrite the Series by setting the inplace parameter to True. Behind the scenes, however, pandas still creates a copy of the Series and then reassigns the variable to the new object. Thus, contrary to popular belief, inplace doesn’t offer any performance benefit.
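A minimal sketch of both sorting methods on made-up data. Each call returns a new Series and leaves the original untouched, since `inplace` is left at its default of False.

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0, 2.0], index=["d", "b", "a", "c"])

# Sort by value, largest first, with the NaN placed at the beginning.
by_value = s.sort_values(ascending=False, na_position="first")
print(by_value.index.tolist())  # ['b', 'd', 'c', 'a']

# Sort by index label.
by_label = s.sort_index()
print(by_label.index.tolist())  # ['a', 'b', 'c', 'd']

# The original order is unchanged.
print(s.index.tolist())         # ['d', 'b', 'a', 'c']
```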

Getting the smallest and largest values

We could simply sort our Series in ascending or descending order and then limit the result to the desired number of rows. But since this is a very common query, pandas offers dedicated methods for it: nlargest and nsmallest. By default, 5 rows are returned, but you can change this by passing an integer as an argument. Here is a small example:

>>> diet.nlargest(2)
Chocolate Bar       406
Ice Cream Sundae    342
dtype: int64

>>> diet.nsmallest(1)
Cereal    125
dtype: int64

Counting values in a series

We can use the value_counts method, which counts the number of occurrences of each value in the Series. The parameters for this method are:

| Parameter | Description |
| --- | --- |
| normalize | bool, default False. If True then the object returned will contain the relative frequencies of the unique values. |
| sort | bool, default True. Sort by frequencies. |
| ascending | bool, default False. Sort in ascending order. |
| bins | int, optional. Rather than count values, group them into half-open bins, a convenience for pd.cut, only works with numeric data. |
| dropna | bool, default True. Don’t include counts of NaN. |
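Here is a short sketch of `value_counts` with the `normalize` and `dropna` parameters. The `colors` data is made up for illustration.

```python
import numpy as np
import pandas as pd

colors = pd.Series(["red", "blue", "red", np.nan, "red", "blue"])

# Plain counts, most frequent first; the NaN is dropped by default.
counts = colors.value_counts()
print(counts["red"], counts["blue"])  # 3 2

# Relative frequencies among the 5 non-null values.
shares = colors.value_counts(normalize=True)
print(shares["red"])

# Keep NaN as its own category.
with_nan = colors.value_counts(dropna=False)
print(len(with_nan))  # 3
```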

Apply method

By using the apply method, we can invoke a function on every series value.

Let’s say we need to find out whether each food in our diet Series counts as high-calorie or low-calorie. We can define a function that takes care of the logic and then apply that function to each value in the Series.

def high_or_low_calories(calories):
    return 'High' if calories > 150 else 'Low'

>>> diet
Cereal              125
Chocolate Bar       406
Ice Cream Sundae    342
dtype: int64

>>> diet.apply(high_or_low_calories)
Cereal               Low
Chocolate Bar       High
Ice Cream Sundae    High
dtype: object
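The cutoff does not have to be hard-coded. As a sketch, `apply` forwards extra keyword arguments to the function it calls; the `label_calories` function and its `threshold` parameter below are hypothetical names introduced only for this illustration.

```python
import pandas as pd

diet = pd.Series({"Cereal": 125, "Chocolate Bar": 406, "Ice Cream Sundae": 342})

def label_calories(calories, threshold=150):
    # threshold is a hypothetical parameter, not part of pandas.
    return "High" if calories > threshold else "Low"

# Keyword arguments given to apply are passed on to label_calories.
labels = diet.apply(label_calories, threshold=400)
print(labels.tolist())  # ['Low', 'High', 'Low']
```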

This concludes the basic foundation of Series within Pandas. I will add additional posts for any advanced use cases as I come across them, so keep an eye out for that.

