A Series is a one-dimensional labeled array for homogeneous data: every value in it shares a single dtype. Pandas assigns each Series value a label (an identifier used to locate the value) and an order (its position in the sequence). Positions in pandas start at 0, so the first value occupies position 0.
A Series combines the best of lists and dictionaries. Like a list, it holds its values in a fixed order; like a dictionary, it gives each value a key (label) that can be used to look it up.
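To make the list/dictionary comparison concrete, here is a minimal sketch. The `prices` data is hypothetical and chosen only for illustration; `.iloc` looks a value up by position and `.loc` looks it up by label.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
prices = pd.Series([1.5, 0.25, 3.0], index=["apple", "banana", "cherry"])

first_by_position = prices.iloc[0]      # list-like: look up by position
banana_by_label = prices.loc["banana"]  # dict-like: look up by label

print(first_by_position)  # 1.5
print(banana_by_label)    # 0.25
```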
Overview
Let’s look at the arguments used to create a Series in pandas.
Parameter | Description |
---|---|
data | array-like, Iterable, dict, or scalar value. Contains data stored in Series. If data is a dict, argument order is maintained. |
index | array-like or Index (1d). Values must be hashable and have the same length as data. Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, …, n) if not provided. If data is dict-like and the index is None, then the keys in the data are used as the index. If the index is not None, the resulting Series is reindexed with the index values. |
dtype | str, numpy.dtype, or ExtensionDtype, optional. The data type for the output Series. If not specified, this will be inferred from data. |
name | str, optional. The name to give to the Series. |
copy | bool, default False. Copy input data. Only affects Series or 1d ndarray input. |
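The parameters above can be sketched as follows. The variable names and sample values are illustrative assumptions, not part of any real dataset.

```python
import pandas as pd

# From a list, with an explicit index, dtype, and name.
weights = pd.Series(
    [70, 82, 65],
    index=["ann", "bob", "cy"],
    dtype="float64",
    name="weight_kg",
)

# From a dict: the keys become the index when no index is passed.
calories = pd.Series({"Cereal": 125, "Chocolate Bar": 406})

# Dict plus an explicit index: the result is reindexed, so labels
# that are missing from the dict ("Soda" here) come back as NaN.
reindexed = pd.Series({"Cereal": 125, "Chocolate Bar": 406}, index=["Cereal", "Soda"])

print(weights.name)              # weight_kg
print(list(calories.index))      # ['Cereal', 'Chocolate Bar']
print(reindexed.isna().tolist()) # [False, True]
```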
NumPy dependency
A pandas Series relies on NumPy to store its values. Let’s take a look.
import pandas as pd

calorie_info = {
    "Cereal": 125,
    "Chocolate Bar": 406,
    "Ice Cream Sundae": 342,
}
diet = pd.Series(calorie_info)
>>> type(diet.values)
<class 'numpy.ndarray'>
>>> type(diet.index)
<class 'pandas.core.indexes.base.Index'>
- Here we can see that the values inside the Series are stored as a NumPy array.
- The ndarray object is optimized for speed and efficiency because many of its calculations run in lower-level C code.
- The index labels are stored in a pandas Index object.
General Attributes of Series
Attribute | Description |
---|---|
.values | returns an ndarray of the values |
.index | returns the Index object holding the labels |
.dtype | the data type of the values; a Series has a single dtype |
.size | returns the number of elements in the Series |
.shape | returns a tuple of dimensions; a Series has only one dimension, so it returns (no_rows, ) |
.is_unique | returns True if values are unique, else returns False |
.is_monotonic_increasing | returns True if each value is greater than or equal to the previous one (increments do not have to be equal), else returns False; older pandas versions also exposed this as .is_monotonic, which was removed in pandas 2.0 |
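A quick sketch of these attributes on a small, made-up Series:

```python
import pandas as pd

# Illustrative values: sorted, with one duplicate.
s = pd.Series([10, 20, 20, 30])

print(s.size)                     # 4
print(s.shape)                    # (4,)
print(s.is_unique)                # False, because 20 appears twice
print(s.is_monotonic_increasing)  # True, values never decrease
```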
Statistical Operations
Let’s briefly look at some of the statistical operations available on a Series.
Operation | Description |
---|---|
.count() | returns the count of non-null values |
.sum() | returns the sum of the non-null values; null values are ignored by default, but with skipna=False the result is NaN if any value is null; min_count=x sets the minimum number of valid values required, and the result is NaN when fewer than x are present |
.product() | returns the product of all the non-null values; the skipna and min_count parameters are available |
.cumsum() | returns a Series with the cumulative sum as values; by default a NaN stays NaN only in its own row, while with skipna=False every row from the first NaN onward is NaN |
.pct_change() | returns the fractional change from one value to the next (multiply by 100 for a percentage); fill_method controls how NaN values are filled before the calculation: "pad" / "ffill" (forward fill, historically the default) replaces NaN with the last valid value, and "bfill" / "backfill" (backward fill) replaces NaN with the next valid value; recent pandas releases deprecate fill_method, so check the documentation for your version |
.mean() | returns mean of all the values |
.median() | returns median of all the values |
.std() | returns standard deviation of all the values |
.max() | returns the maximum value from all the values |
.min() | returns the minimum value from all the values |
.describe() | returns a Series with the count, mean, std, min, 25%, 50%, 75%, and max values |
.sample() | returns a random sample from the Series; n=1 by default |
.unique() | returns an array with unique values from the series |
.nunique() | returns the count of unique values from the series |
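The null-handling options in the table are easier to see on a small example. This is a minimal sketch using a made-up three-element Series with one missing value.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

valid_count = s.count()               # 2 non-null values
total = s.sum()                       # 4.0, the NaN is skipped
strict_total = s.sum(skipna=False)    # NaN, because a null is present
guarded_total = s.sum(min_count=3)    # NaN, only 2 valid values exist

running = s.cumsum()                  # 1.0, NaN, 4.0
running_strict = s.cumsum(skipna=False)  # 1.0, NaN, NaN

print(valid_count, total)
print(running.tolist())
print(running_strict.tolist())
```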
Arithmetic Operations
Arithmetic on a Series is element-wise: the operation is applied to each element individually.
Operation | Methods |
---|---|
Addition | • series + X • series.add(X) |
Subtraction | • series - X • series.sub(X) • series.subtract(X) |
Multiplication | • series * X • series.mul(X) • series.multiply(X) |
Division | • series / X • series.div(X) • series.divide(X) |
Floor Division | • series // X • series.floordiv(X) |
Modulo | • series % X • series.mod(X) |
- Here, X can be a single value or another Series.
- When X is a single value, the operation is applied to every value in the Series.
- When X is another Series, the operation is applied to the values with matching index labels.
- If a label exists in only one of the two Series, or either value at that label is NaN, the result for that label is NaN.
import numpy as np

s1 = pd.Series([1, 2, np.nan, 4])
s2 = pd.Series([5, np.nan, 6, 7, 8])
>>> s1 + s2
0 6.0
1 NaN
2 NaN
3 11.0
4 NaN
dtype: float64
>>> s1 - s2
0 -4.0
1 NaN
2 NaN
3 -3.0
4 NaN
dtype: float64
>>> s1 * s2
0 5.0
1 NaN
2 NaN
3 28.0
4 NaN
dtype: float64
>>> s1 / s2
0 0.200000
1 NaN
2 NaN
3 0.571429
4 NaN
dtype: float64
>>> s1 // s2
0 0.0
1 NaN
2 NaN
3 0.0
4 NaN
dtype: float64
>>> s1 % s2
0 1.0
1 NaN
2 NaN
3 4.0
4 NaN
dtype: float64
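If the NaN results above are not what you want, the method forms of the operators (`add`, `sub`, and so on) accept a `fill_value` argument that stands in for a missing operand. A minimal sketch, reusing the same two Series:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, np.nan, 4])
s2 = pd.Series([5, np.nan, 6, 7, 8])

# Treat a value that is missing on one side as 0 before adding.
# A position stays NaN only if it is missing in both Series.
filled = s1.add(s2, fill_value=0)

print(filled.tolist())  # [6.0, 2.0, 6.0, 11.0, 8.0]
```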
Sorting Series
We can use the sort_values or sort_index method to sort by values or by index, respectively. Let’s look at the parameters that can be passed to these methods.
There are other parameters as well; you can find them in the pandas documentation.
Parameters | Description |
---|---|
ascending | bool or list of bools, default True. If True, sort values in ascending order, otherwise descending. |
na_position | {‘first’, ‘last’}, default ‘last’. Argument ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end. |
inplace | bool, default False. If True, perform operation in-place. |
- Setting the inplace parameter to True overwrites the Series. Behind the scenes, though, pandas still creates a copy of the Series and then reassigns the variable to the new object. So, contrary to popular belief, inplace does not offer any performance benefit.
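Here is a short sketch of both sorting methods. The Series is made up for illustration and includes one NaN so that na_position has a visible effect.

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0, 2.0], index=["c", "d", "a", "b"])

# Largest values first, with the NaN row moved to the top.
by_value = s.sort_values(ascending=False, na_position="first")

# Sort by label instead of by value.
by_index = s.sort_index()

print(list(by_value.index))  # ['d', 'c', 'b', 'a']
print(list(by_index.index))  # ['a', 'b', 'c', 'd']
```

Both calls return a new Series and leave `s` unchanged, which is usually the simpler pattern than inplace=True.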
Getting the smallest and largest values
We could sort the Series in ascending or descending order and then keep only the desired number of rows. But since this is a very common query, pandas offers dedicated methods for it: nlargest and nsmallest. By default, 5 rows are returned, but you can change this by passing an integer as an argument. Here is a small example:
>>> diet.nlargest(2)
Chocolate Bar 406
Ice Cream Sundae 342
dtype: int64
>>> diet.nsmallest(1)
Cereal 125
dtype: int64
Counting values in a series
We can use the value_counts method, which counts the number of occurrences of each series value. The parameters for this method are:
Parameters | Description |
---|---|
normalize | bool, default False. If True then the object returned will contain the relative frequencies of the unique values. |
sort | bool, default True. Sort by frequencies. |
ascending | bool, default False. Sort in ascending order. |
bins | int, optional. Rather than count values, group them into half-open bins, a convenience for pd.cut, only works with numeric data. |
dropna | bool, default True. Don’t include counts of NaN. |
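A minimal sketch of value_counts with the normalize and dropna parameters, using a made-up Series of letters with one missing entry:

```python
import pandas as pd

s = pd.Series(["a", "b", "a", None, "a", "b"])

counts = s.value_counts()                # a: 3, b: 2 (the missing value is dropped)
shares = s.value_counts(normalize=True)  # a: 0.6, b: 0.4
with_na = s.value_counts(dropna=False)   # also counts the missing value

print(counts["a"], counts["b"])  # 3 2
print(len(with_na))              # 3
```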
Apply method
By using the apply method, we can invoke a function on every series value.
Let’s say we need to find out whether each food in our diet series counts as high-calorie or low-calorie. We can define a function that takes care of the logic and then apply that function to each value within the series.
def high_or_low_calories(calory):
    return 'High' if calory > 150 else 'Low'
>>> diet
Cereal 125
Chocolate Bar 406
Ice Cream Sundae 342
dtype: int64
>>> diet.apply(high_or_low_calories)
Cereal Low
Chocolate Bar High
Ice Cream Sundae High
dtype: object
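For a simple threshold like this one, the same result can also be produced without calling a Python function per value. The sketch below is one possible alternative, not a replacement for apply: it compares the whole Series at once and then translates the resulting booleans with map.

```python
import pandas as pd

diet = pd.Series({"Cereal": 125, "Chocolate Bar": 406, "Ice Cream Sundae": 342})

# Compare every value to the threshold in one step,
# then map the True/False results to labels.
labels = (diet > 150).map({True: "High", False: "Low"})

print(labels.tolist())  # ['Low', 'High', 'High']
```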
This concludes the basic foundation of Series within Pandas. I will add additional posts for any advanced use cases as I come across them, so keep an eye out for that.