Chain df.str.split() in pandas dataframe

Introduction

When working with pandas dataframes, a common task is to split a string column into multiple columns. The str.split() method, available on a string column as df['col'].str.split(), can be used to achieve this, but chaining it in a single pipeline can be tricky. In this article, we will explore how to chain str.split() and look at simpler ways to accomplish the same task.

Understanding df.str.split()

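str.split() is a vectorized method of the .str accessor. It is called on a Series (a single column) rather than on the whole dataframe, and it splits each string into substrings at the specified separator. By default, it returns a Series of lists; passing expand=True returns a dataframe with one column per piece. For example: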

import pandas as pd

# Create a sample dataframe
data = {
    'id': [1, 2, 3, 4],
    'text': ['2022_amt', '2022_qty', '2022_amt', '2022_qty']
}
df = pd.DataFrame(data)

# .str is a Series accessor, so select the column first
print(df['text'].str.split('_'))

Output:

0    [2022, amt]
1    [2022, qty]
2    [2022, amt]
3    [2022, qty]
Name: text, dtype: object
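
To get separate columns directly, expand=True can be passed to str.split(). The following is a minimal sketch using the same sample data; the column names year and t are chosen here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'text': ['2022_amt', '2022_qty', '2022_amt', '2022_qty'],
})

# expand=True returns a DataFrame with one column per piece (labelled 0, 1, ...)
parts = df['text'].str.split('_', expand=True)

# Give the pieces readable names (illustrative) and attach them to the original frame
parts.columns = ['year', 't']
result = df.join(parts)
print(result)
```

The join aligns on the row index, so this works as long as the index has not been changed between the split and the join.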

Chaining df.str.split()

When using str.split() in a pipeline, you can chain it with other methods such as melt and assign to create new columns. Because the intermediate dataframe has no name inside a chain, the split has to be written inside a lambda, and it is repeated once for each new column.

import pandas as pd

# Create a sample dataframe in wide format: the column names encode year and type
data = {
    'id': [1, 2, 3, 4],
    '2022_amt': [10.1, 20.2, 30.3, 40.4],
    '2022_qty': [10.0, 20.0, 30.0, 40.0]
}
df = pd.DataFrame(data)

# Melt to long format, then chain str.split() to create the new columns
df_chained = (
    df.melt(
        id_vars=['id'],
        var_name='fy',
        value_name='num'
    )
    .assign(year=lambda d: d.fy.str.split('_').str[0],
            t=lambda d: d.fy.str.split('_').str[1])
)

print(df_chained)

Output:

   id        fy   num  year    t
0   1  2022_amt  10.1  2022  amt
1   2  2022_amt  20.2  2022  amt
2   3  2022_amt  30.3  2022  amt
3   4  2022_amt  40.4  2022  amt
4   1  2022_qty  10.0  2022  qty
5   2  2022_qty  20.0  2022  qty
6   3  2022_qty  30.0  2022  qty
7   4  2022_qty  40.0  2022  qty

However, this approach has two issues:

The split is evaluated twice, once for each new column.
The assign method needs a lambda (an anonymous function) for each new column, because the melted dataframe has no name inside the chain.
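
The chain can also be written so that the split runs only once and no lambda is needed, by moving the split into a small helper passed to pipe. This is a minimal sketch, assuming wide-format input whose column names look like 2022_amt and 2022_qty; the helper name split_fy and its parameters are hypothetical and not part of pandas.

```python
import pandas as pd

# Wide-format sample data (assumed): column names encode year and type
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    '2022_amt': [10.1, 20.2, 30.3, 40.4],
    '2022_qty': [10.0, 20.0, 30.0, 40.0],
})

def split_fy(frame, col='fy', names=('year', 't'), sep='_'):
    """Hypothetical helper: split `col` once and append the pieces as new columns."""
    parts = frame[col].str.split(sep, expand=True)
    parts.columns = list(names)
    return frame.join(parts)

df_chained = (
    df.melt(id_vars=['id'], var_name='fy', value_name='num')
      .pipe(split_fy)
)
print(df_chained)
```

Using pipe keeps the method chain intact while giving the intermediate dataframe a name inside the helper, which is what the lambdas were working around.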

Alternative Approaches

There are ways to split a column into multiple columns without repeating str.split() inside a chain.
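
As one more option alongside the approaches in this section, Series.str.extract with named regex groups can split and name the pieces in a single call. This is a minimal sketch, assuming every value has the form of four digits, an underscore, and a word; the pattern is an assumption about the data rather than something the original example specifies.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'text': ['2022_amt', '2022_qty', '2022_amt', '2022_qty'],
})

# Named groups become the column names of the extracted DataFrame
parts = df['text'].str.extract(r'(?P<year>\d{4})_(?P<t>\w+)')
result = df.join(parts)
print(result)
```

Rows that do not match the pattern come back as missing values, which may or may not be preferable to the behaviour of str.split() on malformed input.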

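Using DataFrame.stack()

When the year and type are encoded in the column labels, you can split the labels into a two-level MultiIndex and stack both levels. Note that stack is a DataFrame method; there is no pd.stack() function.

import pandas as pd

# Create a sample dataframe in wide format, indexed by id
data = {
    'id': [1, 2, 3, 4],
    '2022_amt': [10.1, 20.2, 30.3, 40.4],
    '2022_qty': [10.0, 20.0, 30.0, 40.0]
}
df = pd.DataFrame(data).set_index('id')

# Split the column labels into a (year, t) MultiIndex
df.columns = df.columns.str.split('_', expand=True)
df.columns.names = ['year', 't']

# Stack both levels into rows, then turn the index back into columns
df_split = df.stack(['year', 't']).reset_index(name='num')

print(df_split)

Output:

   id  year    t   num
0   1  2022  amt  10.1
1   1  2022  qty  10.0
2   2  2022  amt  20.2
3   2  2022  qty  20.0
4   3  2022  amt  30.3
5   3  2022  qty  30.0
6   4  2022  amt  40.4
7   4  2022  qty  40.0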

Using pivot_longer from pyjanitor

import pandas as pd
import janitor  # pyjanitor; importing it registers pivot_longer on DataFrame

# Create a sample dataframe in wide format
data = {
    'id': [1, 2, 3, 4],
    '2022_amt': [10.1, 20.2, 30.3, 40.4],
    '2022_qty': [10.0, 20.0, 30.0, 40.0]
}
df = pd.DataFrame(data)

# Reshape to long format and split the column names into year and t in one step
df_split = df.pivot_longer(index='id', names_to=['year', 't'], names_sep='_')

print(df_split)

Output:

   id  year    t  value
0   1  2022  amt   10.1
1   2  2022  amt   20.2
2   3  2022  amt   30.3
3   4  2022  amt   40.4
4   1  2022  qty   10.0
5   2  2022  qty   20.0
6   3  2022  qty   30.0
7   4  2022  qty   40.0

Conclusion

When working with pandas dataframes, it’s often necessary to split a column into multiple columns. Chaining str.split() inside assign works, but it repeats the split for every new column and relies on lambdas. Alternatives such as DataFrame.stack() on split column labels, or pivot_longer from pyjanitor, reshape the data and split the names in a single step.

Last modified on 2023-05-19