Chain df.str.split() in pandas dataframe
Introduction
When working with pandas dataframes, one common task is to split a column into multiple columns. The str.split() method (written df.str.split() in this article, although the .str accessor belongs to a Series or Index, not to a whole dataframe) can be used to achieve this, but chaining it in a single pipeline can be tricky. In this article, we will explore how to chain str.split() and look at simpler ways to accomplish the same task.
Understanding df.str.split()
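str.split() is a vectorized method that splits each string in a column into substrings based on a specified separator. It is available through the .str accessor of a Series, so it has to be called on a single column such as df['text'] rather than on the dataframe itself. By default, it returns a Series of lists; with expand=True it returns a dataframe with one column per piece. For example:
import pandas as pd
# Create a sample dataframe
data = {
    'id': [1, 2, 3, 4],
    'text': ['2022_amt', '2022_qty', '2022_amt', '2022_qty']
}
df = pd.DataFrame(data)
# Split each value of the text column on the underscore
print(df['text'].str.split('_'))
Output:
0    [2022, amt]
1    [2022, qty]
2    [2022, amt]
3    [2022, qty]
Name: text, dtype: object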
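As a small sketch of the expand option: passing expand=True to str.split() on a single column returns a dataframe with one column per piece, which can then be named and joined back onto the original rows. The column names year and t are chosen here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'text': ['2022_amt', '2022_qty', '2022_amt', '2022_qty'],
})

# expand=True returns a DataFrame whose columns are labelled 0, 1, ...
parts = df['text'].str.split('_', expand=True)
parts.columns = ['year', 't']  # illustrative names

# Attach the pieces to the original rows (the indexes align one-to-one)
result = df.join(parts)
print(result)
```

The split happens only once here, and both pieces are available as ordinary columns afterwards.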
Chaining df.str.split()
When using str.split() in a method chain, you typically call it inside assign(), passing a lambda so that it operates on the intermediate dataframe rather than on the original one. Extracting several pieces this way means repeating the split once for every new column.
import pandas as pd
# Create a sample wide dataframe: one column per year/measure pair
data = {
    'id': [1, 2, 3, 4],
    '2022_amt': [10.1, 20.2, 30.3, 40.4],
    '2022_qty': [10.0, 20.0, 30.0, 40.0]
}
df = pd.DataFrame(data)
# Melt to long form, then chain str.split() to create the new columns
df_chained = (
    df.melt(
        id_vars=['id'],
        value_vars=['2022_amt', '2022_qty'],
        var_name='fy',
        value_name='num'
    )
    # Each lambda receives the melted dataframe and splits fy again
    .assign(year=lambda d: d.fy.str.split('_').str[0],
            t=lambda d: d.fy.str.split('_').str[1])
)
print(df_chained)
Output:
   id        fy   num  year    t
0   1  2022_amt  10.1  2022  amt
1   2  2022_amt  20.2  2022  amt
2   3  2022_amt  30.3  2022  amt
3   4  2022_amt  40.4  2022  amt
4   1  2022_qty  10.0  2022  qty
5   2  2022_qty  20.0  2022  qty
6   3  2022_qty  30.0  2022  qty
7   4  2022_qty  40.0  2022  qty
However, this approach has two issues:
- The same column is split twice, once for each new column.
- The assign method needs an anonymous function for each new column, because the intermediate dataframe exists only inside the chain.
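One way to address both issues while staying inside a chain is to split once in a small helper and pass it to pipe(). The sketch below assumes a wide dataframe with illustrative columns 2022_amt and 2022_qty; split_column is a hypothetical helper written for this example, not a pandas function.

```python
import pandas as pd

# Illustrative wide data; the column names are assumptions for this sketch
wide = pd.DataFrame({
    'id': [1, 2],
    '2022_amt': [10.1, 20.2],
    '2022_qty': [10.0, 20.0],
})

def split_column(frame, column, names, sep='_'):
    """Hypothetical helper: split `column` once and attach the pieces."""
    parts = frame[column].str.split(sep, expand=True)
    parts.columns = names
    return frame.join(parts)

result = (
    wide.melt(id_vars='id', var_name='fy', value_name='num')
        .pipe(split_column, 'fy', ['year', 't'])
)
print(result)
```

Because pipe() hands the intermediate dataframe to the helper, no lambda is needed and the split runs a single time regardless of how many pieces are kept.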
Alternative Approaches
There are simpler and more efficient ways to split a column into multiple columns without chaining df.str.split().
Using DataFrame.stack()
import pandas as pd
# Create a sample wide dataframe
data = {
    'id': [1, 2, 3, 4],
    '2022_amt': [10.1, 20.2, 30.3, 40.4],
    '2022_qty': [10.0, 20.0, 30.0, 40.0]
}
df = pd.DataFrame(data)
# Split the column names into a two-level header, then stack both levels into rows
df_split = (
    df.set_index('id')
    .pipe(lambda d: d.set_axis(d.columns.str.split('_', expand=True), axis=1))
    .rename_axis(columns=['year', 't'])
    .stack(['year', 't'])
    .rename('num')
    .reset_index()
)
print(df_split)
Output:
   id  year    t   num
0   1  2022  amt  10.1
1   1  2022  qty  10.0
2   2  2022  amt  20.2
3   2  2022  qty  20.0
4   3  2022  amt  30.3
5   3  2022  qty  30.0
6   4  2022  amt  40.4
7   4  2022  qty  40.0
Using pivot_longer from pyjanitor
import pandas as pd
import janitor as jn
# Create a sample wide dataframe
data = {
    'id': [1, 2, 3, 4],
    '2022_amt': [10.1, 20.2, 30.3, 40.4],
    '2022_qty': [10.0, 20.0, 30.0, 40.0]
}
df = pd.DataFrame(data)
# Reshape to long form, splitting each column name into year and t at the underscore
df_split = df.pivot_longer(index='id', names_to=['year', 't'], names_sep='_')
print(df_split)
Output:
   id  year    t  value
0   1  2022  amt   10.1
1   2  2022  amt   20.2
2   3  2022  amt   30.3
3   4  2022  amt   40.4
4   1  2022  qty   10.0
5   2  2022  qty   20.0
6   3  2022  qty   30.0
7   4  2022  qty   40.0
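For readers who prefer not to add a dependency, roughly the same reshaping can be sketched with pandas alone. This is an approximation of the result above, not a description of how pivot_longer works internally: it melts the wide columns and then uses str.extract() with named groups, whose names become the new column names. The regular expression assumes column names of the form digits, underscore, word.

```python
import pandas as pd

# Wide data with one column per year/measure pair
wide = pd.DataFrame({
    'id': [1, 2, 3, 4],
    '2022_amt': [10.1, 20.2, 30.3, 40.4],
    '2022_qty': [10.0, 20.0, 30.0, 40.0],
})

# Move the former column names into a single 'name' column
melted = wide.melt(id_vars='id', var_name='name', value_name='value')

# Named groups become the column names of the extracted dataframe
parts = melted['name'].str.extract(r'(?P<year>\d+)_(?P<t>\w+)')

# Reassemble in the order id, year, t, value
result = pd.concat([melted[['id']], parts, melted[['value']]], axis=1)
print(result)
```

Unlike the chained assign() version, the name is parsed once, and the pattern documents the expected shape of the column names.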
Conclusion
When working with pandas dataframes, it’s often necessary to split a column into multiple columns. Chaining str.split() inside assign() works, but it repeats the split for every new column and relies on anonymous functions. Alternatives such as DataFrame.stack() or pivot_longer from pyjanitor reach the same result in a single reshaping step.
Last modified on 2023-05-19