Performing Multiple Aggregations Based on Customer ID and Date Using Pandas GroupBy Method

In this article, we will explore how to perform multiple aggregations based on a combination of customer ID and date in a Pandas DataFrame. We’ll delve into the details of using the groupby method, aggregating values with various functions, and applying additional calculations for specific product categories.

Introduction

The groupby method is a powerful tool in Pandas that allows us to group data by one or more columns and perform aggregate operations on each group. In this article, we’ll focus on grouping data based on combinations of customer ID and date, and then performing multiple aggregations for different types of values.
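Before working with the full sales data, the split-apply-combine idea behind groupby can be sketched on a tiny made-up frame (the `key`/`value` names here are purely illustrative):

```python
import pandas as pd

# Toy frame: group 'a' has two rows, group 'b' has one
toy = pd.DataFrame({'key': ['a', 'a', 'b'], 'value': [1, 2, 10]})

# Split by 'key', apply a sum to each group, combine into one Series
totals = toy.groupby('key')['value'].sum()
print(totals)
```

Each distinct value of `key` becomes one group, and the aggregation runs once per group: `a` sums to 3 and `b` to 10.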

Problem Statement

The problem statement describes a DataFrame of sales data with columns such as Customer ID, Date, Quantity, Sales ID, Shop ID, and Product Category. We want to perform the following aggregations:

  1. Sum of quantity for each date-customer ID combination.
  2. Count of Sales IDs for each date-customer ID combination.
  3. Distinct count of Shop IDs for each date-customer ID combination.
  4. Additional calculations for product categories 0 and 1.

Solution Overview

To solve this problem, we’ll first use the groupby method to group the data by combinations of customer ID and date. We’ll then apply various aggregation functions to perform the desired calculations. For the additional calculations involving product categories, we’ll use the apply function along with custom functions.

Grouping Data

The first step is to group the data by combining Customer ID and Date. This can be achieved using the groupby method:

import pandas as pd

# Sample DataFrame (Product Category is included for the calculations later on)
df = pd.DataFrame({
    'Customer ID': [1, 1, 2, 2, 3, 3],
    'Date': ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05', '2020-01-06'],
    'Quantity': [10, 20, 30, 40, 50, 60],
    'Sales ID': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Shop ID': ['X', 'Y', 'Z', 'X', 'Y', 'Z'],
    'Product Category': ['0', '1', '0', '1', '0', '1']
})

# Group by Customer ID and Date
grouped_df = df.groupby(['Customer ID', 'Date'])

# head() returns the first rows of each group; every combination
# is unique in this sample, so all six rows come back
print(grouped_df.head())

Output:

   Customer ID        Date  Quantity Sales ID Shop ID Product Category
0            1  2020-01-01        10        A       X                0
1            1  2020-01-02        20        B       Y                1
2            2  2020-01-03        30        C       Z                0
3            2  2020-01-04        40        D       X                1
4            3  2020-01-05        50        E       Y                0
5            3  2020-01-06        60        F       Z                1

Aggregating Values

We can now apply various aggregation functions to the grouped data. Let’s start with the sum of quantity:

# Sum of Quantity
sum_df = grouped_df.agg({'Quantity': 'sum'})

print(sum_df)

Output:

          Quantity
Customer ID  Date
1            2020-01-01     10
1            2020-01-02     20
2            2020-01-03     30
2            2020-01-04     40
3            2020-01-05     50
3            2020-01-06     60

Similarly, we can count the Sales IDs and take a distinct count of the Shop IDs. In this small sample every date/customer combination appears exactly once, so both counts come out as 1 for every group:

# Count of Sales IDs
count_df = grouped_df.agg({'Sales ID': 'count'})

print(count_df)

Output:

          Sales ID
Customer ID  Date
1            2020-01-01     1
1            2020-01-02     1
2            2020-01-03     1
2            2020-01-04     1
3            2020-01-05     1
3            2020-01-06     1

# Distinct Count of Shop IDs
distinct_df = grouped_df.agg({'Shop ID': 'nunique'})

print(distinct_df)

Output:

           Shop ID
Customer ID  Date
1            2020-01-01    1
1            2020-01-02    1
2            2020-01-03    1
2            2020-01-04    1
3            2020-01-05    1
3            2020-01-06    1
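As an aside, the three separate .agg calls can be collapsed into a single pass using named aggregation (available since pandas 0.25). A self-contained sketch on a smaller, made-up frame, with illustrative result column names:

```python
import pandas as pd

df = pd.DataFrame({
    'Customer ID': [1, 1, 2],
    'Date': ['2020-01-01', '2020-01-01', '2020-01-02'],
    'Quantity': [10, 20, 30],
    'Sales ID': ['A', 'B', 'C'],
    'Shop ID': ['X', 'X', 'Y'],
})

# One pass: sum, count, and distinct count per (Customer ID, Date) group
summary = df.groupby(['Customer ID', 'Date']).agg(
    total_quantity=('Quantity', 'sum'),
    order_count=('Sales ID', 'count'),
    distinct_shops=('Shop ID', 'nunique'),
)
print(summary)
```

Each keyword names an output column, and the `(column, function)` tuple says what to compute, which keeps all three results aligned in one DataFrame.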

Additional Calculations for Product Categories

For the additional calculations involving product categories, we’ll use the apply function along with custom functions.

First, let’s create a custom function to count sales orders for product category 0:

def count_category_0(group):
    # Number of rows in this group whose Product Category is '0'
    return group[group['Product Category'] == '0'].shape[0]

Similarly, we can create another custom function to count sales orders for product category 1:

def count_category_1(group):
    # Number of rows in this group whose Product Category is '1'
    return group[group['Product Category'] == '1'].shape[0]
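Each helper simply filters the frame it receives and counts the surviving rows. A quick sanity check in isolation (the helper is redefined here so the snippet is self-contained):

```python
import pandas as pd

def count_category_0(group):
    # Number of rows whose Product Category is '0'
    return group[group['Product Category'] == '0'].shape[0]

sample = pd.DataFrame({'Product Category': ['0', '0', '1']})
print(count_category_0(sample))  # 2
```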

Now, let’s apply these functions to the grouped data using the apply method. GroupBy.apply passes each group’s sub-DataFrame to the function, so no axis argument is needed (extra keyword arguments would be forwarded to the function itself):

# Count of sales orders per group, for each product category
category_0_df = grouped_df.apply(count_category_0)
category_1_df = grouped_df.apply(count_category_1)

print(category_0_df)

Output:

Customer ID  Date
1            2020-01-01    1
1            2020-01-02    0
2            2020-01-03    1
2            2020-01-04    0
3            2020-01-05    1
3            2020-01-06    0
dtype: int64

And for product category 1:

print(category_1_df)

Output:

Customer ID  Date
1            2020-01-01    0
1            2020-01-02    1
2            2020-01-03    0
2            2020-01-04    1
3            2020-01-05    0
3            2020-01-06    1
dtype: int64
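The same per-category counts can also be computed without apply, by turning the category test into a 0/1 column and summing it inside the groupby. A sketch on a self-contained frame whose column names mirror the ones above:

```python
import pandas as pd

df = pd.DataFrame({
    'Customer ID': [1, 1, 2],
    'Date': ['2020-01-01', '2020-01-01', '2020-01-02'],
    'Product Category': ['0', '1', '0'],
})

# A boolean mask cast to int gives a per-row 0/1 indicator,
# so a groupby sum yields the per-group counts directly
counts = (
    df.assign(
        Category_0=(df['Product Category'] == '0').astype(int),
        Category_1=(df['Product Category'] == '1').astype(int),
    )
    .groupby(['Customer ID', 'Date'])[['Category_0', 'Category_1']]
    .sum()
)
print(counts)
```

This vectorized form is usually faster than apply on large frames, since it avoids calling a Python function once per group.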

We can now combine the results into a single DataFrame. Since every result shares the same (Customer ID, Date) index, pd.concat along the columns is the simplest way to line them up; note that an unnamed Series cannot be passed to pd.merge, so the category counts are given names first:

# Combined results, aligned on the (Customer ID, Date) index
combined_df = pd.concat(
    [sum_df, count_df, distinct_df,
     category_0_df.rename('Category_0'),
     category_1_df.rename('Category_1')],
    axis=1
)

print(combined_df)

Output:

                        Quantity  Sales ID  Shop ID  Category_0  Category_1
Customer ID Date
1           2020-01-01        10         1        1           1           0
1           2020-01-02        20         1        1           0           1
2           2020-01-03        30         1        1           1           0
2           2020-01-04        40         1        1           0           1
3           2020-01-05        50         1        1           1           0
3           2020-01-06        60         1        1           0           1

This concludes our example of grouping data, aggregating values, and applying custom functions to perform additional calculations. We hope this helps you understand the basics of grouping and aggregating with Pandas!


Last modified on 2024-08-08