Detecting Duplicate Values with Pandas: A Step-by-Step Guide

Introduction to Duplicate Value Detection with Pandas

In this article, we will explore how to detect duplicate values in a pandas DataFrame. Starting from a small sample of card transactions, we’ll walk through the steps required to identify and filter out sequentially repeated values based on specific criteria.

Setting Up the Data

First, let’s set up our data by creating a sample DataFrame of card transactions:

import pandas as pd

df = pd.DataFrame(
    {'id': ['1', '2', '3', '4', '5', '6', '7', '8'],
     'datetime': ['24.06.2013 00:13:49',
                  '24.06.2013 00:14:27',
                  '24.06.2013 00:17:45',
                  '24.06.2013 00:21:54',
                  '24.06.2013 00:21:59',
                  '24.06.2013 00:22:05',
                  '24.06.2013 00:25:14',
                  '24.06.2013 00:26:04'],
     'card_num': ['10', '10', '27', '10', '34', '10', '7', '3'],
     'type': ['cash_withdrawal', 'cash_withdrawal', 'refill', 'cash_withdrawal', 'payment', 'cash_withdrawal', 'payment', 'cash_withdrawal'],
     'result': ['refusal', 'refusal', 'successful', 'refusal', 'successful', 'successful', 'successful', 'successful'],
     'summ': [10000, 8000, 42431, 4000, 2347, 3500, 105, 999]})

Initial Filtering

The initial filtering step keeps only the rows whose type is not equal to 'refill':

df_report = df[df['type'] != 'refill']

In the sample data this drops a single row, the refill with id 3, and keeps the other seven with their original index labels.
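The comparison inside the brackets produces a boolean Series, and indexing with it keeps only the rows marked True. A minimal sketch on a hypothetical three-row frame (the names sample, mask and kept are illustrative and not part of the article's data):

```python
import pandas as pd

# Hypothetical miniature frame, used only to show how the mask behaves
sample = pd.DataFrame({
    'type': ['cash_withdrawal', 'refill', 'payment'],
    'summ': [100, 200, 300],
})

# .ne() is the method form of !=; it returns one boolean per row
mask = sample['type'].ne('refill')

# Boolean indexing keeps the True rows and preserves their index labels
kept = sample[mask]
print(kept)
```

Because the index labels are preserved, the filtered frame can still be matched back to the original rows later on.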

Identifying Sequential Duplicate Values

To identify sequential duplicate values, we encode each operation type as a 0/1 indicator column, take a rolling sum of each indicator over a window of four rows, and keep the rows where that sum exceeds 3. A sum of 4 means that all four rows in the window share the same type.

Here’s the code:

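# Copy the slice so new columns can be added safely, and parse the timestamps
df1 = df[df['type'].isin(['cash_withdrawal', 'payment'])].copy()
df1['datetime'] = pd.to_datetime(df1['datetime'], format='%d.%m.%Y %H:%M:%S')
# Encode each operation type as a 0/1 indicator column
df1 = df1.join(pd.get_dummies(df1['type'], dtype=int))

# Rolling sum over four rows. Assumption: the window is applied per card_num,
# since mixing all cards leaves no four-row window of one type in this sample.
roll4 = lambda s: s.rolling(4).sum()
df1['withd>3'] = df1.groupby('card_num')['cash_withdrawal'].transform(roll4)
df1['payt>3'] = df1.groupby('card_num')['payment'].transform(roll4)

# Filter for rolling sums above 3, i.e. four identical operations in a row
df1 = df1[(df1['withd>3'] > 3) | (df1['payt>3'] > 3)]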

Note that we’ve applied a specific filter to the type column, assuming that only 'cash_withdrawal' and 'payment' values are of interest.
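To see why a rolling sum above 3 marks a run of repeats, it may help to look at the idea on a single column in isolation. The sketch below uses a hypothetical sequence of operations (ops, flags and run_ends are illustrative names, not the article's data) and a window of four:

```python
import pandas as pd

# Hypothetical sequence of operation types
ops = pd.Series([
    'cash_withdrawal', 'cash_withdrawal', 'payment', 'cash_withdrawal',
    'cash_withdrawal', 'cash_withdrawal', 'cash_withdrawal', 'payment',
])

# 1 where the operation is a cash withdrawal, 0 otherwise
flags = (ops == 'cash_withdrawal').astype(int)

# Sum of the current row and the three rows before it;
# the first three positions are NaN because the window is incomplete
window_sum = flags.rolling(4).sum()

# A sum of 4 is only possible when all four rows in the window are withdrawals
run_ends = window_sum[window_sum > 3].index.tolist()
print(run_ends)
```

Each flagged position is the last row of a run of four, so the approach reports where a run completes rather than every row that belongs to it.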

Combining Filters

We now have two filtered DataFrames: df_report (with all rows except those with type equal to 'refill') and df1 (with sequential duplicate values identified).

To combine these filters, we can use the following code:

final_df = df_report[(df_report['id'].isin(df1['id']))]

This filter selects only rows from df_report that have matching ids in df1.
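As a small illustration of how isin behaves, the sketch below uses two hypothetical frames (report and flagged are illustrative names). It also shows that the match is by value, so the id columns on both sides need the same type:

```python
import pandas as pd

# Hypothetical frames; ids are strings, as in the article's sample data
report = pd.DataFrame({'id': ['1', '2', '3'], 'summ': [10, 20, 30]})
flagged = pd.DataFrame({'id': ['3', '1']})

# Rows of report whose id appears anywhere in flagged['id'];
# the result keeps the row order of report, not of flagged
selected = report[report['id'].isin(flagged['id'])]
print(selected)

# String ids do not match integer ids
matches_int = report['id'].isin([1]).any()
print(matches_int)
```

If one frame stored ids as integers and the other as strings, the filter would silently return no rows, so it is worth checking the dtypes before combining.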

Final Result

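A single row is flagged: the successful withdrawal of 3500 on card 10 (id 6), which is that card’s fourth cash withdrawal in a row. final_df holds this row with the original columns only; the matching row of df1 also shows the helper columns:

   id            datetime card_num             type      result  summ  cash_withdrawal  payment  withd>3  payt>3
5   6 2013-06-24 00:22:05       10  cash_withdrawal  successful  3500                1        0      4.0     0.0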

In this article, we’ve demonstrated how to detect duplicate values in a pandas DataFrame using a combination of filtering and rolling sum techniques.


Last modified on 2024-02-20