Data Filtering and Aggregation in R: A Step-by-Step Guide
Introduction
Data analysis is a crucial step in understanding complex data sets. One of its fundamental tasks is filtering and aggregating data based on specific criteria. In this article, we will explore how to select rows based on customer IDs in the R programming language. We will also discuss how to find the last 3 actions performed by each customer ID.
Prerequisites
Before diving into the code, it’s essential to have a basic understanding of the R programming language and its data structures. Specifically, you should be familiar with:
- Data frames (df)
- Filtering data
- List manipulation in R
- Aggregation functions (e.g., sum, mean, max)
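As a quick refresher on the last item, the sketch below applies sum, mean, and max to one column of a small data frame using only base R. The data frame `orders` and its values are made up for illustration and are not part of this article's dataset.

```r
# Hypothetical toy data, used only for this refresher
orders <- data.frame(
  id       = c(1, 1, 2, 2, 2),
  quantity = c(10, 20, 5, 15, 40)
)

# Whole-column aggregates
total_qty <- sum(orders$quantity)
mean_qty  <- mean(orders$quantity)
max_qty   <- max(orders$quantity)

# Per-customer totals with base R's aggregate()
per_id <- aggregate(quantity ~ id, data = orders, FUN = sum)
print(per_id)
```

aggregate() returns one row per id, which is the same "split by customer, then summarise" idea that the grouped operations later in the article rely on.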
Installing Required Packages
To perform the filtering and aggregation operations described in this article, we need two additional packages:
- dplyr: A package that provides a grammar of data manipulation, with verbs such as filter(), arrange(), and group_by().
- readr: A package for reading and writing rectangular data files such as CSV.
You can install these packages using the following command:
install.packages(c("dplyr", "readr"))
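If you rerun a script often, you may prefer to install only what is missing. The sketch below is one way to check; `missing_packages` is a hypothetical helper name introduced here, not a function from any package.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("dplyr", "readr"))
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install call is left commented out so that running the block only reports what is missing and never changes your library.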
Data Importing
To perform the filtering and aggregation operations, we will need a sample dataset. Let’s create a simple dataset with customer IDs, dates, actions, and other columns.
# Load required libraries
library(dplyr)
library(readr)
# Create a sample dataset
set.seed(123) # For reproducibility
df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",
                   "2020-01-05", "2020-01-06", "2020-01-07", "2020-01-08",
                   "2020-01-09", "2020-01-10")),
  action = c("Buy", "Sell", "Buy", "Sell", "Buy", "Sell", "Buy", "Sell", "Buy", "Sell"),
  product = sample(c("A", "B", "C"), size = 10, replace = TRUE),
  quantity = runif(10, 0, 100)
)
# Print the first few rows of the dataset
head(df)
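This creates a 10-row data frame with customer IDs, dates, actions, products, and quantities. Note that each customer ID appears exactly once in this sample, so any per-customer selection will return at most one row per ID.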
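Before filtering, it can help to check the column types and how many rows each customer has. The sketch below rebuilds the sample data in a more compact form so that it runs on its own (note that `1:10` stores the IDs as integers rather than doubles) and uses only base R; the name `rows_per_id` is introduced here for illustration.

```r
# Compact rebuild of the sample data so this block is self-contained
set.seed(123)
df <- data.frame(
  id       = 1:10,
  date     = seq(as.Date("2020-01-01"), by = "day", length.out = 10),
  action   = rep(c("Buy", "Sell"), times = 5),
  product  = sample(c("A", "B", "C"), size = 10, replace = TRUE),
  quantity = runif(10, 0, 100)
)

# Column names and types at a glance
str(df)

# Number of rows recorded for each customer ID
rows_per_id <- table(df$id)
print(rows_per_id)
```

With real data, the same table() call quickly shows which customers have enough history for a "last 3 actions" query to be meaningful.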
Filtering Data
To filter the data based on specific criteria, we can use the filter() function from the dplyr package.
# Filter the data to only include rows where id is in the list of desired IDs
desired_ids <- c(1, 2, 5, 6, 10) # Replace with your own list of desired IDs
df_filtered <- df %>%
  filter(id %in% desired_ids)
# Print the first few rows of the filtered dataset
head(df_filtered)
In this example, we create a vector desired_ids containing the customer IDs that we want to include in our filtered dataset. We then use the filter() function to select only the rows where the id column matches one of these desired IDs.
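To see what the `%in%` test does on its own, the sketch below performs the same selection with base R subsetting. It uses a reduced two-column copy of the data so that it runs by itself; the names `keep` and `df_filtered_base` are introduced here for illustration.

```r
# Reduced copy of the sample data (IDs and actions only)
df <- data.frame(id = 1:10, action = rep(c("Buy", "Sell"), times = 5))
desired_ids <- c(1, 2, 5, 6, 10)

# %in% returns one TRUE/FALSE value per row of df
keep <- df$id %in% desired_ids
print(keep)

# Base R subsetting with the logical vector selects the same rows
# that filter(id %in% desired_ids) would keep
df_filtered_base <- df[keep, ]
print(df_filtered_base)
```

One visible difference is that base R subsetting keeps the original row names (1, 2, 5, 6, 10), whereas filter() on a plain data frame renumbers them.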
Finding Last 3 Actions for Each Customer ID
To find the last 3 actions performed by each customer ID, we can combine the arrange(), group_by(), and slice_head() functions from the dplyr package (slice_head() requires dplyr 1.0.0 or later).
# Sort the filtered dataset by date in descending order (newest first)
df_sorted <- df_filtered %>%
  arrange(desc(date))
# Group by customer ID, then keep the first 3 rows of each group.
# Because the data is sorted newest first, these are the 3 most recent actions.
df_last_actions <- df_sorted %>%
  group_by(id) %>%
  slice_head(n = 3) %>%
  ungroup()
# Print the first few rows of the dataset with last 3 actions
head(df_last_actions)
In this example, we sort the filtered dataset by date in descending order (newest first) using arrange(). We then group the data by id and use slice_head() to keep the first 3 rows of each group, which are the 3 most recent actions for that customer. The grouping must come before the slicing; otherwise the slice applies to the whole table rather than to each customer. Customers with fewer than 3 rows keep all of their rows, and since each ID appears once in the sample dataset, the result here has one row per customer.
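Because every customer in the sample dataset has a single row, the per-customer selection is easier to see on data with repeated IDs. The following is a minimal sketch under that assumption: the `actions` data frame is hypothetical, and it uses slice_max(), which is available in dplyr 1.0.0 and later, to pick the most recent rows per group without a separate sorting step.

```r
library(dplyr)

# Hypothetical data: customer 1 has 5 actions, customer 2 has only 2
actions <- data.frame(
  id     = c(1, 1, 1, 1, 1, 2, 2),
  date   = as.Date(c("2020-01-01", "2020-01-03", "2020-01-05",
                     "2020-01-07", "2020-01-09",
                     "2020-01-02", "2020-01-04")),
  action = c("Buy", "Sell", "Buy", "Sell", "Buy", "Buy", "Sell")
)

last_three <- actions %>%
  group_by(id) %>%
  # Keep the 3 rows with the latest dates in each group;
  # with_ties = FALSE guarantees at most 3 rows per customer
  slice_max(order_by = date, n = 3, with_ties = FALSE) %>%
  ungroup()

print(last_three)
```

Customer 1 keeps its three latest actions, while customer 2 keeps both of its rows because it has fewer than three.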
Conclusion
In this article, we discussed how to select rows based on customer IDs in the R programming language using the dplyr package. We also demonstrated how to find the last 3 actions performed by each customer ID. By following these steps and practicing with your own data, you should be able to filter your dataset and extract the most recent records per customer.
Note: The code blocks in this article are formatted using Hugo’s highlight shortcode. You can copy and paste the code into your own R environment to experiment with it.
Last modified on 2023-09-29