Applying Custom-Made Function to Column Pairs and Creating Summary Table
In this article, we will explore how to apply a custom-made function to column pairs in a dataset and create a summary table. The approach reshapes the data to long format, reshapes it again so that the two columns of each pair sit side by side, applies the function to every pair, groups by the variable of interest, and summarizes the results.
Introduction
When working with datasets that contain ratings or scores from multiple sources, it’s often necessary to compare and analyze these ratings to identify patterns, trends, or areas for improvement. One common approach is to use a custom-made function to calculate differences between column pairs and compute an agreement percentage.
In this article, we will walk through the steps required to achieve this goal using the tidyverse collection of packages in R (in particular dplyr, tidyr, and tibble).
Setting Up the Data
To begin with, let’s create a sample dataset that contains ratings on multiple parameters by two different raters. The data is represented as follows:
df <- structure(
  list(
    DH = c(0, 1, NA, NA, 1, 1, 1, 1, 1, 1),
    DH_ptak = c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1),
    SZ = c(1, 1, NA, NA, NA, 0, 1, 0, 1, 1),
    SZ_ptak = c(1, 1, NA, NA, NA, 1, 0, NA, 1, 1),
    RM = c(0, 1, 1, NA, NA, NA, 0, NA, 1, NA),
    RM_ptak = c(0, 1, 1, 1, 1, NA, 0, 1, NA, 1)
  ),
  row.names = c(NA, 10L), class = "data.frame"
)
Defining the Custom-Made Function
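The custom-made function compare_fun is used to find differences between ratings. It takes two rating vectors as arguments and returns 0 where they agree (including where both are missing) and 1 where they differ or where only one of them is missing:
# tidyverse loads dplyr, tidyr, and tibble, all of which are used below
library(tidyverse)
compare_fun <- function(c1, c2) {
  case_when(
    is.na(c1) & is.na(c2) ~ 0,  # both missing: counted as agreement
    is.na(c1) | is.na(c2) ~ 1,  # only one missing: counted as a difference
    c1 != c2 ~ 1,               # both present but different
    TRUE ~ 0                    # both present and equal
  )
}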
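To make the four branches concrete, here is a minimal sketch that calls compare_fun on two short vectors. The vectors rater1 and rater2 are hypothetical values chosen so that each branch is hit once; they are not part of the dataset above.

```r
library(dplyr)

# compare_fun repeated from the article so this snippet runs on its own
compare_fun <- function(c1, c2) {
  case_when(
    is.na(c1) & is.na(c2) ~ 0,
    is.na(c1) | is.na(c2) ~ 1,
    c1 != c2 ~ 1,
    TRUE ~ 0
  )
}

# Hypothetical ratings: equal, different, one missing, both missing
rater1 <- c(0, 1, NA, NA)
rater2 <- c(0, 0, 1, NA)

result <- compare_fun(rater1, rater2)
result
# 0 1 1 0
```

Because case_when evaluates its conditions in order, the two NA checks have to come before c1 != c2, which would otherwise return NA for any pair with a missing value.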
Pivoting the Data
To apply this function across all column pairs, we need to pivot the data multiple times. First, we convert the row names into a column:
df <- df %>%
  rownames_to_column("col")
Next, we pivot the data from wide to long format using pivot_longer:
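df <- df %>%
  pivot_longer(-col) %>%
  # fill = "right" pads names without an underscore (DH, SZ, RM) with NA
  separate(name, into = c("var", "tmp"), sep = "_", fill = "right") %>%
  mutate(grp = ifelse(is.na(tmp), "grp1", "grp2"))
In this step, pivot_longer stacks all rating columns into a name column and a value column. The separate function then splits each original column name at the underscore: var holds the base name (DH, SZ, or RM) and tmp holds the suffix (ptak), or NA when the name has no suffix. Finally, grp labels the values from the unsuffixed columns as grp1 and the values from the _ptak columns as grp2.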
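The name-splitting step can be seen in isolation on a tiny example. This is a sketch only: the toy tibble is a hypothetical two-row stand-in for the output of pivot_longer, and fill = "right" is an optional argument of separate that pads a name without an underscore with NA instead of emitting a warning.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the name/value columns produced by pivot_longer()
toy <- tibble::tibble(name = c("DH", "DH_ptak"), value = c(0, 1))

toy_split <- toy %>%
  # Split "DH_ptak" into var = "DH", tmp = "ptak"; "DH" gets tmp = NA
  separate(name, into = c("var", "tmp"), sep = "_", fill = "right") %>%
  # Rows without a suffix belong to the first rater, the rest to the second
  mutate(grp = ifelse(is.na(tmp), "grp1", "grp2"))

toy_split
```

Both rows end up with var equal to "DH", which is what later allows the two raters' values to be lined up next to each other.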
Applying the Custom-Made Function
Now that we have pivoted the data, we can apply the custom-made function across all column pairs using the mutate function:
df <- df %>%
  select(col, var, value, grp) %>%
  pivot_wider(names_from = grp, values_from = value) %>%
  mutate(diff = compare_fun(grp1, grp2))
Here, we use the pivot_wider function to reshape the data back into wide format and apply the custom-made function to each pair of columns.
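As a small illustration of the reshaping, the sketch below runs the same pivot_wider and mutate calls on a hypothetical four-row long table (toy_long is made up for this example). After widening there is one row per combination of col and var, with the two raters' values in grp1 and grp2.

```r
library(dplyr)
library(tidyr)

# compare_fun repeated from the article so this snippet runs on its own
compare_fun <- function(c1, c2) {
  case_when(
    is.na(c1) & is.na(c2) ~ 0,
    is.na(c1) | is.na(c2) ~ 1,
    c1 != c2 ~ 1,
    TRUE ~ 0
  )
}

# Hypothetical long data: two rated items, one variable, two raters
toy_long <- tibble::tribble(
  ~col, ~var, ~value, ~grp,
  "1",  "DH", 0,      "grp1",
  "1",  "DH", 0,      "grp2",
  "2",  "DH", NA,     "grp1",
  "2",  "DH", 1,      "grp2"
)

toy_wide <- toy_long %>%
  pivot_wider(names_from = grp, values_from = value) %>%
  mutate(diff = compare_fun(grp1, grp2))

toy_wide
# diff is 0 for item 1 (both raters gave 0) and 1 for item 2 (one rating missing)
```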
Grouping and Summarizing
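Finally, we group the data by the variable of interest (var) and summarize the results using the summarise function:
df <- df %>%
  group_by(var) %>%
  summarise(
    sum = sum(diff),
    # n() is the number of rows in the current group (10 rated items per variable)
    agree_pct = (n() - sum(diff)) / n() * 100
  )
In this step, we calculate the number of disagreements (sum) and the agreement percentage (agree_pct) for each variable. The percentage is computed with n(), the number of rows within each group, so that each variable is measured against its own 10 rated items rather than against the 30 rows of the whole reshaped table.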
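As a cross-check of the per-variable figures, the same quantities can be computed directly from the wide data without any pivoting. This is only a sketch under stated assumptions: the names raw, vars, and check are illustrative and not part of the original code, and the second rater's column is assumed to always be named with the _ptak suffix.

```r
library(dplyr)

# Original wide data, rebuilt here so the snippet runs on its own
raw <- data.frame(
  DH = c(0, 1, NA, NA, 1, 1, 1, 1, 1, 1),
  DH_ptak = c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  SZ = c(1, 1, NA, NA, NA, 0, 1, 0, 1, 1),
  SZ_ptak = c(1, 1, NA, NA, NA, 1, 0, NA, 1, 1),
  RM = c(0, 1, 1, NA, NA, NA, 0, NA, 1, NA),
  RM_ptak = c(0, 1, 1, 1, 1, NA, 0, 1, NA, 1)
)

# compare_fun repeated from the article
compare_fun <- function(c1, c2) {
  case_when(
    is.na(c1) & is.na(c2) ~ 0,
    is.na(c1) | is.na(c2) ~ 1,
    c1 != c2 ~ 1,
    TRUE ~ 0
  )
}

# Base names of the rated variables (assumed naming: "<var>" and "<var>_ptak")
vars <- c("DH", "SZ", "RM")

check <- bind_rows(lapply(vars, function(v) {
  d <- compare_fun(raw[[v]], raw[[paste0(v, "_ptak")]])
  tibble::tibble(
    var = v,
    sum = sum(d),                                   # number of disagreements
    agree_pct = 100 * (length(d) - sum(d)) / length(d)
  )
}))

check
# DH: 2 disagreements, 80 %; SZ: 3, 70 %; RM: 5, 50 %
```

For DH, for example, the only disagreements are rows 3 and 4, where the first rater's value is missing, giving (10 - 2) / 10 = 80 % agreement.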
Conclusion
By following these steps, you can apply a custom-made function to column pairs in your dataset and create a summary table. This approach is particularly useful when working with datasets that contain ratings or scores from multiple sources.
Here’s the complete code:
library(tidyverse)
# Set up the data
df <- structure(
  list(
    DH = c(0, 1, NA, NA, 1, 1, 1, 1, 1, 1),
    DH_ptak = c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1),
    SZ = c(1, 1, NA, NA, NA, 0, 1, 0, 1, 1),
    SZ_ptak = c(1, 1, NA, NA, NA, 1, 0, NA, 1, 1),
    RM = c(0, 1, 1, NA, NA, NA, 0, NA, 1, NA),
    RM_ptak = c(0, 1, 1, 1, 1, NA, 0, 1, NA, 1)
  ),
  row.names = c(NA, 10L), class = "data.frame"
)
# Define the custom-made function
compare_fun <- function(c1, c2) {
  case_when(
    is.na(c1) & is.na(c2) ~ 0,  # both missing: counted as agreement
    is.na(c1) | is.na(c2) ~ 1,  # only one missing: counted as a difference
    c1 != c2 ~ 1,               # both present but different
    TRUE ~ 0                    # both present and equal
  )
}
# Pivot the data
df <- df %>%
  rownames_to_column("col") %>%
  pivot_longer(-col) %>%
  # fill = "right" pads names without an underscore (DH, SZ, RM) with NA
  separate(name, into = c("var", "tmp"), sep = "_", fill = "right") %>%
  mutate(grp = ifelse(is.na(tmp), "grp1", "grp2"))
# Apply the custom-made function
df <- df %>%
  select(col, var, value, grp) %>%
  pivot_wider(names_from = grp, values_from = value) %>%
  mutate(diff = compare_fun(grp1, grp2))
# Group and summarize
df <- df %>%
  group_by(var) %>%
  summarise(
    sum = sum(diff),
    # n() is the number of rows in the current group (10 rated items per variable)
    agree_pct = (n() - sum(diff)) / n() * 100
  )
Output:
# A tibble: 3 x 3
  var     sum agree_pct
  <chr> <dbl>     <dbl>
1 DH        2        80
2 RM        5        50
3 SZ        3        70
Last modified on 2024-10-16