Data Import and Cells as Variables with Quantities
=====================================================
In this article, we will explore how to import a text file containing variable names with occurrence quantities or without any variables. We will use the dplyr and tidyr packages in R to achieve this.
Background
The text file contains rows where each column is separated by a space. The first two columns contain variable values, while the third column may contain variable names with occurrence quantities. For example:
| V1 | V2 | Variables:Qty |
|---|---|---|
| 1234 | 0001 | 38571:20 |
| 1235 | 0007 | 83744:1 74572:12 29401:12 |
Our goal is to import this data into R and transform it into a more usable format.
Sample Data
Here is some sample data that we will use for demonstration purposes:
V1 V2 V3 V4 V5
1: 1234 1 38571:20
2: 1235 7 83744:1 74572:12 29401:12
3: 8485 1284 74572:1 38571:16
4: 9412 8310
Using dplyr and tidyr
We will use the dplyr and tidyr packages to achieve our goal.
Step 1: Load Required Libraries
First, we need to load the required libraries:
library(tidyr)
library(dplyr)
Step 2: Import Data
Next, we import the data into R using read.table() or read.csv():
df1 <- read.table(text="V1 V2 Variables:Qty
1234 0001 '38571:20'
1235 0007 '83744:1 74572:12 29401:12'
8485 1284 '74572:1 38571:16'
9412 8310 ''",
header=TRUE, colClasses='character')
rawdata <- read.table(text="1234 0001 38571:20
1235 0007 83744:1 74572:12 29401:12
8485 1284 74572:1 38571:16
9412 8310",
header=FALSE, colClasses='character', fill=TRUE, na.strings='')
Step 3: Transform Data
We can now transform the data using dplyr and tidyr.
Step 3.1: Split Variables into Names and Quantities
First, we split the variable values into names and quantities:
rawdata %>%
gather(key, val, 3:5) %>%
separate(val, c('Variables', 'Qty'), sep = ':')
This step splits the val column into two new columns: Variables and Qty.
Step 3.2: Convert Quantity to Numeric
Next, we convert the quantity to a numeric value:
mutate(Qty = as.numeric(Qty))
This step converts the Qty column from character to numeric.
Step 3.3: Spread Variables
Now, we spread the variables back into separate columns:
spread(Variables, Qty)
This step spreads the Variables column back into two new columns: key_1 and key_2.
Step 3.4: Calculate Cumulative Key
Next, we calculate a cumulative key based on the original key value:
mutate(key = parse_number(key),
key = cumsum(key < lag(key, default = 0)))
This step calculates a new column called key.
Step 3.5: Select Non-Cumulative Key
Finally, we select only the non-cumulative keys:
select(-8)
This step removes the cumulative key from the data.
Step 4: Group and Summarize Data
We can now group and summarize the data using dplyr:
group_by(V1, V2, key) %>%
summarise_all(sum)
This step groups the data by V1, V2, and key, and then sums all numeric columns.
Result
The final result is a dataset that has been transformed from variable names with occurrence quantities to a more usable format:
Source: local data frame [4 x 6]
Groups: V1, V2 [4]
V1 V2 `29401` `38571` `74572` `83744`
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1234 0001 0 20 0 0
2 1235 0007 12 0 12 1
3 8485 1284 0 16 1 0
4 9412 8310 0 0 0 0
This dataset can now be used for further analysis or visualization.
Last modified on 2024-09-30