Importing Variable Names with Occurrence Quantities in R using dplyr and tidyr

Data Import and Cells as Variables with Quantities

=====================================================

In this article, we will explore how to import a text file containing variable names with occurrence quantities or without any variables. We will use the dplyr and tidyr packages in R to achieve this.

Background


The text file contains rows where each column is separated by a space. The first two columns contain variable values, while the third column may contain variable names with occurrence quantities. For example:

V1V2Variables:Qty
1234000138571:20
1235000783744:1 74572:12 29401:12

Our goal is to import this data into R and transform it into a more usable format.

Sample Data


Here is some sample data that we will use for demonstration purposes:

V1   V2       V3       V4       V5
1: 1234    1 38571:20                  
2: 1235    7  83744:1 74572:12 29401:12
3: 8485 1284  74572:1 38571:16         
4: 9412 8310                           

Using dplyr and tidyr


We will use the dplyr and tidyr packages to achieve our goal.

Step 1: Load Required Libraries

First, we need to load the required libraries:

library(tidyr)
library(dplyr)

Step 2: Import Data

Next, we import the data into R using read.table() or read.csv():

df1 <- read.table(text="V1   V2   Variables:Qty
                        1234 0001 '38571:20'
                        1235 0007 '83744:1 74572:12 29401:12'
                        8485 1284 '74572:1 38571:16'
                        9412 8310 ''",
                  header=TRUE, colClasses='character')

rawdata <- read.table(text="1234 0001 38571:20
                            1235 0007 83744:1 74572:12 29401:12
                            8485 1284 74572:1 38571:16
                            9412 8310",
                      header=FALSE, colClasses='character', fill=TRUE, na.strings='')

Step 3: Transform Data

We can now transform the data using dplyr and tidyr.

Step 3.1: Split Variables into Names and Quantities

First, we split the variable values into names and quantities:

rawdata %>% 
  gather(key, val, 3:5) %>% 
  separate(val, c('Variables', 'Qty'), sep = ':')

This step splits the val column into two new columns: Variables and Qty.

Step 3.2: Convert Quantity to Numeric

Next, we convert the quantity to a numeric value:

mutate(Qty = as.numeric(Qty))

This step converts the Qty column from character to numeric.

Step 3.3: Spread Variables

Now, we spread the variables back into separate columns:

spread(Variables, Qty)

This step spreads the Variables column back into two new columns: key_1 and key_2.

Step 3.4: Calculate Cumulative Key

Next, we calculate a cumulative key based on the original key value:

mutate(key = parse_number(key), 
       key = cumsum(key < lag(key, default = 0)))

This step calculates a new column called key.

Step 3.5: Select Non-Cumulative Key

Finally, we select only the non-cumulative keys:

select(-8)

This step removes the cumulative key from the data.

Step 4: Group and Summarize Data

We can now group and summarize the data using dplyr:

group_by(V1, V2, key) %>% 
summarise_all(sum)

This step groups the data by V1, V2, and key, and then sums all numeric columns.

Result


The final result is a dataset that has been transformed from variable names with occurrence quantities to a more usable format:

Source: local data frame [4 x 6]
Groups: V1, V2 [4]

     V1    V2 `29401` `38571` `74572` `83744`
   &lt;chr&gt; &lt;chr&gt;   &lt;dbl&gt;   &lt;dbl&gt;   &lt;dbl&gt;   &lt;dbl&gt;
1  1234  0001       0      20       0       0
2  1235  0007      12       0      12       1
3  8485  1284       0      16       1       0
4  9412  8310       0       0       0       0

This dataset can now be used for further analysis or visualization.


Last modified on 2024-09-30