Appendix A — Wrangling data in R

A.1 Logical indexing

Logical indexing is one of the most powerful data wrangling techniques. It uses logical operators like == (equal to) and != (not equal to) to subset data based on multiply criteria. Below are some examples to explain indexing, logical operators, and finally subsetting data with logical indexing.

A.1.1 Indexing data

If x is a vector of the letters A through F in alphabetical order, then we can call the fourth letter (D) by the syntax x[4]:

x <- c("A", "B", "C", "D", "E", "F")

x[4]

[1] "D"

If x is a matrix (or a dataframe) then we can index the data by the syntax x[row, column]. For example

x[2, 3] - Extract the data in row 2, column 3.
x[ , 3] - Extract all rows in column 3.
x[2, ] - Extract all columns in row 2.

# Create a matrix with 3 columns
x <- matrix(c("A", "B", "C", "D", "E", "F"), ncol = 3)

# Print full matrix
x

     [,1] [,2] [,3]
[1,] "A"  "C"  "E" 
[2,] "B"  "D"  "F"

# Row 2, column 3
x[2, 3]

[1] "F"

# All rows, column 3
x[ , 3]

[1] "E" "F"

# Row 2, all columns
x[2, ]

[1] "B" "D" "F"

A.1.2 Logical operators in R

Logical operators are basic statements that return TRUE or FALSE (table A.1).

Table A.1: Some of the logical operators available in R

Name	Operator	Returns `TRUE`	Returns `FALSE`
Equal to	`==`	`"A" == "A"`	`"A" == "B"`
Not equal to	`!=`	`"A" != "B"`	`"A" != "A"`
Less than	`<`	`2 < 3"`	`3 < 2`
Greater than	`>`	`15 > 10`	`10 > 15`
Less than or equal to	`<=`	`10 <= 10` or `9 <= 10`	`10 <= 9`
Greater than or equal to	`>=`	`22 >= 22` or `25 >= 22`	`20 >= 15`
In	`%in%`	`"A" %in% c("A", "B" "C")`	`"F" %in% c("A", "B" "C")`

If we have a vector x with multiple instances of A, B and C we can index it as:

# Create the vector
x <- rep(c("A", "B", "C"), each = 3)

# Print the vector
x

[1] "A" "A" "A" "B" "B" "B" "C" "C" "C"

# Check where the elements of x equals "B"
x == "B"

[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE

If we want to check multiple statements at once we can use the AND & and the OR | operator.

# Check where the elements of x equals "B" OR "C"
x == "A" | x == "C"

[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE

# Check where the elements of x equals "B" AND "C"
x == "A" & x == "C"

[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

A.1.3 Logical indexing of data

It is possible to use a logical vector (i.e. a vector that only contains TRUE and FALSE) to index data. When passing a logical vector to a data object it returns the data in all the places where the vector equals TRUE.

As an example, we load the palmerpenguins dataset, inspect it and do some simple logical indexing.

# Inspect the data
head(penguins)

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

# Extract all rows where species equals "Adelie"
x <- penguins[penguins$species == "Adelie", ]

head(x)

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

# Extract all rows where species equals "Adelie"
# AND body mass in greater than 4000
x <- penguins[penguins$species == "Adelie" & penguins$body_mass_g > 4000, ]

head(x)

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 <NA>    <NA>                NA            NA                  NA          NA
2 Adelie  Torgersen           39.2          19.6               195        4675
3 Adelie  Torgersen           42            20.2               190        4250
4 Adelie  Torgersen           34.6          21.1               198        4400
5 Adelie  Torgersen           42.5          20.7               197        4500
6 Adelie  Torgersen           46            21.5               194        4200
# ℹ 2 more variables: sex <fct>, year <int>

A.2 The pipe operator

In R you have the opportunity to use a special syntax called the pipe operator |>. It can be thought of as a pipe that “sends data downstream” to the next call in your script.

If we for example want to take the square root of some data df and afterwards show the first 6 values of that data with head() we can write

df |>
  sqrt() |>
  head()

which is equivalent to

head(sqrt(df))

This might not make much sense for this example, but when things get a bit more complex the pipe operator can really help by making code easier to read.

If we for example want to format some data df and then plot it in ggplot2 we can write the following script

df |>
  pivot_longer(
    cols = everything(),
    names_to = "group",
    names_prefix = "X",
    values_to = "response"
    ) |>
  mutate(group = str_c("Group ", group)) |>
  ggplot(aes(group, response)) +
  geom_boxplot()

Without the pipe we would need to create several intermediate variables cluttering up our script and environment in the process.

A.3 Tidy data

“Tidy data” is a way of storing data in tables that is commonly used in R. A dataset is considered tidy when it adheres to these two principles:

Each column is variable.
Each row is an observation.

It is easier to understand by looking at an example:

Let us imagine that we have conducted a weight loss study by weighing 10 persons each week for 4 weeks total. That is

\[ n = 10 \times 4 = 40 \text{ observations.} \]

The data from the study is loaded into R and displayed.

head(weight_data) # Show the first 6 rows of the data

  person    week1    week2     week3    week4
1      1 63.65420 51.82889  70.43640 69.64181
2      2 82.52186 57.75702 107.99418 71.58120
3      3 46.23159 56.47377  81.30373 60.50684
4      4 87.45990 94.92606  52.25914 75.58260
5      5 43.33043 73.33685  58.24029 85.13847
6      6 95.67937 54.84374  85.67053 92.48265

It is clear that this data does not currently conform to the principles of tidy data. The variable “Which week is this measurement from?” is spread out across the column names of the last 4 columns. That is not practical for working with it in R, and something as simple as plotting the data using ggplot2 would be almost impossible.

In the next section, we use a bit of code to convert this into a tidy data format.

A.3.1 How to convert to tidy data

Using Tidyverse (remember to load it with library(tidyverse)), it is possible to pivot our current data frame into the longer tidy format.

weight_data_tidy <- weight_data |> 
  pivot_longer(
    cols = -person, # Which columns to pivot (everything but "person")
    names_to = "week", # Name of the column to store previous column names
    values_to = "weight" # Name of the column to store data in
  )

head(weight_data_tidy)

# A tibble: 6 × 3
  person week  weight
   <int> <chr>  <dbl>
1      1 week1   63.7
2      1 week2   51.8
3      1 week3   70.4
4      1 week4   69.6
5      2 week1   82.5
6      2 week2   57.8

Now the data has been converted into a tidy format. Each row is an observation, and each column is a variable describing a parameter of the observation:

person is the person number.
week is the week of the weight measurement.
weight is the weight recorded in that observation.

A.4 Summarise by group

Sometimes we want to calculate a statistic per group. There are many different ways of doing this, and in these examples we are going to present a way possible ways.

For these examples we use the Palmer Penguins dataset. It consists of some data regarding the bill size, flipper length and body mass of three penguin species across three islands.

Let us start by taking a look at our data. This can be done via the head() function or the glimpse() function from the Tidyverse.

head(penguins) # Show the first 6 rows of dataframe

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

glimpse(penguins)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

A.4.1 One statistic per group

If we want to summarise one statistic, that could be the mean or the standard deviation, grouped by species or island (or both), we can do it in the following ways.

Using Tidyverse
Using base R

Important

If the pipe operator |> is unkown to you see Section A.2.

To compute the mean of all numeric variables we can use the summarise() function from the dplyr package in Tidyverse. Remember to load the tidyverse package (library(tidyverse)) - this package includes dplyr as well as a whole lot of other nice packages.

For this example we want the mean for all numeric variables grouped by species.

penguins |>
  drop_na() |> # Remove rows with missing values
  summarise(
    across(
      .cols = where(is.numeric), # Chose columns that are numeric 
      .fns = mean # Set the function we want to use
           ),
    .by = species # Group by species
    )

# A tibble: 3 × 6
  species   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
  <fct>              <dbl>         <dbl>             <dbl>       <dbl> <dbl>
1 Adelie              38.8          18.3              190.       3706. 2008.
2 Gentoo              47.6          15.0              217.       5092. 2008.
3 Chinstrap           48.8          18.4              196.       3733. 2008.

Be aware that the function drops all non-numeric variables that are not part of the grouping. So the output of the above code is “missing” the island and sex variables. Also, the drop_na() function removes all rows with missing values - if this is not done all columns with missing values will return NA.

If we want to group by multiple variables, for example species and island, we just need to pass a vector of the variables to the .by = argument.

penguins |>
  drop_na() |> # Remove rows with missing values
  summarise(
    across(where(is.numeric), mean),
    .by = c(island, species) # Group by island and species
  )

# A tibble: 5 × 7
  island    species   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>              <dbl>         <dbl>             <dbl>       <dbl>
1 Torgersen Adelie              39.0          18.5              192.       3709.
2 Biscoe    Adelie              39.0          18.4              189.       3710.
3 Dream     Adelie              38.5          18.2              190.       3701.
4 Biscoe    Gentoo              47.6          15.0              217.       5092.
5 Dream     Chinstrap           48.8          18.4              196.       3733.
# ℹ 1 more variable: year <dbl>

To compute the mean of all variables grouped by species we can use the aggregate() function. This is included in base R, so there is no need to load any packages.

# Remove rows with missing values 
penguins_clean <- na.omit(penguins)

aggregate(penguins_clean, list(species = penguins_clean$species), mean)

    species species island bill_length_mm bill_depth_mm flipper_length_mm
1    Adelie      NA     NA       38.82397      18.34726          190.1027
2 Chinstrap      NA     NA       48.83382      18.42059          195.8235
3    Gentoo      NA     NA       47.56807      14.99664          217.2353
  body_mass_g sex     year
1    3706.164  NA 2008.055
2    3733.088  NA 2007.971
3    5092.437  NA 2008.067

Be aware that all non-numeric variables will return NA (and thus a lot of warnings). Also, the na.omit() function removes all rows with missing values - if this is not done all columns with missing values will return NA.

If we want to group by multiple variables, for example species and island, we just need to pass another grouping variable to the list.

aggregate(penguins_clean, 
          list(species = penguins_clean$species,
               island = penguins_clean$island), 
          mean)

    species    island species island bill_length_mm bill_depth_mm
1    Adelie    Biscoe      NA     NA       38.97500      18.37045
2    Gentoo    Biscoe      NA     NA       47.56807      14.99664
3    Adelie     Dream      NA     NA       38.52000      18.24000
4 Chinstrap     Dream      NA     NA       48.83382      18.42059
5    Adelie Torgersen      NA     NA       39.03830      18.45106
  flipper_length_mm body_mass_g sex     year
1          188.7955    3709.659  NA 2008.136
2          217.2353    5092.437  NA 2008.067
3          189.9273    3701.364  NA 2008.018
4          195.8235    3733.088  NA 2007.971
5          191.5319    3708.511  NA 2008.021

A.4.2 Create a summary table

Sometimes we want to create a summary table grouped by some variable. This can be done in the following ways.

Using Tidyverse
Using base R

Important

If the pipe operator |> is unkown to you see Section A.2.

To compute a summary table of a numeric variable we can use the summarise() function from the dplyr package in Tidyverse. Remember to load the tidyverse package (library(tidyverse)) - this package includes dplyr as well as a whole lot of other nice packages.

For this example we want some summary statistics for the body_mass_g variable grouped by species.

penguins |>
  drop_na() |> # Remove rows with missing values
  summarise(
    N = length(body_mass_g),
    Mean = mean(body_mass_g),
    Median = median(body_mass_g),
    Std = sd(body_mass_g),
    IQR = IQR(body_mass_g),
    .by = species # Group by species
  )

# A tibble: 3 × 6
  species       N  Mean Median   Std   IQR
  <fct>     <int> <dbl>  <dbl> <dbl> <dbl>
1 Adelie      146 3706.   3700  459.  638.
2 Gentoo      119 5092.   5050  501.  800 
3 Chinstrap    68 3733.   3700  384.  462.

Be aware that the drop_na() function removes all rows with missing values - if this is not done all columns with missing values will return NA.

If we want to group by multiple variables, for example species and island, we just need to pass another grouping variable to the list.

penguins |>
  drop_na() |> # Remove rows with missing values
  summarise(
    N = length(body_mass_g),
    Mean = mean(body_mass_g),
    Median = median(body_mass_g),
    Std = sd(body_mass_g),
    IQR = IQR(body_mass_g),
    .by = c(island, species) # Group by island and species
  )

# A tibble: 5 × 7
  island    species       N  Mean Median   Std   IQR
  <fct>     <fct>     <int> <dbl>  <dbl> <dbl> <dbl>
1 Torgersen Adelie       47 3709.   3700  452.  662.
2 Biscoe    Adelie       44 3710.   3750  488.  588.
3 Dream     Adelie       55 3701.   3600  449.  588.
4 Biscoe    Gentoo      119 5092.   5050  501.  800 
5 Dream     Chinstrap    68 3733.   3700  384.  462.

To compute a summary table of a numeric variable we can use the aggregate() function. This is included in base R, so there is no need to load any packages.

For this example we want some summary statistics for the body_mass_g variable grouped by species.

# Remove rows with missing values 
penguins_clean <- na.omit(penguins)

# Define variables for grouping
my_groups <- list(species = penguins_clean$species)

# Compute summary stats
penguins_n <- aggregate(penguins_clean, my_groups, length)
penguins_mean <- aggregate(penguins_clean, my_groups, mean)
penguins_median <- aggregate(penguins_clean, my_groups, mean)
penguins_std <- aggregate(penguins_clean, my_groups, mean)
penguins_iqr <- aggregate(penguins_clean, my_groups, mean)

# Collect everything in a dataframe
penguins_summary <- data.frame(
  "Species" = penguins_n$species,
  "N" = penguins_n$body_mass_g,
  "Mean" = penguins_mean$body_mass_g,
  "Median" = penguins_median$body_mass_g,
  "Std" = penguins_std$body_mass_g,
  "IQR" = penguins_iqr$body_mass_g
  )

print(penguins_summary)

    Species   N     Mean   Median      Std      IQR
1    Adelie 146 3706.164 3706.164 3706.164 3706.164
2 Chinstrap  68 3733.088 3733.088 3733.088 3733.088
3    Gentoo 119 5092.437 5092.437 5092.437 5092.437

If we want to group by multiple variables, for example species and island, we just need to pass another grouping variable to the my_groups list.

# Define variables for grouping
my_groups <- list(species = penguins_clean$species,
                  island = penguins_clean$island)

# Compute summary stats
penguins_n <- aggregate(penguins_clean, my_groups, length)
penguins_mean <- aggregate(penguins_clean, my_groups, mean)
penguins_median <- aggregate(penguins_clean, my_groups, mean)
penguins_std <- aggregate(penguins_clean, my_groups, mean)
penguins_iqr <- aggregate(penguins_clean, my_groups, mean)

# Collect everything in a dataframe
penguins_summary <- data.frame(
  "Species" = penguins_n$species,
  "Island" = penguins_n$island,
  "N" = penguins_n$body_mass_g,
  "Mean" = penguins_mean$body_mass_g,
  "Median" = penguins_median$body_mass_g,
  "Std" = penguins_std$body_mass_g,
  "IQR" = penguins_iqr$body_mass_g
)

print(penguins_summary)

    Species    Island   N     Mean   Median      Std      IQR
1    Adelie    Biscoe  44 3709.659 3709.659 3709.659 3709.659
2    Gentoo    Biscoe 119 5092.437 5092.437 5092.437 5092.437
3    Adelie     Dream  55 3701.364 3701.364 3701.364 3701.364
4 Chinstrap     Dream  68 3733.088 3733.088 3733.088 3733.088
5    Adelie Torgersen  47 3708.511 3708.511 3708.511 3708.511