<- c("A", "B", "C", "D", "E", "F")
x
4] x[
[1] "D"
Logical indexing is one of the most powerful data wrangling techniques. It uses logical operators like ==
(equal to) and !=
(not equal to) to subset data based on multiply criteria. Below are some examples to explain indexing, logical operators, and finally subsetting data with logical indexing.
If x
is a vector of the letters A through F in alphabetical order, then we can call the fourth letter (D) by the syntax x[4]
:
<- c("A", "B", "C", "D", "E", "F")
x
4] x[
[1] "D"
If x
is a matrix (or a dataframe) then we can index the data by the syntax x[row, column]
. For example
x[2, 3]
- Extract the data in row 2, column 3.x[ , 3]
- Extract all rows in column 3.x[2, ]
- Extract all columns in row 2.# Create a matrix with 3 columns
<- matrix(c("A", "B", "C", "D", "E", "F"), ncol = 3)
x
# Print full matrix
x
[,1] [,2] [,3]
[1,] "A" "C" "E"
[2,] "B" "D" "F"
# Row 2, column 3
2, 3] x[
[1] "F"
# All rows, column 3
3] x[ ,
[1] "E" "F"
# Row 2, all columns
2, ] x[
[1] "B" "D" "F"
Logical operators are basic statements that return TRUE
or FALSE
(table A.1).
Name | Operator | Returns TRUE |
Returns FALSE |
---|---|---|---|
Equal to | == |
"A" == "A" |
"A" == "B" |
Not equal to | != |
"A" != "B" |
"A" != "A" |
Less than | < |
2 < 3" |
3 < 2 |
Greater than | > |
15 > 10 |
10 > 15 |
Less than or equal to | <= |
10 <= 10 or 9 <= 10 |
10 <= 9 |
Greater than or equal to | >= |
22 >= 22 or 25 >= 22 |
20 >= 15 |
In | %in% |
"A" %in% c("A", "B" "C") |
"F" %in% c("A", "B" "C") |
If we have a vector x
with multiple instances of A
, B
and C
we can index it as:
# Create the vector
<- rep(c("A", "B", "C"), each = 3)
x
# Print the vector
x
[1] "A" "A" "A" "B" "B" "B" "C" "C" "C"
# Check where the elements of x equals "B"
== "B" x
[1] FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
If we want to check multiple statements at once we can use the AND &
and the OR |
operator.
# Check where the elements of x equals "B" OR "C"
== "A" | x == "C" x
[1] TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
# Check where the elements of x equals "B" AND "C"
== "A" & x == "C" x
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
It is possible to use a logical vector (i.e. a vector that only contains TRUE
and FALSE
) to index data. When passing a logical vector to a data object it returns the data in all the places where the vector equals TRUE
.
As an example, we load the palmerpenguins
dataset, inspect it and do some simple logical indexing.
# Inspect the data
head(penguins)
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
# Extract all rows where species equals "Adelie"
<- penguins[penguins$species == "Adelie", ]
x
head(x)
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
# Extract all rows where species equals "Adelie"
# AND body mass in greater than 4000
<- penguins[penguins$species == "Adelie" & penguins$body_mass_g > 4000, ]
x
head(x)
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 <NA> <NA> NA NA NA NA
2 Adelie Torgersen 39.2 19.6 195 4675
3 Adelie Torgersen 42 20.2 190 4250
4 Adelie Torgersen 34.6 21.1 198 4400
5 Adelie Torgersen 42.5 20.7 197 4500
6 Adelie Torgersen 46 21.5 194 4200
# ℹ 2 more variables: sex <fct>, year <int>
In R you have the opportunity to use a special syntax called the pipe operator |>
. It can be thought of as a pipe that “sends data downstream” to the next call in your script.
If we for example want to take the square root of some data df
and afterwards show the first 6 values of that data with head()
we can write
|>
df sqrt() |>
head()
which is equivalent to
head(sqrt(df))
This might not make much sense for this example, but when things get a bit more complex the pipe operator can really help by making code easier to read.
If we for example want to format some data df
and then plot it in ggplot2 we can write the following script
|>
df pivot_longer(
cols = everything(),
names_to = "group",
names_prefix = "X",
values_to = "response"
|>
) mutate(group = str_c("Group ", group)) |>
ggplot(aes(group, response)) +
geom_boxplot()
Without the pipe we would need to create several intermediate variables cluttering up our script and environment in the process.
“Tidy data” is a way of storing data in tables that is commonly used in R. A dataset is considered tidy when it adheres to these two principles:
It is easier to understand by looking at an example:
Let us imagine that we have conducted a weight loss study by weighing 10 persons each week for 4 weeks total. That is
\[ n = 10 \times 4 = 40 \text{ observations.} \]
The data from the study is loaded into R and displayed.
head(weight_data) # Show the first 6 rows of the data
person week1 week2 week3 week4
1 1 63.65420 51.82889 70.43640 69.64181
2 2 82.52186 57.75702 107.99418 71.58120
3 3 46.23159 56.47377 81.30373 60.50684
4 4 87.45990 94.92606 52.25914 75.58260
5 5 43.33043 73.33685 58.24029 85.13847
6 6 95.67937 54.84374 85.67053 92.48265
It is clear that this data does not currently conform to the principles of tidy data. The variable “Which week is this measurement from?” is spread out across the column names of the last 4 columns. That is not practical for working with it in R, and something as simple as plotting the data using ggplot2
would be almost impossible.
In the next section, we use a bit of code to convert this into a tidy data format.
Using Tidyverse (remember to load it with library(tidyverse)
), it is possible to pivot our current data frame into the longer tidy format.
<- weight_data |>
weight_data_tidy pivot_longer(
cols = -person, # Which columns to pivot (everything but "person")
names_to = "week", # Name of the column to store previous column names
values_to = "weight" # Name of the column to store data in
)
head(weight_data_tidy)
# A tibble: 6 × 3
person week weight
<int> <chr> <dbl>
1 1 week1 63.7
2 1 week2 51.8
3 1 week3 70.4
4 1 week4 69.6
5 2 week1 82.5
6 2 week2 57.8
Now the data has been converted into a tidy format. Each row is an observation, and each column is a variable describing a parameter of the observation:
person
is the person number.week
is the week of the weight measurement.weight
is the weight recorded in that observation.Sometimes we want to calculate a statistic per group. There are many different ways of doing this, and in these examples we are going to present a way possible ways.
For these examples we use the Palmer Penguins dataset. It consists of some data regarding the bill size, flipper length and body mass of three penguin species across three islands.
Let us start by taking a look at our data. This can be done via the head()
function or the glimpse()
function from the Tidyverse.
head(penguins) # Show the first 6 rows of dataframe
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
glimpse(penguins)
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
If we want to summarise one statistic, that could be the mean or the standard deviation, grouped by species or island (or both), we can do it in the following ways.
If the pipe operator |>
is unkown to you see Section A.2.
To compute the mean of all numeric variables we can use the summarise()
function from the dplyr package in Tidyverse. Remember to load the tidyverse package (library(tidyverse)
) - this package includes dplyr as well as a whole lot of other nice packages.
For this example we want the mean for all numeric variables grouped by species.
|>
penguins drop_na() |> # Remove rows with missing values
summarise(
across(
.cols = where(is.numeric), # Chose columns that are numeric
.fns = mean # Set the function we want to use
),.by = species # Group by species
)
# A tibble: 3 × 6
species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Adelie 38.8 18.3 190. 3706. 2008.
2 Gentoo 47.6 15.0 217. 5092. 2008.
3 Chinstrap 48.8 18.4 196. 3733. 2008.
Be aware that the function drops all non-numeric variables that are not part of the grouping. So the output of the above code is “missing” the island and sex variables. Also, the drop_na()
function removes all rows with missing values - if this is not done all columns with missing values will return NA
.
If we want to group by multiple variables, for example species and island, we just need to pass a vector of the variables to the .by =
argument.
|>
penguins drop_na() |> # Remove rows with missing values
summarise(
across(where(is.numeric), mean),
.by = c(island, species) # Group by island and species
)
# A tibble: 5 × 7
island species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 Torgersen Adelie 39.0 18.5 192. 3709.
2 Biscoe Adelie 39.0 18.4 189. 3710.
3 Dream Adelie 38.5 18.2 190. 3701.
4 Biscoe Gentoo 47.6 15.0 217. 5092.
5 Dream Chinstrap 48.8 18.4 196. 3733.
# ℹ 1 more variable: year <dbl>
To compute the mean of all variables grouped by species we can use the aggregate()
function. This is included in base R, so there is no need to load any packages.
# Remove rows with missing values
<- na.omit(penguins)
penguins_clean
aggregate(penguins_clean, list(species = penguins_clean$species), mean)
species species island bill_length_mm bill_depth_mm flipper_length_mm
1 Adelie NA NA 38.82397 18.34726 190.1027
2 Chinstrap NA NA 48.83382 18.42059 195.8235
3 Gentoo NA NA 47.56807 14.99664 217.2353
body_mass_g sex year
1 3706.164 NA 2008.055
2 3733.088 NA 2007.971
3 5092.437 NA 2008.067
Be aware that all non-numeric variables will return NA
(and thus a lot of warnings). Also, the na.omit()
function removes all rows with missing values - if this is not done all columns with missing values will return NA
.
If we want to group by multiple variables, for example species and island, we just need to pass another grouping variable to the list.
aggregate(penguins_clean,
list(species = penguins_clean$species,
island = penguins_clean$island),
mean)
species island species island bill_length_mm bill_depth_mm
1 Adelie Biscoe NA NA 38.97500 18.37045
2 Gentoo Biscoe NA NA 47.56807 14.99664
3 Adelie Dream NA NA 38.52000 18.24000
4 Chinstrap Dream NA NA 48.83382 18.42059
5 Adelie Torgersen NA NA 39.03830 18.45106
flipper_length_mm body_mass_g sex year
1 188.7955 3709.659 NA 2008.136
2 217.2353 5092.437 NA 2008.067
3 189.9273 3701.364 NA 2008.018
4 195.8235 3733.088 NA 2007.971
5 191.5319 3708.511 NA 2008.021
Sometimes we want to create a summary table grouped by some variable. This can be done in the following ways.
If the pipe operator |>
is unkown to you see Section A.2.
To compute a summary table of a numeric variable we can use the summarise()
function from the dplyr package in Tidyverse. Remember to load the tidyverse package (library(tidyverse)
) - this package includes dplyr as well as a whole lot of other nice packages.
For this example we want some summary statistics for the body_mass_g variable grouped by species.
|>
penguins drop_na() |> # Remove rows with missing values
summarise(
N = length(body_mass_g),
Mean = mean(body_mass_g),
Median = median(body_mass_g),
Std = sd(body_mass_g),
IQR = IQR(body_mass_g),
.by = species # Group by species
)
# A tibble: 3 × 6
species N Mean Median Std IQR
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 Adelie 146 3706. 3700 459. 638.
2 Gentoo 119 5092. 5050 501. 800
3 Chinstrap 68 3733. 3700 384. 462.
Be aware that the drop_na()
function removes all rows with missing values - if this is not done all columns with missing values will return NA
.
If we want to group by multiple variables, for example species and island, we just need to pass another grouping variable to the list.
|>
penguins drop_na() |> # Remove rows with missing values
summarise(
N = length(body_mass_g),
Mean = mean(body_mass_g),
Median = median(body_mass_g),
Std = sd(body_mass_g),
IQR = IQR(body_mass_g),
.by = c(island, species) # Group by island and species
)
# A tibble: 5 × 7
island species N Mean Median Std IQR
<fct> <fct> <int> <dbl> <dbl> <dbl> <dbl>
1 Torgersen Adelie 47 3709. 3700 452. 662.
2 Biscoe Adelie 44 3710. 3750 488. 588.
3 Dream Adelie 55 3701. 3600 449. 588.
4 Biscoe Gentoo 119 5092. 5050 501. 800
5 Dream Chinstrap 68 3733. 3700 384. 462.
To compute a summary table of a numeric variable we can use the aggregate()
function. This is included in base R, so there is no need to load any packages.
For this example we want some summary statistics for the body_mass_g variable grouped by species.
# Remove rows with missing values
<- na.omit(penguins)
penguins_clean
# Define variables for grouping
<- list(species = penguins_clean$species)
my_groups
# Compute summary stats
<- aggregate(penguins_clean, my_groups, length)
penguins_n <- aggregate(penguins_clean, my_groups, mean)
penguins_mean <- aggregate(penguins_clean, my_groups, mean)
penguins_median <- aggregate(penguins_clean, my_groups, mean)
penguins_std <- aggregate(penguins_clean, my_groups, mean)
penguins_iqr
# Collect everything in a dataframe
<- data.frame(
penguins_summary "Species" = penguins_n$species,
"N" = penguins_n$body_mass_g,
"Mean" = penguins_mean$body_mass_g,
"Median" = penguins_median$body_mass_g,
"Std" = penguins_std$body_mass_g,
"IQR" = penguins_iqr$body_mass_g
)
print(penguins_summary)
Species N Mean Median Std IQR
1 Adelie 146 3706.164 3706.164 3706.164 3706.164
2 Chinstrap 68 3733.088 3733.088 3733.088 3733.088
3 Gentoo 119 5092.437 5092.437 5092.437 5092.437
If we want to group by multiple variables, for example species and island, we just need to pass another grouping variable to the my_groups
list.
# Define variables for grouping
<- list(species = penguins_clean$species,
my_groups island = penguins_clean$island)
# Compute summary stats
<- aggregate(penguins_clean, my_groups, length)
penguins_n <- aggregate(penguins_clean, my_groups, mean)
penguins_mean <- aggregate(penguins_clean, my_groups, mean)
penguins_median <- aggregate(penguins_clean, my_groups, mean)
penguins_std <- aggregate(penguins_clean, my_groups, mean)
penguins_iqr
# Collect everything in a dataframe
<- data.frame(
penguins_summary "Species" = penguins_n$species,
"Island" = penguins_n$island,
"N" = penguins_n$body_mass_g,
"Mean" = penguins_mean$body_mass_g,
"Median" = penguins_median$body_mass_g,
"Std" = penguins_std$body_mass_g,
"IQR" = penguins_iqr$body_mass_g
)
print(penguins_summary)
Species Island N Mean Median Std IQR
1 Adelie Biscoe 44 3709.659 3709.659 3709.659 3709.659
2 Gentoo Biscoe 119 5092.437 5092.437 5092.437 5092.437
3 Adelie Dream 55 3701.364 3701.364 3701.364 3701.364
4 Chinstrap Dream 68 3733.088 3733.088 3733.088 3733.088
5 Adelie Torgersen 47 3708.511 3708.511 3708.511 3708.511