Day 1: Data wrangling

Intro to the data set and class setup (15 min)

Set up your RStudio project

Loading and viewing the dataset

surveys <- read.csv("surveys.csv")
species <- read.csv("species.csv")
plots <- read.csv("plots.csv")

The dplyr R package

Installing vs loading packages
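
A minimal sketch of the difference, assuming a machine with internet access: a package is installed once per machine, but it has to be loaded in every new R session. The requireNamespace() guard is an addition for illustration that skips the download when dplyr is already installed.

```r
# Install once per machine (needs an internet connection).
# The requireNamespace() check skips the download if dplyr is already installed.
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Load at the start of every R session in which you use dplyr functions.
library(dplyr)
```

In a script, it is common to keep only the library() call and to run install.packages() by hand in the console.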

Selecting columns

select(surveys, year, month, day)
head(surveys)
select(surveys, month, day, year)

Mutating data

mutate(surveys, hindfoot_length_cm = hindfoot_length / 10)
surveys_plus <- mutate(surveys,
                       hindfoot_length_cm = hindfoot_length / 10)
surveys <- mutate(surveys,
                  hindfoot_length_cm = hindfoot_length / 10)

Arranging (sorting) data

arrange(surveys, weight)
arrange(surveys, plot_id, year, month, day)
head(surveys)
arrange(surveys, desc(weight))

Filtering values

1 == 1
1 == 2
1 != 2
1 > 2
1 > 1
1 >= 1
1 < 2
1 <= 2
"A" == "A"
"A" == "a"
"A" != "a"
filter(surveys, species_id == "DS")
filter(surveys, species_id != "DS")
filter(surveys, species_id == "DS", year > 1995)
# Alternatively, we can use the ampersand symbol (&), which is called the AND operator:
filter(surveys, species_id == "DS" & year > 1995)
filter(surveys, species_id == "DS" | species_id == "DM" | species_id == "DO")
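
A chain of | comparisons against the same column can get long. As a hedged sketch, base R's %in% operator expresses the same condition more compactly. The toy data frame below is hypothetical and only stands in for surveys.

```r
library(dplyr)

# Hypothetical toy data standing in for the surveys table
toy <- data.frame(
  species_id = c("DS", "DM", "DO", "PP", "DS"),
  year = c(1990, 1996, 1997, 1998, 1999)
)

# Three comparisons joined with the OR operator
with_or <- filter(toy,
                  species_id == "DS" | species_id == "DM" | species_id == "DO")

# The same condition written with %in% and a vector of accepted values
with_in <- filter(toy, species_id %in% c("DS", "DM", "DO"))

# Both approaches keep the same rows
all.equal(with_or, with_in)
```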

Filtering missing values

filter(surveys, weight != NA)  # returns zero rows, because every comparison with NA is NA
NA > 3  # is obviously NA because we don't know if the missing value is larger than 3 or not
NA == NA  # the same with this, we have two missing values but the true values could be quite different, so the correct answer is "I don't know."
surveys$weight == NA
is.na(NA)
is.na(3)
is.na(surveys$weight)
filter(surveys, is.na(weight))
is.na(3)
!is.na(3)
filter(surveys, !is.na(weight))
filter(surveys, species_id == "DS", !is.na(weight))
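
To see the two behaviours side by side without scrolling through the full table, here is a small sketch on a hypothetical three-row data frame (toy is not part of the course data).

```r
library(dplyr)

# Hypothetical toy data: one of the three weights is missing
toy <- data.frame(
  species_id = c("DS", "DS", "DM"),
  weight = c(40, NA, 35)
)

# Every comparison with NA evaluates to NA, and filter() keeps only rows
# where the condition is TRUE, so no rows are returned.
no_rows <- filter(toy, weight != NA)
nrow(no_rows)

# is.na() returns TRUE or FALSE for each value, so negating it keeps
# the rows that have a recorded weight.
known <- filter(toy, !is.na(weight))
nrow(known)
```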

Solo In-class Exercise (30 min)

Exercise 1: Data manipulation

  1. Load surveys.csv into R using read.csv().
  2. Use select() to create a new data frame object called surveys1 with just the year, month, day, and species_id columns in that order.
  3. Create a new data frame called surveys2 with the year, species_id, and weight in kilograms of each individual, with no missing (NA) weights. Use mutate(), select(), and filter() with !is.na(). The weight in the table is given in grams, so you will need to create a new column called “weight_kg” for weight in kilograms by dividing the weight column by 1000.
  4. Use the filter() function to get all of the rows in the data frame surveys2 for the species ID “SH”.


Day 2: Pipes

Set up your RStudio project

The usual analysis workflow: intermediate variables and nesting functions

x <- c(1, 2, 3)
mean_x <- mean(x)
sqrt_x <- sqrt(mean_x)
sqrt_x
sqrt(mean(x = c(1, 2, 3)))

Pipes
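
A pipe passes the value on its left as the first argument of the function on its right. As a minimal sketch, the square root of the mean can be written in nested form and in piped form. It uses the native pipe |>, which assumes R 4.1.0 or later; the names nested and piped are only illustrative.

```r
x <- c(1, 2, 3)

# Nested form: read from the inside out
nested <- sqrt(mean(x))

# Piped form: read from top to bottom, one step per line
piped <- x |>
  mean() |>
  sqrt()

# Both forms compute the same value
identical(nested, piped)
```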

Assigning the output of a pipe

There are two options to assign the output of a pipe to an object/variable name. We can do the assignment at the beginning of the pipe or at the end of it.

⬅️ At the beginning of the operation, the assignment goes from right to left:

my_result <- c(1, 2, 3, NA) |>
  mean(na.rm = TRUE) |>
  sqrt()

➡️ At the end of the operation, the assignment goes from left to right:

c(1, 2, 3, NA) |>
  mean(na.rm = TRUE) |>
  sqrt() -> my_result

Joint In-class exercise

Exercise 2: Data manipulation with pipes

This is a follow-up to Exercise 1. You have to redo Exercise 1 using pipes (either |> or %>%) instead of nested or sequential code with intermediate variable assignment.

  1. Load surveys.csv into R using read.csv().
  2. Use select() to create a new data frame object called surveys1 with just the year, month, day, and species_id columns in that order.
  3. Create a new data frame called surveys2 with the year, species_id, and weight in kilograms of each individual, with no missing (NA) weights. Use mutate(), select(), and filter() with !is.na(). The weight in the table is given in grams, so you will need to create a new column called weight_kg that stores the weight in kilograms by dividing the weight column by 1000.
  4. Use the filter() function to get all of the rows in the data frame surveys2 for the species ID "SH".

A minute feedback for class 15

Homework exercise

Exercise 3: Pipe practice

The following code is written using intermediate variables. It gets the rows where "species_id" is "DS" and weight is not missing, sorts them by year, and keeps only the year and weight columns. Rewrite the code using pipes so that it produces the same output.

ds_data <- filter(surveys,
                  species_id == "DS",
                  !is.na(weight))
ds_data_by_year <- arrange(ds_data, year)
ds_weight_by_year <- select(ds_data_by_year,
                            year,
                            weight)


Day 3: Pipes, continued

Review Visualization Homework Exercise 4 (15 min)

Review Data Wrangling Homework Exercise 3 (15 min)

Challenge of the day: Using the pipe shortcut


What if I want to pipe to an argument other than the first argument?

surveys %>%
 lm(weight ~ year, data = .)
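
The dot in the call above is the placeholder of the %>% pipe. The native pipe |> has its own placeholder, the underscore _, which requires R 4.2.0 or later and can only be used with a named argument. A minimal sketch on a hypothetical toy data frame (not the course data):

```r
# Hypothetical toy data standing in for the surveys table
toy <- data.frame(
  year = c(1990, 1991, 1992, 1993),
  weight = c(40, 42, 41, 45)
)

# The underscore marks where the piped data frame goes.
# It must be passed to a named argument, here data.
fit <- toy |>
  lm(weight ~ year, data = _)

# Intercept and slope of the fitted line
coef(fit)
```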

Solo In-class exercise

Exercise 4: Piping placeholders

Use pipes to evaluate and summarize the relationship between weight and year for the species "DS". Make sure that you filter out missing values in weight. The code in sequential form would look like the following:

surveys_DS <- filter(surveys,
                     species_id == "DS",
                     !is.na(weight))
surveys_DS_lm <- lm(weight ~ year,
                    data = surveys_DS)
summary(surveys_DS_lm)

Data grouping (also called data aggregation)

In data analysis, it is common to want summary statistics of a variable for each group of observations. For example: “Is the average height of one species the same as that of a different species?”

A major strength of dplyr is the ability to group the data by a variable or variables and then operate on the data “by group”. In this way, data manipulations can be done on groups defined by variables.
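
As a hedged sketch of what operating “by group” means, the example below answers the average-height question on a hypothetical toy data frame. The species and height columns are made up for illustration and do not come from the course data.

```r
library(dplyr)

# Hypothetical toy data: two species with a few height measurements each
toy <- data.frame(
  species = c("A", "A", "B", "B", "B"),
  height = c(10, 20, 30, 40, 50)
)

# group_by() marks the groups; summarize() then computes one row per group
avg_by_species <- toy |>
  group_by(species) |>
  summarize(avg_height = mean(height))

avg_by_species
```

The result has one row per species, with the mean computed separately within each group rather than over the whole column.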

Basic grouping

group_by(surveys, year)
group_by(surveys, plot_id, year)

Summarizing data from groupings

surveys_by_year <- group_by(surveys, year)
year_counts <- summarize(surveys_by_year, abundance = n())
surveys_by_plot_year <- group_by(surveys, plot_id, year)
plot_year_counts <- summarize(surveys_by_plot_year, abundance = n())
plot_year_counts <- surveys |>
  group_by(plot_id, year) |>
  summarize(abundance = n())
surveys |>
  group_by(plot_id, year) |>
  summarize(abundance = n(), avg_weight = mean(weight))
surveys |>
  group_by(plot_id, year) |>
  summarize(abundance = n(),
            avg_weight = mean(weight, na.rm = TRUE)) |>
  filter(!is.na(avg_weight))

Solo In-class Exercise

Exercise 5: Data aggregation

  1. Use the group_by() and summarize() functions to get a count of the number of individuals in each species ID.
  2. Use the group_by() and summarize() functions to get a count of the number of individuals in each species ID in each year.
  3. Use the filter(), group_by(), and summarize() functions to get the mean mass of species DO in each year.

A minute feedback for class 16

Home exercises