Day 1: Data wrangling

Intro to the dataset and class setup (15 min)

Set up your RStudio project

Loading and viewing the dataset

surveys <- read.csv("surveys.csv")
species <- read.csv("species.csv")
plots <- read.csv("plots.csv")
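
Once the files are loaded, a few base R functions give a quick overview of a data frame. The sketch below uses a tiny made-up stand-in (surveys_demo, with invented values) so that it runs without the CSV files; the same calls apply to surveys, species, and plots.

```r
# Hypothetical stand-in for surveys; the values are invented for illustration.
surveys_demo <- data.frame(
  year = c(1977, 1977, 1978, 1978),
  species_id = c("DS", "DM", "DS", NA),
  weight = c(40, NA, 48, 117)
)

head(surveys_demo, 2)   # first two rows
str(surveys_demo)       # column names, types, and a preview of the values
nrow(surveys_demo)      # number of rows
names(surveys_demo)     # column names
```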

The dplyr R package

Installing vs loading packages
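
A minimal sketch of the distinction: install.packages() downloads a package onto the computer and only needs to run once, whereas library() loads it and must run in every new R session. The install line is commented out on the assumption that dplyr is already installed.

```r
# Install once per computer (downloads the package from CRAN).
# Commented out so the script does not reinstall on every run.
# install.packages("dplyr")

# Load in every new R session before calling dplyr functions.
library(dplyr)
```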

Selecting columns

select(surveys, year, month, day)
head(surveys)
select(surveys, month, day, year)

Mutating data

mutate(surveys, hindfoot_length_cm = hindfoot_length / 10)
surveys_plus <- mutate(surveys,
                       hindfoot_length_cm = hindfoot_length / 10)
surveys <- mutate(surveys,
                  hindfoot_length_cm = hindfoot_length / 10)

Arranging (sorting) data

arrange(surveys, weight)
arrange(surveys, plot_id, year, month, day)
head(surveys)
arrange(surveys, desc(weight))

Filtering values

1 == 1
1 == 2
1 != 2
1 > 2
1 > 1
1 >= 1
1 < 2
1 <= 2
"A" == "A"
"A" == "a"
"A" != "a"
filter(surveys, species_id == "DS")
filter(surveys, species_id != "DS")
filter(surveys, species_id == "DS", year > 1995)
# Alternatively, we can use the ampersand (&), which is called the AND operator:
filter(surveys, species_id == "DS" & year > 1995)
filter(surveys, species_id == "DS" | species_id == "DM" | species_id == "DO")

Filtering missing values

filter(surveys, weight != NA)
NA > 3  # NA, because we don't know whether the missing value is larger than 3
NA == NA  # also NA: both values are missing and the true values could differ, so the correct answer is "I don't know"
surveys$weight == NA
is.na(NA)
is.na(3)
is.na(surveys$weight)
filter(surveys, is.na(weight))
is.na(3)
!is.na(3)
filter(surveys, !is.na(weight))
filter(surveys, species_id == "DS", !is.na(weight))
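
To see why the != NA comparison fails inside filter(), the sketch below runs both versions on a small made-up data frame (surveys_demo, invented values). It assumes dplyr is installed. filter() keeps only the rows where the condition is TRUE, and a comparison with NA is never TRUE.

```r
library(dplyr)

# Hypothetical stand-in for surveys; the values are invented for illustration.
surveys_demo <- data.frame(
  species_id = c("DS", "DS", "DM"),
  weight = c(40, NA, 44)
)

# Every comparison with NA yields NA, so no row passes the filter.
wrong <- filter(surveys_demo, weight != NA)
nrow(wrong)   # 0

# is.na() returns TRUE or FALSE, so negating it keeps the rows with a weight.
right <- filter(surveys_demo, !is.na(weight))
nrow(right)   # 2
```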

Solo In-class Exercise (30 min)

Exercise 1: Data manipulation

  1. Load surveys.csv into R using read.csv().
  2. Use select() to create a new data frame object called surveys1 with just the year, month, day, and species_id columns in that order.
  3. Create a new data frame called surveys2 with the year, species_id, and weight in kilograms of each individual, with no missing (NA) weights. Use mutate(), select(), and filter() with !is.na(). The weight in the table is given in grams, so you will need to create a new column called “weight_kg” for weight in kilograms by dividing the weight column by 1000.
  4. Use the filter() function to get all of the rows in the data frame surveys2 for the species ID “SH”.


Day 2: Pipes

The usual analysis workflow: intermediate variables and nesting functions

x <- c(1, 2, 3)
mean_x <- mean(x)
sqrt_x <- sqrt(mean_x)
sqrt_x
sqrt(mean(c(1, 2, 3)))

Pipes
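
A pipe passes the value on its left as the first argument of the function on its right, so the steps read in the order they happen. As a sketch, here is the square-root-of-the-mean calculation written with the native pipe |>, which assumes R 4.1 or later; the magrittr pipe %>% (available once dplyr is loaded) behaves the same way in this case.

```r
x <- c(1, 2, 3)

# Nested form: read from the inside out.
nested <- sqrt(mean(x))

# Piped form: read from top to bottom.
piped <- x |>
  mean() |>
  sqrt()

# Both forms compute the same value.
identical(nested, piped)
```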

Joint In-class Exercise 3

  1. Write the necessary code using intermediate variables to manipulate the data as follows:
    • (a) Use mutate(), select(), and filter() with is.na() to create a new data frame with the year, species_id, and weight in kilograms of each individual, with no missing (NA) weights. Create a new data object called surveys1. Remember: the weight in the table is given in grams, so you will need to create a new column called “weight_kg” for weight in kilograms by dividing the weight column by 1000.
    • (b) Use filter() with is.na() and select() to get the year, month, day, and species_id columns for all of the rows in the data frame where species_id is SH and the weight is not missing (NA). Create a new data object called surveys2.
  2. Write the same code but using pipes (either |> or %>%).

Solo In-class Exercise 4

The following code is written using intermediate variables. It obtains the rows where species_id is “DS” and weight is not missing, sorted by year, keeping only the year and weight columns. Rewrite it using pipes so that it produces the same output.

ds_data <- filter(surveys, species_id == "DS", !is.na(weight))
ds_data_by_year <- arrange(ds_data, year)
ds_weight_by_year <- select(ds_data_by_year, year, weight)

What if I want to pipe to an argument other than the first argument?

surveys %>%
  lm(weight ~ year, data = .)
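
With the native pipe, the equivalent placeholder is an underscore, which must be passed to a named argument and assumes R 4.2 or later. The sketch below uses a small made-up data frame (surveys_demo, invented values) so that it runs on its own.

```r
# Hypothetical stand-in for surveys; the values are invented for illustration.
surveys_demo <- data.frame(
  year = c(1990, 1991, 1992, 1993),
  weight = c(40, 42, 45, 43)
)

# `_` marks where the piped data frame goes; it must be given to a named argument.
demo_lm <- surveys_demo |>
  lm(weight ~ year, data = _)

coef(demo_lm)   # intercept and slope of the fitted line
```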

Solo In-class Exercise 5

Use pipes to evaluate and summarize the relationship between weight and year for the species “DS”. Make sure that you filter out missing values in weight. The code in sequential form would look like the following:

surveys_DS <- filter(surveys, species_id == "DS", !is.na(weight))
surveys_DS_lm <- lm(weight ~ year, data = surveys_DS)
summary(surveys_DS_lm)

Data grouping (also called data aggregation)

Basic grouping

group_by(surveys, year)
group_by(surveys, plot_id, year)

Summarizing data from groupings

surveys_by_year <- group_by(surveys, year)
year_counts <- summarize(surveys_by_year, abundance = n())
surveys_by_plot_year <- group_by(surveys, plot_id, year)
plot_year_counts <- summarize(surveys_by_plot_year, abundance = n())
plot_year_counts <- surveys |>
  group_by(plot_id, year) |>
  summarize(abundance = n())
surveys |>
  group_by(plot_id, year) |>
  summarize(abundance = n(), avg_weight = mean(weight))
surveys |>
  group_by(plot_id, year) |>
  summarize(abundance = n(),
            avg_weight = mean(weight, na.rm = TRUE)) |>
  filter(!is.na(avg_weight))
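
The final filter() in the pipeline above is needed because of how mean() treats missing values. A base R sketch with made-up numbers:

```r
weights <- c(10, NA, 30)

mean(weights)                 # NA: one missing value makes the mean unknown
mean(weights, na.rm = TRUE)   # 20: missing values are dropped first

# A group where every weight is missing has nothing left to average,
# so the result is NaN, which is.na() also reports as missing.
all_missing <- c(NA_real_, NA_real_)
mean(all_missing, na.rm = TRUE)
is.na(mean(all_missing, na.rm = TRUE))
```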

Solo In-class Exercise 6

Exercise 3: Data aggregation

  1. Use the group_by() and summarize() functions to get a count of the number of individuals in each species ID.
  2. Use the group_by() and summarize() functions to get a count of the number of individuals in each species ID in each year.
  3. Use the filter(), group_by(), and summarize() functions to get the mean mass of species DO in each year.