Data wrangling

Day 1: Data wrangling

Intro to the data set and class set up (15 min)

We will be working with data from the Portal Project Teaching Database.
All data is available on figshare.
It is a simplified version of data from study Ernest etal. 2016. Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA.
See Portal, Arizona in the map.
The original database is published at Ecological Archives
Long-term experimental study, showing a time-series for a small mammal community in southern Arizona. This is part of a project studying the effects of rodents and ants on the plant community that has been running for almost 40 years. The rodents are sampled on a series of 24 plots, with different experimental manipulations controlling which rodents are allowed to access which plots.
- 24 experimental and control plots
- Experimental manipulations over the years include removal of all or some rodent species, all or some ants, seed additions, and various alterations of the annual plant community.
The dataset is composed of three tables, which includes information on the site, date, species identification, weight and sampling plot (within the site) for some small mammals in Arizona.
Each table is stored as a CSV file.
CSV stands for “comma separated values”
CSV is common way of storing data that can be used across programming and data management software
If we look at one of these files we can see that
- It is plain text, so any program can read it
- The first row is the header row, with different column headers separated by commas
- All of the other rows are the data, again with different columns separated by commas
- And so each of the values is separated by commas, hence “comma separated values”

Setup your RStudio project

Remember! Working in projects is considered a best practice in data science.
Open your RStudio project. I called mine data-science.
Click on the names of the following files to download them:
- surveys.csv - main table, one row for each rodent captured, date on date, location, species ID, sex, and weight in grams and hindfoot length in millimeters.
- species.csv - latin species names for each species ID + general taxon
- plots.csv - information on the experimental manipulations at the site
Save the three CSV files into your “data-raw” folder.
- Don’t know where the CSV files are? Right click -> Save link as.
Create a new Rmd file and save it in the “documents” folder with the name “wrangling-portal.Rmd”.
Use markdown syntax to create a description of this dataset.

Loading and viewing the dataset

Load the three CSV files into R using the function read.csv() from the package utils.
Remember to use relative paths.

surveys <- read.csv("surveys.csv")
species <- read.csv("species.csv")
plots <- read.csv("plots.csv")

Once loaded you can display a data table by clicking on it in the “Environment” tab.
This data set is a great example of a good tabular data structure
- One table per type of data
  - Tables can be linked together to combine information.
- Each row contains a single record.
  - A single observation or data point
- Each column or field contains a single attribute.
  - A single type of information

The `dplyr` R package

A modern library of R functions for data manipulation
https://dplyr.tidyverse.org/
“dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges”.
When in doubt, use the cheat sheet:

Installing vs loading packages

Reminder: Packages are the main way that reusable code is shared in R
Combination of code, data, and documentation
R has a rich ecosystem of packages for data manipulation & analysis
Download and install packages with the R console:
- install.packages("dplyr")
Even if we’ve installed a package it is NOT automatically available to do analysis with
This is because sometimes different packages have functions with the same name
So don’t want to have to worry about functions and packages we’ve installed every time we write a piece of code
To use a package:
- Load all of the functions in the package with library("dplyr")
- Use the syntax package_name::function_name

Selecting columns

Create a new data frame with your columns of interest, in any order you prefer using the function select().
For example, get a new data frame with the column “year”, “month” and “day” from the surveys data frame:

select(surveys, year, month, day)

The function select() does not modifiy the original object. Check this by comparing the output we just produced to surveys:

head(surveys)

Columns will follow the same order given in the function:

select(surveys, month, day, year)

Mutating data

The function mutate() creates new data columns based on data from existing columns
It allows to calculate new values using data from the original columns
For example, the column "hindfoot_length" stores measurements in millimeters (mm). To create a new column showing the length of the hindfoot in centimeters (cm), we have to divide the values in mm from the original column "hindfoot_length" by 10:

mutate(surveys, hindfoot_length_cm = hindfoot_length / 10)

Similar to select(), when we open surveys, it does not contain a new column, because the function mutate() does not modify the original object; it produces a new object with the new column.
To store the new data frame for later use we need to assign it to an object:

surveys_plus <- mutate(surveys,
                       hindfoot_length_cm = hindfoot_length / 10)

Or we could overwrite the existing data frame if we don’t need it

surveys <- mutate(surveys,
                  hindfoot_length_cm = hindfoot_length / 10)

Arranging (sorting) data

We can sort the data in the table using the function arrange()
Let’s sort the surveys table by the values on the column "weight":

arrange(surveys, weight)

We can also sort by multiple columns, so if we wanted to sort first by plot_id and then by date

arrange(surveys, plot_id, year, month, day)

Similar to select() and mutate(), the function arrange() does not modifiy the original object. Check this by comparing the output we just produced to surveys:

head(surveys)

We can reverse the order of the sort by wrapping "weight" in the function desc() (for “descending order”):

arrange(surveys, desc(weight))

Filtering values

Use filter() to keep only the rows that meet certain criteria.
We use relational or logical statements to evaluate and establish the criteria to filter on.
Some examples of relational statements are:

== 1
== 2
!= 2
> 2
> 1
>= 1
< 2
<= 2
"A" == "A"
"A" == "a"
"A" != "a"

As with all other function that we have seen, the first argument specifies the data set we want to work with
The second and following arguments are relational/conditional statements that specify the filtering criteria.
For usage inside the function filter(), relational/conditional statements are coded as: column name relational/conditional statement value
For example, to filter the surveys data frame to only keep the data with the species id “DS”:
- Type the name of the function and open parentheses, filter()
- The name of the data frame we want to filter, surveys
- The column the want to filter on, "species_id"
- The relational statement, which is == for is equal to
- And then the value we want to evaluate, "DS"
- "DS" here is a string, not a variable or a column name, so we enclose it in quotation marks

filter(surveys, species_id == "DS")

Like the examples above, we can have a relational statement that is not equal to using !=.
For example, to filter the data for all species except “DS”:

filter(surveys, species_id != "DS")

We can also filter on multiple criteria at once by adding more relational statements
To indiciate that we want all the criteria to be TRUE, we have two ways:
1. We can add more criteria for filtering by separating them with a comma ,.
For example, to get the data on species "DS" for the year 1995 and above:

filter(surveys, species_id == "DS", year > 1995)

Alternatively, we can use the ampersand & symbol, which is called the AND operator:

filter(surveys, species_id == "DS" & year > 1995)

Combining criteria with a comma , or with the AND operator & indicates that we want all criteria to be met.
To indicate that not all statements have to be true (that is, one or more of the statements can be true), we combine statements using the vertical bar symbol |, which is called the OR operator.
For example, to get data for all of the Dipodomys species, with different species id (“DS”, “DM”, and “DO”)

filter(surveys, species_id == "DS" | species_id == "DM" | species_id == "DO")

The & and | allow building filtering criteria that are as complex as needed.

Filtering missing values

NA is a special value in R that represents data that is missing and should be treated differently.
Think of NA as “I do not know what’s there”.
One of the common tasks we use filter for is removing missing values from data
We know that the column weight has many missing values, so let’s filter out these values.
Based on what we learned before it’s natural to think that we do this by using the condition weight != NA

filter(surveys, weight != NA)

But that does not work. Let’s try some equivalences:

NA > 3  # is obviously NA because we don't know if the missing value is larger than 3 or not
NA == NA  # the same with this, we have two missing values but the true values could be quite different, so the correct answer is "I don't know."
surveys$weight == NA

Null values like NA are special: they are incomparable.
We don’t want to accidentally say that two “missing” things are the same, because we don’t know if they are
So we use special functions
The function is.na() checks if a value is NA

is.na(NA)
is.na(3)
is.na(surveys$weight)

filter(surveys, is.na(weight))

To remove null values we combine the function is.na() with the NOT operator !:

is.na(3)
!is.na(3)
filter(surveys, !is.na(weight))

Keep the function is.na() in mind. You will use it often to filter missing values in combination with other statements.
For example, we might want all of the data on a particular species that has weight data:

filter(surveys, species_id == "DS", !is.na(weight))

Solo In-class Exercise (30 min)

Exercise 1: Data manipulation

Load surveys.csv into R using read.csv().
Use select() to create a new data frame object called surveys1 with just the year, month, day, and species_id columns in that order.
Create a new data frame called surveys2 with the year, species_id, and weight in kilograms of each individual, with no null weights. Use mutate(), select(), and filter() with !is.na(). The weight in the table is given in grams so you will need to create a new column called “weight_kg” for weight in kilograms by dividing the weight column by 1000.
Use the filter() function to get all of the rows in the data frame surveys2 for the species ID “SH”.

Day 2: Pipes

Setup your RStudio project

Create a new Rmd file and save it in the “documents” folder with the name “wrangling-pipes.Rmd”.

The usual analysis workflow: intermediate variables and nesting functions

We usually use intermediate variables when writing code
- We run one line of code for each action in sequential order
- Store the output in a variable
- Use that variable later in the code
- Repeat
For example, to get the square root of the mean of a vector of numbers, we would run the mean() function with a numeric vector as the input, assign it to a variable, and then run the function sqrt() using that newly created object as input:

x = c(1, 2, 3)
mean_x <- mean(x)
sqrt_x <- sqrt(mean_x)
sqrt_x

And, in nested form:

sqrt(mean(x = c(1,2,3)))

Pipes

Intermediate variables and nested functions can get cumbersome (and confusing) if there are lots of steps.
%>% or |> (the pipe operator) takes the output of one command and passes it as value for the first argument of the next command.
The |> pipe will work everywhere as long as you have a new enough version of R and RStudio
%>% is the original pipe in R. It is called the “magrittr” pipe, and you have to load the magrittr package to use it (this gets loaded automatically by dplyr)
magrittr has some fancier functionality that may be useful in some cases.
You can use any pipe you like.
You can change the native behaviour to give the base R pipe with: Tools -> Global Options -> Code -> Use native pipe operator
To calculate the mean of a numeric vector, we can directly pipe the vector into the function mean(), no need to create an object:
```
c(1, 2, 3) %>% mean()
```
The numeric vector becomes the first argument in mean
We can add other arguments to the functions as needed. For example, to remove missing values from the mean:
```
x = c(1, 2, 3, NA)
mean(x, na.rm = TRUE)
x %>% mean(na.rm = TRUE)
```
For example, taking the mean of surveys$weight using a pipe gives the same result as doing it using the classic sequential form:
```
mean(surveys$weight, na.rm = TRUE)
surveys$weight |>
mean(na.rm = TRUE)
```
The usefulness (and clarity) of pipes becomes apparent when trying to run more complicated analysis, that require creating intermediate variables or nesting code:
```
c(1, 2, 3, NA) |>
mean(na.rm = TRUE) |>
sqrt()
```

Assigning the output of a pipe

There are two options to assign the output of a pipe to an object/variable name. We can do the assignment at the beginning of the pipe or at the end of it.

⬅️ At the beginning of the operation, the assignment goes from right to left:

my_result <- c(1, 2, 3, NA) |>
  mean(na.rm = TRUE) |>
  sqrt()

➡️ At the end of the operation, the assignment goes from left to right:

c(1, 2, 3, NA) |>
  mean(na.rm = TRUE) |>
  sqrt() -> my_result

Joint In-class exercise

Exercise 2: Data manipulation with pipes

This is a follow up for Exercise 1. Basically, you have to redo Exercise 1 but using pipes (either |> or %>%) instead of nested or sequential code with intermediate variable assignation.

Load surveys.csv into R using read.csv().
Use select() to create a new data frame object called surveys1 with just the year, month, day, and species_id columns in that order.
Create a new data frame called surveys2 with the year, species_id, and weight in kilograms of each individual, with no null weights. Use mutate(), select(), and filter() with !is.na(). The weight in the table is given in grams so you will need to create a new column called weight_kg that stored the weight in kilograms by dividing the weight column by 1000.
Use the filter() function to get all of the rows in the data frame surveys2 for the species ID "SH".

A minute feedback for class 15

Please provide some quick feedback for this session here.

Homework exercise

Exercise 3: Pipe practice

The following code is written using intermediate variables. It obtains the data for "DS" in the "species_id" column, sorted by year, with only the year and weight columns. Write the same code to get the same output but using pipes instead.

ds_data <- filter(surveys,
                  species_id == "DS",
                  !is.na(weight))
ds_data_by_year <- arrange(ds_data, year)
ds_weight_by_year <- select(ds_data_by_year,
                            year,
                            weight)

Day 3: Pipes-continued

Review Visualization Homework Exercise 4 (15 min)

Review of Fitting models with multiple data sets

Review Data Wrangling Homework Exercise 3 (15 min)

Review of Pipe practice

Challenge of the day: Using the pipe shortcut

Shortcut to get the pipe: Ctrl-shift-m.

What if I want to pipe to an argument other than the first argument?

To indicate the position of the piped argument we use the underscore symbol _ for the |> pipe and the dot symbol . for the magrittr pipe (%>%).
This is used when the output goes to an argument other than the first argument of the function.
For example, the function lm() fits a linear model relationship between two variables.
This function takes the formula = as the first argument and the data = containing our two variables as the second argument.
For example, to evaluate if weight has decreased or increased with time, in sequential/nested form:
```
lm(weight ~ year, data = surveys)
```

And using pipes and the placeholder:

surveys |>
 lm(weight ~ year, data = _)

surveys %>%
 lm(weight ~ year, data = .)

We can get the summary of the fitted model with the function summary(); in nested form:
```
summary(lm(weight ~ year, data = surveys))
```

Solo In-class exercise

Exercise 4: Piping placeholders

Use pipes to evaluate and summarize the relationship between weight and year for the species "DS". Make sure that you filter for missing values in weight. The code in sequential form would look like the following:

surveys_DS <- filter(surveys,
                     species_id == "DS",
                     !is.na(weight))
surveys_DS_lm <- lm(weight ~ year,
                    data = surveys_DS)
summary(surveys_DS_lm)

Data grouping (also called data aggregation)

In data analysis, it is common to want for summary statistics of variables based on belonging to a certain group. “Is the average height of one species the same as for a different species?”

A major strength of dplyr is the ability to group the data by a variable or variables and then operate on the data “by group”. In this way, data manipulations can be done on groups defined by variables.

Basic grouping

The function group_by() combines rows into groups based on ONE or MORE columns.
group_by() function has two arguments: 1) data to work on; 2) name of column (or columns) to group by
For example, to group surveys by year:

group_by(surveys, year)

The output is a tibble, a different looking kind of data.frame.
- The tibble contains information on the data source, the groupings, and the data type.
We can also group by multiple columns.
For example, group by “plot id” and by “year”:

group_by(surveys, plot_id, year)

Summarizing data from groupings

After grouping a data frame we can create a new table that summarizes information for each group. To do this, we use the function summarize():
- The summarize() function takes as first argument a grouped table, an output of group_by()
- Then, it takes one additional argument for each new column with summarized data you want to add.
- The format of this arguments is New column name, followed by the equal sign =, finished by the function that will calculate a summary statistic that we want to obtain for each group.
- We can apply all the usual summary statistics functions (mean, median, min, max, sum), but a very useful one that is unique to the package dplyr is n()
The function n() counts the number of rows for each grouping. It is a special function that only works within dplyr functions.
For example, to get the abundance of individuals per year, we will use the function n() as follows:

surveys_by_year <- group_by(surveys, year)
year_counts <- summarize(surveys_by_year, abundance = n())

We can also count the number of individuals in each plot per year:

surveys_by_plot_year <- group_by(surveys, plot_id, year)
plot_year_counts <- summarize(surveys_by_plot_year, abundance = n())

Just like with other dplyr functions, we could write this using pipes instead

plot_year_counts <- surveys |>
  group_by(plot_id, year) |>
  summarize(abundance = n())

Let’s use other summary statistic functions that returns a single value from a vector; e.g., mean(), max(), min(), sum().
For example, we can calculate the number of individuals in each plot-year group and their average weight:

surveys |>
  group_by(plot_id, year) |>
  summarize(abundance = n(), avg_weight = mean(weight))

We get all NA in the “avg_weight” column :/
Remember: mean() (and max, min and sum) returns NA when the input vector has missing values (NAs)
To fix this we use the argument na.rm = TRUE:
```
mean(surveys$weight, na.rm = TRUE)
```

Now, in piped form:

surveys |>
group_by(plot_id, year) |>
summarize(abundance = n(),
          avg_weight = mean(weight, na.rm = TRUE))

We still get missing values (in the form of NaN) because there are species that have not been weighed, not even once! So summary() introduces new NA values.
We can filter() this using !is.na()

surveys |>
  group_by(plot_id, year) |>
  summarize(abundance = n(),
            avg_weight = mean(weight, na.rm = TRUE)) |>
  filter(!is.na(avg_weight))

Solo In-class Exercise

Exercise 5: Data aggregation

Use the group_by() and summarize() functions to get a count of the number of individuals in each species ID.
Use the group_by() and summarize() functions to get a count of the number of individuals in each species ID in each year.
Use the filter(), group_by(), and summarize() functions to get the mean mass of species DO in each year.

A minute feedback for class 16

Please provide some quick feedback for this session here.

Home exercises

Do at least one of the Shrub Volume Data Set exercises.

UC Merced -
Spring 2023

Lecture

Day 1: Data wrangling

Intro to the data set and class set up (15 min)

Setup your RStudio project

Loading and viewing the dataset

The `dplyr` R package

Installing vs loading packages

Selecting columns

Mutating data

Arranging (sorting) data

Filtering values

Filtering missing values

Solo In-class Exercise (30 min)

Day 2: Pipes

Setup your RStudio project

The usual analysis workflow: intermediate variables and nesting functions

Pipes

Assigning the output of a pipe

Joint In-class exercise

Homework exercise

Day 3: Pipes-continued

Review Visualization Homework Exercise 4 (15 min)

Review Data Wrangling Homework Exercise 3 (15 min)

Challenge of the day: Using the pipe shortcut

What if I want to pipe to an argument other than the first argument?

Solo In-class exercise

Data grouping (also called data aggregation)

Basic grouping

Summarizing data from groupings

Solo In-class Exercise

Home exercises

UC Merced - Spring 2023

Data wrangling

Lecture

Day 1: Data wrangling

Intro to the data set and class set up (15 min)

Setup your RStudio project

Loading and viewing the dataset

The dplyr R package

Installing vs loading packages

Selecting columns

Mutating data

Arranging (sorting) data

Filtering values

Filtering missing values

Solo In-class Exercise (30 min)

Day 2: Pipes

Setup your RStudio project

The usual analysis workflow: intermediate variables and nesting functions

Pipes

Assigning the output of a pipe

Joint In-class exercise

Homework exercise

Day 3: Pipes-continued

Review Visualization Homework Exercise 4 (15 min)

Review Data Wrangling Homework Exercise 3 (15 min)

Challenge of the day: Using the pipe shortcut

What if I want to pipe to an argument other than the first argument?

Solo In-class exercise

Data grouping (also called data aggregation)

Basic grouping

Summarizing data from groupings

Solo In-class Exercise

Home exercises

UC Merced -
Spring 2023

The `dplyr` R package