Day 1: Joining data tables

Learning Objectives This week, students will learn to: - Explain the importance of joining multiple data tables. - Use the `dplyr` functions that join data tables. - Understand why data is dropped when joining tables - Use pipes to join more than two data tables - Use the ``%in%`` operator to find matching column names in two data tables **Practice Objectives** This week, students will practice: - Use of relational and logical statements to filter data tables - Handling missing values with `is.na()` and `na.rm =` - pipeline placeholders **Non Objectives** - `which()` - `match()`

Quizz (10 min)

Answer the following questions on the etherpad:

  1. What type of output does the summarize() function from dplyr give, a table or a vector?
  2. When using the summarize() function, does the output table has the same columns as the input data table or different?
  3. Mention 3 functions from the dplyr package that return tables with (at least) the same columns from the input table?

Setup your RStudio project (15 min)

Why do we need to join data tables? (5 min)

Joining two data Tables (5 min)

Exercise 1 (5 min)

Do the following calculations using a single pipe of code (no nested nor intermediate variables):

Check dropped data

Finding shared column names (colnames()) between tables (5 min)

Exercise 2 (5 min)

  1. Use intersect() to display the column name that is shared between the plots table and the surveys table. Use that column name for the next question.
  2. Do the following using a single pipe of code (no nested code nor intermediate variables):
    • Use function left_join() and filter() to get a data frame with the information from the surveys and plots tables where the "plot_type" is "Rodent Exclosure".

Joining two or more data Tables (5 min)

Exercise 3 (10 min)

We want to do an analysis comparing the size of individuals on the "Control" plots to the "Long-term Krat Exclosures".

A minute feedback for class 17

Homework



Day 2: Joining data vectors

Setup Your RStudio Project

A Relationship Between Data Frames and Vectors

Creating vectors

Review - Examples of numeric vectors: - with a single number ```r a_number <- 1 ``` - with two or more numbers, we can use the _concatenate_ function `c()` ```r c(1, 2, 3) # in order ``` - We can use the _colon_ `:` operator to create sequences of numbers ```r 1:3 ``` - With the `c()` function we can add numbers in any order we want ```r c(10, 1, 8) # random order ``` - But with `:` we can create sequences as long as we want to with just a few key strokes: ```r 1:10 1:100 1:4589567 ``` - The function `seq()` creates sequences with any step we specify (not only 1 as with `:`) ```r seq(from = 1, to = 100, by = 2) seq(from = 1, to = 100, by = 0.5) ``` - We can start numeric sequences at any number, in reverse order, and using negative numbers, - with the `:` operator: ```r 15:20 100:50 -100:50 5:-5 ``` - and with `seq()` (pay attention to the sign of the step (`by = ` argument)) ```r seq(15, 20) seq(100, 50, -2) seq(-100, 50, 2) ```
### Creating data frames from Vectors * The `data.frame()` function joins vectors into a single data frame * Each argument we provide will be a column in the data frame (just like in `mutate()` and `summarize()`!) * The arguments are taken as follows: - The name of the column we want in the data frame, - an equal sign `=`, and - the vector whose values we want in that column. * So we give it the arguments `sites = `, and `density = ` ```r density_data <- data.frame(sites = c("a", "a", "b", "c"), density = c(2.8, 3.2, 1.5, 3.8)) ``` * If we look in the **Global Environment** tab we can see that there is a new data frame called `density_data`, and that it has our two vectors as columns. * We could also make this data frame using the vectors that are already stored in variables: ```r sites <- c("a", "a", "b", "c") density <- c(2.8, 3.2, 1.5, 3.8) density_data <- data.frame(sites = sites, density = density) ``` * We can also add columns to the data from that only include a single value without first creating a vector * We do this by providing a name for the new column, an equals sign, and the value that we want to occur in every row * For example, if all of this data was collected in the same year and we wanted to add that year as a column in our data frame we could do it like this: ```r density_data_year <- data.frame(year = 2000, sites = sites, density = density) ``` * `year =` sets the name of the column in the data frame * And 2000 is that value that will occur on every row of that column * If we run this and look at the `density_data_year` data frame we'll see that it includes the year column with the value `2000` in every row ### Exercise 6 You have data on the length, width, and height of 10 individuals of the yew *Taxus baccata* stored in the following vectors: ```r length <- c(2.2, 2.1, 2.7, 3.0, 3.1, 2.5, 1.9, 1.1, 3.5, 2.9) width <- c(1.3, 2.2, 1.5, 4.5, 3.1, NA, 1.8, 0.5, 2.0, 2.7) height <- c(9.6, 7.6, 2.2, 1.5, 4.0, 3.0, 4.5, 2.3, 7.5, 3.2) ``` - Make a data frame that contains these three vectors as columns along with a `"genus"` column containing the genus name *Taxus* on all rows and a `"species"` column containing the species epithet *baccata* on all rows. --- --- ### Extracting values from vectors and data frames #### Extracting vectors from data frames * There are several ways to extract a vector from a data frame * Let's look at these using the Portal data * We'll start by loading the `surveys` table into R ```r surveys <- read.csv("surveys.csv") ``` * One common approach to extracting a column into a vector is to use the `[]` * Remember that `[]` also mean "give me a piece of something" * Let's get the `species_id` column * `"species_id"` has to be in quotes because we we aren't using `dplyr` ```r surveys["species_id"] ``` * This actually returns a one column data frame, not a vector * To extract a single column as a vector we use two sets of `[]` * Think of the second set of `[]` as getting the single vector from inside the one column data frame ```r surveys[["species_id"]] ``` * We can also do this using `$` * The `$` in R is short hand for `[[]]` in cases where the piece we want to get has a name * So, we start with the object we want a part of, our `surveys` data frame * Then the `$` with no spaces around it * and then the name of the `species_id` column (without quotes, just to be confusing) ```r surveys$species_id ``` ### Exercise 7 Using the Portal data `surveys` table ([download a copy](https://ndownloader.figshare.com/files/2292172) if it's not in your working directory): 1. Use `$` to extract the `weight` column into a vector called `surveys_weight` 2. Use `[]` to extract the `month` column into a vector called `surveys_month` 3. Extract the `hindfoot_length` column into a vector and calculate the mean hindfoot length ignoring missing values. ### Extracting Values from Vectors ```r letters[10] # indexing the 10th letter of the alphabet letters[1:3] # getting the first three letters abc <- letters[c(1,2,3)] # creating a vector of the first three letters of the alphabet letters[3:1] # letters[-1] letters[-1:5] ``` ### Overwriting values in vectors and data frames ### Summary * A data frame is a set of equal length vectors * We can extract a column of a data frame into a vector using either `$` or two sets of square brackets `[[]]` * We can combine vectors into data frames using the `data.frame()` function, which takes a series of arguments, one vector for each column we want to create in the data frame. **A minute feedback for class 18** - Please provide some quick feedback for this session [here](https://docs.google.com/forms/d/e/1FAIpQLSfkCzaYH7qgLgeF9b9KLrCZJSfVmu0AYGLZo2MINBSN04B2tA/viewform). ### Homework - Do 2 exercises from [Shrub volume data set - part 2](/spring2023-data-science/assignments/assignment_data_tables_2/).