Learning Objectives
- understand and explain why coding is advantageous for data management
- learn principles to format R code for readability and clarity
- practice adding comments and breaks to R code
- call R scripts from within R scripts (sourcing)
- properly organize an R project and workspace with RStudio
- read and write tables
Day 1: Intro to R and RStudio
Review
- What is a GUI?
- In computer paths, what are the two meanings of the forward slash `/`?
- What is programming?
- Programming means writing a set of rules that instruct a computational device (such as a computer) to perform a set of tasks.
- Programming is a key tool in data science
- There are many different programming and scripting languages; they all have different strengths.
- What is a CLI? Which CLIs have we used so far?
- R survey results
Intro
- What is R?
- R is also a CLI
- R is software that interprets the R language
- R is a programming and a scripting language
- Started as a statistics and data analysis environment
- But can also build websites, run simulations, build books, build interactive apps, …
- What is a script?
- A file that contains (stores) a sequence of written commands (instructions) that is executed (run, carried out) by another program (not the computer itself).
- What are the advantages of scripts VS point-and-click workflows?
- No need to remember the workflow
- Rerun the workflow any time
- Others can inspect your workflow
- Get feedback and improve
- Deeper and better understanding of the workflow
- What is reproducibility?
- The ability of someone else (including your future self) to obtain the same results from the same dataset when doing the same analysis.
- What qualities of R make it good for reproducibility?
- It allows saving scripts from an analysis workflow
- With R Markdown, it generates static documents displaying code, results, and graphics (as PDF or HTML)
- It is free, open-source and cross-platform
- It has a large community of users
- What else is good about R?
- It is interdisciplinary and extensible
- It creates high-quality graphics
- R works on data of all shapes and sizes
Knowing your way around R and RStudio
Expressions and variables/objects
- What is RStudio?
- IDE - Integrated Development Environment
- Makes developing code in R easier by including a number of tools in one place
- What are the 4 panes of RStudio?
- The R console: like the terminal, it is a CLI that directly executes any instruction we write.
- In R, commands are called expressions
- Demo: `pwd` in the terminal
- The R expression `getwd()`
- Demo: `2 + 2` on the R console and in bash
- Demo: other calculations in R with unit conversion
- Exercise: convert the weight of a lab mouse (20 grams) to pounds
- Consider 1 kg = 2.2 lbs
- The scripts pane
- Creating a script in R
- Demo: running `2 + 2` from a script
- Adding comments to a script with `#`
- Saving a script
- Joint exercise: Basic Expressions
- The Files/Plots/Packages/Help/Viewer pane
- It shows the file structure
- It shows the help
- Other ways of getting help: `?` and `??`
- The environment/history pane
- What is the R workspace?
- It is the current working environment in R
- A working environment is a temporary space in your computer’s RAM that “disappears” at the end of the R session
- It includes any user-defined objects
- What is an object in R?
- It is a name with a value associated to it.
- It is a variable; an instance of a class
- It is a data structure having some attributes and methods which act on its attributes.
- How to create objects in R?
- The `<-` assignment operator
- `weight_kg <- 20` executes the code and creates the object/variable
- `(weight_kg <- 20)` also prints the value of the variable to the screen
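The two forms of assignment above can be tried side by side. This is a minimal sketch that reuses `weight_kg` from the notes; the final multiplication is only an illustration and is not part of the original demo.

```r
# Assign without printing: the object is created silently
weight_kg <- 20

# Wrapping the assignment in parentheses creates the object
# and also prints its value to the console
(weight_kg <- 20)

# Typing the name of an object prints its current value
weight_kg

# Objects can be used inside expressions (illustrative only);
# this does not change the value stored in weight_kg
weight_kg * 2
```

After running these lines, `weight_kg` should appear in the Environment pane of RStudio.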
- Solo/joint exercise Basic Variables
- Homework exercises More variables and Variable names.
- It is good practice to never save the R workspace and always start sessions fresh
- Demo: How to never save the workspace in R
- What is the R history?
- You can access the history with the up and down arrows
- Also with the function `history()`
RStudio projects
- What is the working directory in R?
- Using a dedicated working directory is one of many best practices for data science and reproducibility
- It is a self-contained folder storing a set of related data, analyses, and text together
- It is the “place” where R looks for files to read and saves any files it writes
- Allows writing code that only relies on files within the working directory folder
- It allows using relative paths instead of absolute paths
- Why are relative paths good for reproducibility?
- They allow you to move your project around on your computer and share it with others without worrying about whether or not the underlying scripts will still work
- In RStudio, working directories are set up with projects
- Demo: setting up an RStudio project
- Run `getwd()` again
- The standard organization of a working directory:
- a `data` folder
- a `data-raw` folder
- a `scripts` folder
- a `figs` folder
- a README file
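The folder layout listed above can also be created from R itself. The following is a rough sketch, not the method used in class (the demo uses the RStudio interface): the project name `my-project` is hypothetical, and the project is placed inside `tempdir()` as an assumption so that the example does not touch any real files.

```r
# Hypothetical project location inside R's temporary directory
# (an assumption, chosen so the sketch leaves your own folders alone)
project_dir <- file.path(tempdir(), "my-project")
dir.create(project_dir, showWarnings = FALSE)

# The standard folders described in the notes
folders <- c("data", "data-raw", "scripts", "figs")
for (folder in folders) {
  dir.create(file.path(project_dir, folder), showWarnings = FALSE)
}

# An empty README file to be filled in during the exercise
file.create(file.path(project_dir, "README.txt"))

# Check what was created
list.files(project_dir)
```

In a real RStudio project you would replace `project_dir` with paths relative to the project folder, for example `dir.create("data")`.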
Exercise: the README file
- Start a README file describing the folders you just created in your project.
- Save it with the .txt extension
- The README must have at least two sections:
- General Information
- A title for the project
- Name/institution/address/email information for person responsible for the project
- Date that the project was started
- Keywords describing the project
- Programming language(s) used
- Funding sources
- License (read “What is the most appropriate license for my data?”)
- File overview
- Write down the name and a description for each folder in your project
- For more information on README best practices, read Guide to writing “readme” style metadata from the Research Data Management Service Group at Cornell University.
Basics of functions
- functions and their arguments
- similarities and differences of R functions with shell commands
- name of function
- options and arguments go inside parentheses instead of being separated by blank spaces
sqrt(25)
- variables can be passed as arguments
weight <- 121.38
sqrt(weight)
- The `str()` function shows the type of a value or object/variable
- All values (and therefore all variables) have types
str(weight)
- How many arguments can a function take?
- Functions can take multiple arguments.
- Round `weight` to one decimal place
- Typing `round()` shows there are two arguments
- The number to be rounded and the number of digits
round(weight, 1)
- Functions return values, so as with other values and expressions, if we don’t save the output of a function then there is no way to access it later
- Functions do not change the value of a variable
- For example, looking at `weight` we see that it hasn’t been rounded
weight
- To save the output of a function we assign it to an object/variable.
weight_rounded <- round(weight, 1)
weight_rounded
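The points above about arguments, return values, and saving output can be collected into one short script. This is a minimal sketch using the same `weight` value as the notes; naming the `digits` argument explicitly is an optional variation on the call shown above.

```r
weight <- 121.38

# A function call: the argument goes inside the parentheses
sqrt(weight)

# str() reports the type of the value stored in the variable
str(weight)

# round() takes two arguments: the number and the number of digits
round(weight, digits = 1)

# The call above did not change weight; it still holds the original value
weight

# To keep the rounded value, assign the output to a new object
weight_rounded <- round(weight, 1)
weight_rounded
```

Running the script line by line makes it easy to see that `weight` only changes when something is assigned to it.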
- Exercise Built-in functions
A minute feedback for class 5
- Please provide some quick feedback for this session here
Day 2: Intro to vectors
Review
- bash shell `echo` and `print` commands
- The exclamation mark `!` command has many uses:
- repeating the last command
- searching the beginning of a command in the history
- reusing the last argument (such as a file path) from the previous command with `!$`
- starting a shell script with `#!/usr/bin/bash`
- escaping `!` to use it as a character string
- Difference between `>` and `>>`
- Exercise:
echo "hello" > hello.txt
nano hello.txt
echo "hello" >> hello.txt
nano hello.txt
echo "leopard" > hello.txt
Basics of Vectors
- All values in R have a basic type: numeric, logical, character, integer
- A vector is a sequence of values that all have the same type
- numeric vectors:
- We can use the colon `:` operator to create sequences of numbers
1:3
- With the `c()` function we can combine numbers in any order we want
c(10, 1, 8) # arbitrary order
- But with `:` we can create sequences as long as we want with just a few keystrokes:
1:10
1:100
1:4589567
- The function `seq()` creates sequences with any step we specify (not only 1 as with `:`)
seq(from = 1, to = 100, by = 2)
seq(from = 1, to = 100, by = 0.5)
- We can start numeric sequences at any number, in reverse order, and using negative numbers,
- with the `:` operator:
15:20
100:50
-100:50
5:-5
- and with `seq()` (pay attention to the sign of the step, the `by =` argument)
seq(15, 20)
seq(100, 50, -2)
seq(-100, 50, 2)
- We can use numeric vectors to calculate common summary statistics
- For example, if we have a vector of population counts
count <- c(9, 16, 3, 10)
mean(count)
max(count)
min(count)
sum(count)
summary(count)
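The ways of building numeric vectors described above can be compared in a single script. This is a small sketch; the object names `short_seq`, `odd_numbers`, and `countdown` are made up for illustration, while `count` is the vector from the notes.

```r
# Three ways to build a numeric vector
short_seq   <- 1:3                              # colon operator, step of 1
any_order   <- c(10, 1, 8)                      # c() combines values in any order
odd_numbers <- seq(from = 1, to = 100, by = 2)  # seq() with a custom step
countdown   <- seq(100, 50, -2)                 # a negative step counts down

# length() tells us how many values each sequence holds
length(odd_numbers)
length(countdown)

# Summary statistics on a vector of population counts
count <- c(9, 16, 3, 10)
mean(count)
max(count)
min(count)
sum(count)
```

Note that `seq(100, 50, 2)` would fail with an error, because a positive step can never get from 100 down to 50.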
- character vectors:
- Created using the `c()` function, which stands for “combine”
states <- c("FL", "FL", "GA", "SC")
- Many functions in R take a vector as input and return a value
- The `str()` function shows that this is a vector of 4 character strings
str(states)
- Other useful functions to explore the structure of an object/variable:
- `typeof()`
- `class()`
- `length()`, which determines how many items are in a vector
- `head()`, `tail()`, and `View()`
- subsetting vectors
- Select pieces of a vector by slicing the vector (like slicing a pizza)
- Use square brackets `[]`
- In general `[]` in R means “give me a piece of something”
- `states[1]` gives us the first value in the vector
- `states[1:3]` gives us the first through the third values
- `1:3` works by making a vector of the whole numbers 1 through 3
- So `states[1:3]` is the same as `states[c(1, 2, 3)]`
- You can use a vector to get any subset or order you want
states[c(4, 1, 3)]
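The slicing rules above can be checked directly on the `states` vector. This is a minimal sketch; the `identical()` call is an addition for illustration and was not part of the original demo.

```r
states <- c("FL", "FL", "GA", "SC")

# Explore the vector before slicing it
length(states)   # number of items
class(states)    # type of the values

# Slice by position with square brackets
states[1]        # first value
states[1:3]      # first through third values

# 1:3 is itself a vector, so both forms select the same values
identical(states[1:3], states[c(1, 2, 3)])

# Any vector of positions works, in any order
states[c(4, 1, 3)]
```

The last line returns the fourth, first, and third values in that order, which shows that slicing can reorder as well as subset.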
- logical vectors:
- can be created with `c()`
- can also be created with relational operators: equal to `==`, greater than `>`, less than `<`, not equal to `!=`
- Some examples of relational statements are:
1 == 1
1 == 2
1 != 2
1 > 2
1 > 1
1 >= 1
1 < 2
1 <= 2
"A" == "A"
"A" == "a"
"A" != "a"
- we can compare a longer vector and a shorter vector
1:10 == 7
- this returns a vector of length equal to the longer vector, because the shorter vector is reused for each element
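The comparison between a longer and a shorter vector can be inspected step by step. This is a small sketch; the names `flags` and `is_seven` are hypothetical and chosen only for this illustration.

```r
# A logical vector built by hand with c()
flags <- c(TRUE, FALSE, TRUE)
class(flags)

# A logical vector built with a relational operator:
# every element of 1:10 is compared against the single value 7
is_seven <- 1:10 == 7
is_seven

# The result has one TRUE/FALSE per element of the longer vector
length(is_seven)

# Comparisons of character strings are case-sensitive
"A" == "a"
"A" != "a"
```

Only the seventh position of `is_seven` is `TRUE`, since 7 is the only value in `1:10` that equals 7.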
Do Basic Vectors.
Null values
- So far we’ve worked with vectors that contain no missing values
- But most real world data has values that are missing for a variety of reasons
- For example, kangaroo rats don’t like being caught by humans and are pretty good at escaping before you’ve finished measuring them
- Missing values, known as “null” values, are written in R as `NA` with no quotes, which is short for “not available”
- So a vector of 4 population counts with the third value missing would look like
count_na <- c(9, 16, NA, 10)
- If we try to take the mean of this vector we get `NA`
mean(count_na)
- Hard to say what a calculation including `NA` should be
- So most calculations return `NA` when `NA` is in the data
- Can tell many functions to remove the `NA` before calculating
- Do this using an optional argument, which is an argument that we don’t have to include unless we want to modify the default behavior of the function
- Add optional arguments by providing their name (`na.rm`), `=`, and the value that we want those arguments to take (`TRUE`)
mean(count_na, na.rm = TRUE)
- Relational operations with `NA`
NA > 3 # returns NA because we don't know if the missing value is larger than 3 or not
NA == NA # also NA: we have two missing values but the true values could be quite different, so the correct answer is "I don't know."
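The behavior of `NA` described above can be seen by running the same calculation with and without the optional argument. This is a minimal sketch using the `count_na` vector from the notes; using `na.rm` with `sum()` as well as `mean()` is an extra illustration.

```r
# Four population counts, with the third one missing
count_na <- c(9, 16, NA, 10)

# By default the missing value propagates to the result
mean(count_na)

# na.rm = TRUE drops the missing value before calculating,
# so the mean is taken over the three known counts
mean(count_na, na.rm = TRUE)
sum(count_na, na.rm = TRUE)

# Comparisons involving NA are also NA: the answer is unknown
NA > 3
NA == NA
```

With `na.rm = TRUE` the mean is computed from 9, 16, and 10 only, so it is 35 divided by 3 rather than by 4.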
Do Nulls in Vectors.
Working with multiple vectors
- Build on example where we have information on states and population counts by adding areas
states <- c("FL", "FL", "GA", "SC")
count <- c(9, 16, 3, 10)
area <- c(3, 5, 1.9, 2.7)
Vector math
- We can divide the count vector by the area vector to get a vector of the density of individuals in that area
density <- count / area
- This works because when we divide vectors, R divides the first value in the first vector by the first value in the second vector, then divides the second values in each vector, and so on
- Element-wise: operating on one element at a time
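Element-wise division can be verified by comparing the vector result with the same calculation done by hand for one position. This is a small sketch using the vectors from the notes; the manual check on the first element is added for illustration.

```r
states <- c("FL", "FL", "GA", "SC")
count <- c(9, 16, 3, 10)
area <- c(3, 5, 1.9, 2.7)

# Divide position by position: count[1] / area[1], count[2] / area[2], ...
density <- count / area
density

# The first element is the same as dividing the first values by hand
count[1] / area[1]
density[1]

# The result has one density per site, like the input vectors
length(density)
```

Because the operation works position by position, the two vectors should have the same length for the result to be meaningful.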
Filtering
- Subsetting or “filtering” is done using `[]`
- Like with slicing, the `[]` say “give me a piece of something”
- Selects parts of vectors based on “conditions”, not position
- Get the density values in Florida
density[states == 'FL']
- `==` is how we indicate “equal to” in most programming languages
- Not `=`. `=` is used for assignment.
- Can also do “not equal to”
density[states != 'FL']
- Numerical comparisons like greater or less than
- Select states that meet some condition on density
states[density > 3]
states[density < 3]
states[density <= 3]
- Can subset a vector based on itself
- If we want to look at the densities greater than 3, `density` is both the vector being subset and part of the condition
density[density > 3]
- Multiple vectors can be used together to perform element-wise math, where we do the same calculation for each position in the vectors
- We can also filter the values in a vector based on the values in another vector or itself
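Filtering is easier to follow when the condition is looked at on its own before it is used inside the square brackets. This is a minimal sketch that rebuilds the vectors from the notes; the intermediate object `is_florida` is hypothetical and added only to show the logical vector that does the selecting.

```r
states <- c("FL", "FL", "GA", "SC")
count <- c(9, 16, 3, 10)
area <- c(3, 5, 1.9, 2.7)
density <- count / area

# The condition is a logical vector with one value per site
is_florida <- states == "FL"
is_florida

# Only the positions where the condition is TRUE are kept
density[is_florida]
density[states != "FL"]

# Filter one vector based on another
states[density > 3]
states[density <= 3]

# Filter a vector based on itself
density[density > 3]
```

The first Florida site has a density of exactly 3, so it is kept by `density <= 3` but not by `density > 3`.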
Homework: Do Shrub Volume Vectors exercise.
A minute feedback for class 6
- Please provide some quick feedback for this session here