Learning Objectives
This week, students will be able to:
- list the elements of the data life cycle
- articulate the relevance of good data management for scientific research
- identify the differences between good and bad data entry and management
- recognize bad data organization and why it is problematic for research
- implement quality assurance and control measures for data entry in spreadsheets using Excel
- list current measures used by the scientific community in ecology and evolution to preserve data long term
Day 1
Welcome!
- Introductions
- Why are you taking this class? What do you expect from it?
Logistics survey:
- Take 5 min to complete the survey on this Google Form.
Syllabus overview
- Go through the syllabus
- Do you have any questions about it?
- Choose grading scheme
- Choose office hours
Schedule overview
The schedule shows a list of topics that will be covered each week of the course.
The schedule reflects a flipped course structure, that is, minimal lecture and maximal student practice. Each row corresponds to a different topic and organizes links for pre-class activities, in-class activities, and homework.
- Before Class
- Prepare: Readings or activities to dive into the topic before class
- Lectures/live-coding
- Lecture notes used in class
- Not expected to be read in advance; may be useful for review
- May not match lecture precisely
- In-class activities
- Individually or jointly
- A challenge that helps you build your own mental model of the topic
- May require additional work time after class
- After-class
- Strengthen: Exercises to strengthen and evaluate concepts discussed during class
The importance of data for research: Exercises
- Data is used in all areas of human activity. Tim Stobierski writes about business analytics and the importance of knowing the data life cycle.
- Take 5-10 min to read 8 Steps in the Data Life Cycle, from Harvard Business School.
- Answer the next two questions interactively on this mentimeter:
- What is the importance of the data life cycle?
- How is it related to research?
- DataONE is an organization dedicated to making earth data universally accessible and FAIR (Findable, Accessible, Interoperable, and Reusable).
- Take 10 min to read The DataONE Life Cycle. If the link is not available, a screenshot of the text is available here.
- Answer the following questions interactively on this mentimeter:
- How many steps are there in the DataONE life cycle?
- Identify steps that are different between the previous and this data life cycle.
- Look at Figure 1 from the research paper Best practice data life cycle approaches for the life sciences.
- Take 5 min to identify steps that are similar across the three data life cycle approaches you have reviewed so far. Write the steps that you consider similar on this mentimeter.
In-class group activity
- Follow instructions on this jamboard.
Homework
- Create your own data life cycle on this jamboard.
A minute feedback
- Provide some quick feedback for this session here
Day 2
Reading discussion
- Before class, read the text Scientists Losing Data at a Rapid Rate
- Questions:
- Which research paper are they referring to in the text? Hint: take a look at the references.
- According to the research paper, how much does data availability drop per year?
- If contacted researchers say the data still exists, why is it considered lost?
- What percentage of contacted researchers answered back?
- How can authors organize and preserve data used in research papers?
- Is willingness to share data increasing or decreasing among researchers?
- On the role of journals and data repositories:
- “Some types of data, such as DNA sequences, must be submitted to a community-endorsed public repository”
- data repository websites (GenBank, GBIF, Dryad, Zenodo)
- “For other kinds of data, where public repositories are less developed, this is “strongly recommended”.”
- What are some differences between the Nature text you just read and the research paper they talk about (the Vines et al. paper)?
- What is a research paper?
Group exercise
- Work with your neighbor.
- Skim the paper Vines et al. (2014)
- What are the main sections of a research paper?
- Identify the following parts in the abstract:
- Reason for writing: What is the importance of the research?
- Problem: What problem does this work attempt to solve?
- Methodology: Includes specific models or approaches used in the larger study.
- Results: Includes specific data that indicate the results of the project.
- Implications: How does this work add to the body of knowledge on the topic?
Tidy Data Principles
- What: Tidy data is data that is well designed for processing with computers
- Why: Creating tidy data as you collect it will make it much easier to analyze it later
- Joint exercise: Improving Messy Data
- Goal: look at some messy data and think about what makes it messy and what we could do to improve it:
- make it a (narrow) rectangle
- one cell one value
- don’t confuse the computer
- be clear and consistent
- Use one table for each category of data
- Export data into easy to read formats
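The principles above can also be illustrated outside a spreadsheet. The following is a minimal Python sketch, not part of the class exercise: the table, the column names (`site`, `year`, `weight_g`), and the helper `tidy` are all hypothetical. It turns a wide table with units typed into the cells into a narrow rectangle with one value per cell, then exports it as CSV.

```python
import csv
import io

# Hypothetical messy "wide" table: one column per year, with units and
# a missing-value note typed directly into the cells.
messy_rows = [
    {"site": "A", "2019": "12 g", "2020": "15 g"},
    {"site": "B", "2019": "9 g", "2020": "missing"},
]


def tidy(rows, id_column="site"):
    """Reshape wide rows into long rows: one observation per row, one value per cell."""
    long_rows = []
    for row in rows:
        for column, cell in row.items():
            if column == id_column:
                continue
            # Keep only the first token so the unit is dropped; anything
            # that is not a number becomes an explicit missing value (None).
            parts = cell.split()
            text = parts[0] if parts else ""
            try:
                weight = float(text)
            except ValueError:
                weight = None
            long_rows.append(
                {"site": row[id_column], "year": int(column), "weight_g": weight}
            )
    return long_rows


tidy_rows = tidy(messy_rows)

# Export to CSV, a plain-text format that most tools can read.
# The csv module writes None as an empty cell.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["site", "year", "weight_g"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(tidy_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Putting the unit in the column name (`weight_g`) instead of in each cell is one way to keep "one cell one value" while staying clear and consistent.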
Steps for data entry
- What’s the first thing to do when you are ready to enter/collect data?
- Planning
- Where to enter data?
- Software
- Data quality assurance vs. data quality control
- Read Quality Assurance and Control to answer the following question: What is the difference between quality assurance and quality control?
- How to assure data quality?
- Do individual Exercise - Data Entry Validation in Excel.
- Important steps for quality control
- Saving a copy of the original raw data is key in this step.
- Sorting to check for invalid data, demo
- Ensure that data sorting is expanded to the whole data table, so data is not corrupted.
- Conditional formatting to scan data for outliers, Demo
- Use this cautiously; it might corrupt the data.
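For readers who want to see the same ideas outside Excel, here is a minimal Python sketch of the steps above. It is an illustration under stated assumptions, not the class demo: the records, the allowed species codes, the valid weight range, and the "three times the median" outlier rule are all hypothetical choices made for this example.

```python
import copy
import statistics

# Hypothetical entered records; columns and values are assumptions.
raw_records = [
    {"plot": 1, "species": "DM", "weight_g": 42.0},
    {"plot": 2, "species": "dm", "weight_g": 40.0},
    {"plot": 3, "species": "DO", "weight_g": 48.0},
    {"plot": 4, "species": "DM", "weight_g": 410.0},
    {"plot": 5, "species": "DO", "weight_g": 45.0},
]

ALLOWED_SPECIES = {"DM", "DO"}  # assumed list of valid codes


def validate(record):
    """Quality assurance: list the problems a validation rule would catch at entry."""
    problems = []
    if record["species"] not in ALLOWED_SPECIES:
        problems.append("species not in allowed list")
    if not 0 < record["weight_g"] < 500:  # assumed valid range
        problems.append("weight out of range")
    return problems


invalid = [(r["plot"], validate(r)) for r in raw_records if validate(r)]

# Quality control: work on a copy so the original raw data stay untouched.
working = copy.deepcopy(raw_records)

# Sort whole rows (not a single column) so values stay attached to their plot.
working.sort(key=lambda r: r["weight_g"])

# Flag, but do not change, values far from the median (arbitrary threshold).
median = statistics.median([r["weight_g"] for r in working])
outliers = [r["plot"] for r in working if r["weight_g"] > 3 * median]

print("invalid at entry:", invalid)
print("flagged for review:", outliers)
```

In this example the lowercase `dm` is caught by the entry rule, while the weight of 410 passes the range check and is only flagged during quality control. Flagged values are left as entered, so a person can decide whether each one is a typo or a real measurement.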
Best practices for data collection
- Which steps of the data life cycle did we address today?
- collection
- assurance
- processing
- management
A minute feedback
- Please provide some quick feedback for this second session of the course here.
Homework
- Complete the homework activities.