I’ve been gradually changing my workflows to be more in line with the principles of reproducible research. It has been life-changing – I can go back and run analyses from last year on a different computer and…it just works. It’s absolutely magical.
I’m not fully reproducible yet - I shudder to think what someone would do if they came across my Projects folder at this exact moment - but I’m getting closer. Here are the main reasons I’m going this route and the tools I’m using to do so.
There are lots of resources for reproducible research, but they are mostly overkill for my level. So, here’s what I’ve taken from the discussion.
What is reproducible research?
A reproducible workflow is one in which each step of the analytical process is clearly documented in such a way that someone — and here it is better to imagine that person is not you — can retrace your steps and verify the exact results that you presented. –Baumer BS. (2017) Lessons from between the white lines for isolated data scientists. PeerJ Preprints 5:e3160v2
Why do reproducible research?
- If I can retrace my steps, I can be more confident in my results.
- If other people can follow my steps, they will trust my results more.
- If other people can follow my steps, they can help me when I have problems.
- If other people can follow my steps, they can learn from me.
- If I can follow other people’s steps, I can learn from them.
- I want to be kind to “my future self”. I might get asked to rerun an old analysis or to pass on my projects to someone else. That would be much easier if it didn’t depend on my computer having a particular file structure or particular software.
My key principles of reproducible research (RR):
- Physically isolate raw data (different folder, server, etc.)
- Absolutely no raw data manipulation in Excel. It is (nearly) impossible to retrace point-and-click steps from Excel.
- No formulas. Ever.
- No sorting. Ever.
- Use automated methods to get data wherever possible (SQL, web scraping,
- Use “Restart R and Run All Chunks” (Session menu or inline Run menu) often. Your script should work from beginning to end in a fresh instance of R.
- Change your RStudio settings: do NOT restore .RData into workspace at startup, and never save workspace to .RData on exit.
Tired of saying "No" each time RStudio asks you if you want to save your workspace upon exit? You can tell RStudio to stop asking with Preferences > General > Save workspace to .RData on exit: Never. #rstats –Sharon Machlis (@sharon000), August 23, 2018
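The first principle, keeping raw data physically separate and never editing it by hand, can be sketched in a few lines of base R. The folder names `data-raw` and `data-tidy` and the file `counts.csv` are made up for this illustration, and a temporary directory stands in for a real project folder.

```r
# Hypothetical layout: raw files live in data-raw/ and are only ever read;
# everything derived from them is written to data-tidy/.
project  <- file.path(tempdir(), "rr-demo")   # stand-in for a real project folder
raw_dir  <- file.path(project, "data-raw")
tidy_dir <- file.path(project, "data-tidy")
dir.create(raw_dir,  recursive = TRUE, showWarnings = FALSE)
dir.create(tidy_dir, recursive = TRUE, showWarnings = FALSE)

# Simulate a raw file arriving from elsewhere (in practice: an export, a query, ...).
raw_file <- file.path(raw_dir, "counts.csv")
write.csv(data.frame(site = c("b", "a", "c"), count = c(3, 10, 7)),
          raw_file, row.names = FALSE)
raw_checksum <- unname(tools::md5sum(raw_file))

# All manipulation happens in code, on a copy held in memory.
counts <- read.csv(raw_file)
counts_sorted <- counts[order(counts$site), ]   # sorting is scripted, not point-and-click
write.csv(counts_sorted, file.path(tidy_dir, "counts_sorted.csv"), row.names = FALSE)

# The raw file is byte-for-byte unchanged after the analysis.
stopifnot(identical(unname(tools::md5sum(raw_file)), raw_checksum))
```

Because the sort lives in the script rather than in a spreadsheet, rerunning the file on updated raw data repeats exactly the same steps.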
The primary feature of RStudio Projects for me is that they let me keep several data projects going at once, each in its own instance of R.
The other key feature is that if all of your files are in the same directory, you can use relative paths instead of full paths.
Using full paths is a mess, and if I ever move my folder, everything breaks.
    library(readxl)
    mydata <- read_xlsx("C:/HD/MyFiles/MyProjects/BigDataProject/myfile.xlsx")
In a project, though, I can just use short filenames and they will continue to work indefinitely - opening the project sets the working directory to the project folder, so RStudio just finds them.
    library(readxl)
    mydata <- read_xlsx("myfile.xlsx")
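To see why relative paths survive a move, here is a small base-R sketch. The two temporary folders stand in for an old and a new project location, `myfile.csv` is a made-up CSV stand-in for the spreadsheet, `read_in_project()` is a hypothetical helper, and `setwd()` is used only to imitate what opening an RStudio project does to the working directory.

```r
# Hypothetical helper: read the data file with `dir` as the working directory,
# which is roughly the situation an opened RStudio project puts you in.
read_in_project <- function(dir) {
  old <- setwd(dir)
  on.exit(setwd(old))       # restore the previous working directory afterwards
  read.csv("myfile.csv")    # relative path: no drive letters, no user names
}

old_home <- file.path(tempdir(), "old-location", "BigDataProject")
new_home <- file.path(tempdir(), "new-location", "BigDataProject")
dir.create(old_home, recursive = TRUE, showWarnings = FALSE)
dir.create(new_home, recursive = TRUE, showWarnings = FALSE)

# Put the data file in the project, then imitate moving the project by copying it.
write.csv(data.frame(x = 1:3), file.path(old_home, "myfile.csv"), row.names = FALSE)
file.copy(file.path(old_home, "myfile.csv"), new_home, overwrite = TRUE)

before <- read_in_project(old_home)
after  <- read_in_project(new_home)
stopifnot(identical(before, after))   # same code, same result, different location
```

The reading code never mentions where the project lives, so nothing in it needs editing when the folder moves.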
In my workflow, those are the “laws”. However, there are a few more things that I think improve my coding style, making my analyses more reproducible by myself and hopefully others.
Key Style and Workflow Choices
- Filenames should be: (per Jenny Bryan)
- Machine readable 2018-12-09_bird-growth_tidy.rmd - This way I can search by date (2018), by project (bird-growth) or by document type (tidy.rmd)
- Human readable - project names and file types are clear.
- Plays well with default ordering - dates with year first sort properly. You can also do this with suffixes tidy1.rmd, tidy2.rmd (for sequential scripts - not versions of the same thing!)
- Punctuate - Use hyphens between words within a chunk, and underscores between chunks
- No spaces
- Names - I’m not great at this, but I’d like to consistently use “tidy.rmd, explore.rmd, analysis.rmd, report.rmd” as filenames (or, if small enough, all in one file)
- Variable names should be:
- Short, but meaningful (fishgrowth vs. myvar or mydf)
- I can’t ever seem to be consistent on using CamelCase vs. snake_case, but in theory you should pick one and stick with it.
- No spaces
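The payoff of the filename convention above is that names can be taken apart by code. A minimal sketch, assuming every file follows the date_project_type.ext pattern; `parse_filename()` and the two extra filenames are made up for illustration.

```r
# parse_filename() is a hypothetical helper, not part of any package.
parse_filename <- function(filename) {
  stem  <- tools::file_path_sans_ext(filename)
  parts <- strsplit(stem, "_", fixed = TRUE)[[1]]   # underscores separate chunks
  list(
    date    = as.Date(parts[1]),         # ISO dates parse without a format string
    project = parts[2],                  # hyphens stay inside a chunk
    type    = parts[3],
    ext     = tools::file_ext(filename)
  )
}

info <- parse_filename("2018-12-09_bird-growth_tidy.rmd")
info$project               # "bird-growth"
format(info$date, "%Y")    # "2018"

files <- c("2018-12-09_bird-growth_tidy.rmd",
           "2017-03-01_fish-growth_explore.rmd",
           "2018-01-15_bird-growth_report.rmd")

sort(files)                                            # year-first dates sort chronologically
files[grepl("_bird-growth_", files, fixed = TRUE)]     # search by project
```

A filename with spaces or inconsistent separators would need special handling at every one of these steps, which is the practical argument for the rules.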
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
–Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1-23.
Tidy data is easy to work with, easy to analyze, easy to understand. To get from messy data to tidy data (usually) requires a lot of steps. Scripting those steps ensures that I can do my “data munging” steps again if the raw data gets updated with new values. Depending on length, I either do these as a standalone code block in an R Notebook, or as a standalone file called tidy.rmd.
To tidy data, the tools in tidyr are invaluable. I wrote a post about tidying data.
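As a small illustration of what a tidying step can look like, here is a sketch using `pivot_longer()` from tidyr (version 1.0.0 or later). The data frame and its column names are invented for the example; they are not from a real project.

```r
library(tidyr)

# Made-up "messy" data: one row per fish, one column per measurement week.
messy <- data.frame(
  fish  = c("A", "B"),
  week1 = c(2.1, 1.8),
  week2 = c(2.9, 2.4),
  week3 = c(3.6, 3.1)
)

# Tidy version: each variable is a column (fish, week, length_cm)
# and each observation (one fish in one week) is a row.
fishgrowth <- pivot_longer(
  messy,
  cols      = c(week1, week2, week3),
  names_to  = "week",
  values_to = "length_cm"
)

fishgrowth
```

The two-row wide table becomes six rows, one per fish-week combination, and if the raw data gains a new fish the same script tidies it without any manual work.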
Here’s an example without the pipe. I had to create two extra intermediate variables, and it isn’t very readable.
    library(dplyr)

    just4cyl <- filter(mtcars, cyl == 4)
    four_cyl_with_ratio <- mutate(just4cyl, gcratio = gear / carb)
    clean_cyl_with_ratio <- select(four_cyl_with_ratio, hp, wt, gcratio)
And an example using the pipe. So much better, no? The verbs are lined up and it is clear what I did.
    library(dplyr)

    clean_cyl_with_ratio <- mtcars %>%
      filter(cyl == 4) %>%
      mutate(gcratio = gear / carb) %>%
      select(hp, wt, gcratio)
Especially with R Notebooks, adding descriptive info about your thought process is a breeze. Explain to your future self what you were trying to do and what kludgy workarounds you might have employed along the way.
If you find yourself doing the same thing over and over, write a function or save the code as a script. Hadley’s rule of thumb is to do so for anything you do more than twice, but weighing the pain of copying and pasting against the pain of writing a function, my own cutoff is closer to anything I do more than 10 times.
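As a sketch of what that looks like, the pipeline from the pipe example above can be wrapped in a function so the cylinder count becomes an argument instead of something to edit by hand in each copy. `ratio_for_cyl()` and its argument names are made up for illustration.

```r
library(dplyr)

# Hypothetical helper: the filter/mutate/select steps from the pipe example,
# with the number of cylinders turned into an argument.
ratio_for_cyl <- function(data, n_cyl) {
  data %>%
    filter(cyl == n_cyl) %>%           # keep cars with the requested cylinder count
    mutate(gcratio = gear / carb) %>%  # gear-to-carburetor ratio
    select(hp, wt, gcratio)
}

four_cyl  <- ratio_for_cyl(mtcars, 4)
six_cyl   <- ratio_for_cyl(mtcars, 6)
eight_cyl <- ratio_for_cyl(mtcars, 8)
```

If the ratio ever needs to change, there is now one place to fix it rather than three pasted copies.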
I only know Git/GitHub, which is baked into RStudio. Having regular backups and a history of my decisions is a lifesaver. Learning it has been a steep curve, and I’m sure I don’t really understand it. My basic habit is to commit a change at the end of every “complete thought”, whether that is an R code block that produces the output I like, some text that forms a complete thought, etc. I try to only commit when everything is working fine, so I can use each commit as a restore point.