This is an example of a messy dataset I might receive that needs tidying. Interns measured the length and dorsal girth (measurement around the middle) of several fish over time. Each time they measured the fish, they added a few more lines to the excel sheet for that date. This made data entry very convenient, but isn’t a usable format for data analysis.
It’s not a problem, though, because we have the Tidyverse!
Our goal is to produce a tidy dataset that we can use. Perhaps we want to graph the growth of the fish over time. Or maybe we want to see if the fish differ in their len:girth ratio in some way. Perhaps we want to see whether there are seasonal changes in size. Look at those variables I just mentioned. Fish, Time, Length, Girth, L:G ratio, Season. Those are things we want to be columns.
So, our goal is to tidy the data by: - making a new column for FishID (instead of the 4 columns there now) - spreading the column for measurement into separate columns for length and girth - cleaning up missing data
So, first we start by loading some basic packages to get going.
library(readxl) library(tidyverse) #Read in our mock raw data file Morphs <- read_xlsx("../attachments/FishMorphometrics.xlsx", col_types = c("date", "text", "numeric","numeric", "numeric", "numeric"), na = "NA")
Now we are ready to tidy! This is what I came up with. It’s a little cludgy, but it works, and it illustrates the Tidyverse grammar nicely (which is really the point here, after all).
MorphTidy <- Morphs %>% #Rename the date column because (1) I hate all caps and (2) `Date` is a reserved word that can sometimes be problematic if used as a variable name. Best to rename it something like `TheDate` or `DateMeas`. rename(MeasDate = DATE) %>% #Gather the fish into a single column gather(starts_with("Fish"), key = FishID, value = Meas ) %>% #Add row numbers to avoid "duplicate identifiers" error. Each row needs to have unique info, but in this case some of the fish/date lines have the same info. mutate(i = row_number()) %>% spread(key = Measure, value = Meas) %>% #After spreading we don't need the identifier anymore. select(-i) %>% #The various measurements are now in different rows. This code gets them to be in the same row and then removes any dates where fish weren't measured. arrange(FishID, MeasDate) %>% group_by(FishID, MeasDate) %>% summarize(len = mean(Length, na.rm = TRUE), dg = mean(DG, na.rm = TRUE)) %>% filter((len + dg) > 0) %>% mutate(len = ifelse(len == 0, NA, len), dg = ifelse(dg == 0, NA, dg)) %>% arrange(FishID, MeasDate) #If desired, save the cleaned file for future work #library(writexl) #write_xlsx(MorphsTidy, "morphstidy.xlsx")
Now, if we want, we can graph it.
Looking at changes in length over time…
ggplot(MorphTidy, aes(x = MeasDate, y = len, color = FishID)) + geom_point()
Looking at changes in girth over time…
ggplot(MorphTidy, aes(x = MeasDate, y = dg, color = FishID)) + geom_point()
Looking at differences in length:girth ratio…
MorphTidy2 <- MorphTidy %>% mutate(lgr = len/dg) ggplot(MorphTidy2, aes(x=FishID, y = lgr, color = FishID)) + geom_boxplot()
Looking at seasonal changes…
library(lubridate) MorphTidy3 <- MorphTidy %>% mutate(month = month(MeasDate, label = TRUE)) p <- ggplot(MorphTidy3, aes(x = month, y = dg)) + geom_boxplot() q <- ggplot(MorphTidy3, aes(x = month, y = len)) + geom_boxplot() library(gridExtra) grid.arrange(p, q, ncol = 2)