The purpose of this segment is to share some of our favorite tools for working with data in R. Feel free to integrate these tools into your own workflows, and ask your peers which tools they can’t live without, too!
here: no more getting lost in file paths
Use this package when… all the time. Make it a part of your regular coding routine. It’s that good.
As we introduced in the previous module, here is an excellent package that’s worth getting to know because it lets you use relative rather than absolute pathnames. This simplifies importing and exporting files, as well as sharing them with others.
install.packages("here")
library(here)
## here() starts at C:/Users/sbrei/Documents/R_Projects/Collabs/BGSS_Retreat_2021
here::here() # once you call the package with the library(here) call, you can use this function to remind you where your project root begins.
## [1] "C:/Users/sbrei/Documents/R_Projects/Collabs/BGSS_Retreat_2021"
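To make the benefit concrete, here is a minimal sketch of building a file path with here(). The folder name data and the file name measurements.csv are hypothetical and only serve as an illustration, so the read.csv() call is left commented out.

```r
library(here)

# Build a path by joining the pieces onto the project root.
# "data" and "measurements.csv" are hypothetical names for illustration.
csv_path <- here("data", "measurements.csv")
csv_path

# Once such a file exists, the same line works on any collaborator's machine:
# measurements <- read.csv(csv_path)
```

Because the path is built from the project root rather than typed out in full, nobody has to edit it when the project folder moves.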
Artwork by Allison Horst. setwd(), be gone!
magrittr: these pipes will make your work flow
Use this package when… you want to make your code more readable. The magrittr package has a collection of operators called pipes that, as stated on their webpage, work by:
- "structuring sequences of data operations left-to-right (as opposed to from the inside and out),
- avoiding nested function calls,
- minimizing the need for local variables and function definitions, and
- making it easy to add steps anywhere in the sequence of operations."
install.packages("magrittr")
library(magrittr)
library(dplyr) # we'll use some functions from this package to highlight the magrittr pipes
Also load a sample dataset:
starwars <- dplyr::starwars
magrittr has several types of pipes, which can all be useful in different contexts. The main pipe is the most common, but the others can be super handy too.
The basic pipe looks like this: %>%.
We’ll frequently set up our code like this:
new_object <- old_object %>% function()
Add a pipe after each function to connect them into one cohesive operation. This means that you take your initial object (the one you want to apply functions to) and apply a function that changes it. You then “pipe” that changed object into the next function (it flows in as the new object to be modified), apply another function to it, and so on.
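As a small sketch of that flow, the example below runs the same three steps twice, once as nested calls and once with pipes. The heights vector is made up for illustration.

```r
library(magrittr)

# Made-up values for illustration
heights <- c(172, 167, 96, 202)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(heights)), 1)

# The same steps with pipes are read left to right
piped <- heights %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)  # TRUE
```

Each pipe passes the result on its left as the first argument of the function on its right, which is why round(1) here means round(x, 1).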
Without pipes, we end up having to define three objects.
# select only the first 4 columns of dataset
character_traits <- starwars[, 1:4]
# select only characters that are at least 100 cm tall (which() skips rows with a missing height)
tall_characters <- character_traits[which(character_traits$height >= 100), ]
# order characters by height
tall_characters_sorted <- tall_characters[order(tall_characters$height), ]
With pipes, we only have to define one object. If we kept on adding functions with pipes, we wouldn’t have to add any more objects, either.
tall_characters_sorted <- starwars %>%
  # select only the first 4 columns of dataset
  dplyr::select(1:4) %>%
  # select only characters that are at least 100 cm tall
  dplyr::filter(height >= 100) %>%
  # order characters by height
  dplyr::arrange(height)
This pipe (%<>%) is helpful for situations where you’d rather overwrite your original object. Be careful!
Instead of writing old_object <- old_object %>% function(), like this:
character_traits <- character_traits %>%
  # keep only characters with black hair
  dplyr::filter(hair_color == "black")
Do this instead: old_object %<>% function() (notice that the pipe has two arrows inside it now)
character_traits %<>%
  # keep only characters with black hair
  dplyr::filter(hair_color == "black")
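To see the overwrite happen on something small, here is a minimal sketch with a made-up numeric vector. The object is replaced in place, which is exactly why this pipe deserves care.

```r
library(magrittr)

# Made-up values for illustration
masses <- c(4, 9, 16)

# Equivalent to: masses <- masses %>% sqrt()
masses %<>% sqrt()

masses  # 2 3 4
```

After this call the original values are gone, so it is worth keeping a copy if you might need them again.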
Use this pipe (%T>%) when you want to trigger a “side effect” of an expression, such as printing or plotting, and still pass the original value along. For example, use this when you want to see what your plot looks like without writing a separate command on another line, like this:
plot1 <- ggplot2::qplot(character_traits$height, character_traits$mass)
plot1
Write it this way, instead:
plot1 <- ggplot2::qplot(character_traits$height, character_traits$mass) %T>%
plot()
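The tee pipe is easiest to understand with a function whose return value you do not want. In the sketch below (made-up numbers), str() prints a summary but returns NULL, so a regular pipe would hand NULL to sum(). The tee pipe hands over the square roots instead.

```r
library(magrittr)

# str() is called only for its printed output; %T>% then passes
# the left-hand value (the square roots) on to sum().
total <- c(1, 4, 9) %>%
  sqrt() %T>%
  str() %>%
  sum()

total  # 6
```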
patchwork: make your figures nice and cozy
Use this package when… you have multiple figures you’d like to present in a grid, or at least side-by-side. This package is great for positioning figures into neat, orderly positions.
install.packages("patchwork")
library(patchwork)
library(ggplot2) # using this to make plots.
Let’s make a few figures first.
# boxplot of height for the four most common eye colors
plot1 <- ggplot(starwars %>%
                  filter(eye_color %in% c("blue", "black", "brown", "yellow")),
                aes(x = eye_color,
                    y = height,
                    fill = eye_color)) +
  geom_boxplot(alpha = 0.7) +
  theme_classic() +
  theme(legend.position = "none")
# height vs mass, dropping missing genders and the extreme mass outlier
plot2 <- ggplot(starwars %>%
                  filter(!is.na(gender) & mass < 1000),
                aes(x = mass, y = height)) +
  geom_point(aes(shape = gender,
                 color = gender),
             size = 2) +
  theme_classic()
# mass vs birth year, one panel per sex (missing and "none" excluded)
plot3 <- starwars %>%
  filter(!is.na(sex) & sex != "none") %>%
  ggplot(aes(x = birth_year,
             y = mass,
             color = sex,
             shape = sex)) +
  geom_point(alpha = 0.5,
             size = 3) +
  facet_grid(sex ~ .)
Now, use patchwork to arrange them:
# two plots, side by side:
(plot1 | plot2)
# three plots: two in first row, one in second row
(plot1 | plot2) /
  plot3
# three plots: two in first row, one in second row PLUS title, subtitle, caption!
(plot1 | plot2) /
  plot3 + plot_annotation(
    title = 'Hardcore Star Wars fan? Look at these plots!',
    subtitle = "Who could've predicted that eye color and height aren't related?",
    caption = "Masculine characters are of many heights and masses; not so for feminine characters")
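Beyond the | and / operators, patchwork also offers plot_layout() for finer control over the grid. The sketch below is self-contained and uses the built-in mtcars dataset rather than the plots above; the object names scatter, boxes, and combined are just illustrative.

```r
library(ggplot2)
library(patchwork)

# Two small plots from the built-in mtcars dataset
scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
boxes <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# Side by side, with the first plot twice as wide as the second
combined <- scatter + boxes + plot_layout(widths = c(2, 1))
combined
```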
dplyr: quickly manipulate your data
Use this package when… you are trying to quickly manipulate, summarize, or combine data.
Often we are required to summarize data by certain categories, groups, or treatments. In other instances, we are looking to create a new column that contains a metric specific to our analysis. The dplyr package is designed around pipes, which makes it easy to build a workflow. Additionally, dplyr handles basic data management, including selecting certain columns, renaming them, or sorting data. Many of its verbs (select, group_by, the join functions) mirror SQL operations, which provides some familiarity for those with a data or computer science background.
install.packages("dplyr")
library(dplyr)
## Load sample dataset
starwars <- dplyr::starwars
### Calculate summary statistics for a group within the dataset.
summmarizedData <- starwars %>%
  group_by(homeworld) %>% ## select variables to summarize by
  summarize(avgHeight = mean(height), nObs = length(height)) ## select which columns to be summarized and by which function
summmarizedData
## # A tibble: 49 x 3
## homeworld avgHeight nObs
## <chr> <dbl> <int>
## 1 Alderaan 176. 3
## 2 Aleen Minor 79 1
## 3 Bespin 175 1
## 4 Bestine IV 180 1
## 5 Cato Neimoidia 191 1
## 6 Cerea 198 1
## 7 Champala 196 1
## 8 Chandrila 150 1
## 9 Concord Dawn 183 1
## 10 Corellia 175 2
## # ... with 39 more rows
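One thing to watch for with summaries like this: mean() returns NA for any group that contains a missing value. A minimal sketch of one way to handle that is shown below, using a tiny made-up table (toy) so the result is easy to check by eye. It also uses dplyr's n() helper to count rows.

```r
library(dplyr)

# Made-up data for illustration; one height is missing
toy <- tibble(
  homeworld = c("Tatooine", "Tatooine", "Naboo", "Naboo"),
  height    = c(172, NA, 96, 224)
)

summary_tbl <- toy %>%
  group_by(homeworld) %>%
  summarize(
    avgHeight = mean(height, na.rm = TRUE),  # ignore missing heights
    nObs      = n(),                         # rows per group
    nMissing  = sum(is.na(height))           # how many heights were missing
  )

summary_tbl
```

Reporting nMissing alongside the mean keeps it visible that some averages are based on fewer observations than nObs suggests.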
### Calculate summary statistics for multiple groups within the dataset.
summmarizedData <- starwars %>%
  group_by(homeworld, species) %>% ## select variables to summarize by
  summarize(avgHeight = mean(height), nObs = length(height)) ## select which columns to be summarized and by which function
summmarizedData
## # A tibble: 58 x 4
## # Groups: homeworld [49]
## homeworld species avgHeight nObs
## <chr> <chr> <dbl> <int>
## 1 Alderaan Human 176. 3
## 2 Aleen Minor Aleena 79 1
## 3 Bespin Human 175 1
## 4 Bestine IV Human 180 1
## 5 Cato Neimoidia Neimodian 191 1
## 6 Cerea Cerean 198 1
## 7 Champala Chagrian 196 1
## 8 Chandrila Human 150 1
## 9 Concord Dawn Human 183 1
## 10 Corellia Human 175 2
## # ... with 48 more rows
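The "Groups: homeworld" line in that output is a reminder that, by default, summarize() removes only the last grouping variable, so the result is still grouped. If you would rather get an ungrouped table back, the .groups argument is one option. The sketch below uses a small made-up table (toy) for illustration.

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(
  homeworld = c("Tatooine", "Tatooine", "Naboo"),
  species   = c("Human", "Droid", "Human"),
  height    = c(172, 167, 185)
)

# Default behaviour: the result stays grouped by homeworld
still_grouped <- toy %>%
  group_by(homeworld, species) %>%
  summarize(avgHeight = mean(height))

# .groups = "drop" returns an ungrouped table
ungrouped <- toy %>%
  group_by(homeworld, species) %>%
  summarize(avgHeight = mean(height), .groups = "drop")

group_vars(still_grouped)  # "homeworld"
group_vars(ungrouped)      # character(0)
```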
## Create a new column of estimated BMI for each person
starwarsBMI <- starwars %>%
  mutate(bmi = mass / (height/100)^2)
starwarsBMI %>% select(name, BMI = bmi) ## show calculated data and rename it to capitalize BMI
## # A tibble: 87 x 2
## name BMI
## <chr> <dbl>
## 1 Luke Skywalker 26.0
## 2 C-3PO 26.9
## 3 R2-D2 34.7
## 4 Darth Vader 33.3
## 5 Leia Organa 21.8
## 6 Owen Lars 37.9
## 7 Beru Whitesun lars 27.5
## 8 R5-D4 34.0
## 9 Biggs Darklighter 25.1
## 10 Obi-Wan Kenobi 23.2
## # ... with 77 more rows
## Find the individuals with the lowest BMI per planet
lowBMI <- starwarsBMI %>%
  group_by(homeworld) %>%
  slice(which.min(bmi))
lowBMI
## # A tibble: 40 x 15
## # Groups: homeworld [40]
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Leia O~ 150 49 brown light brown 19 fema~ femin~
## 2 Ratts ~ 79 15 none grey, blue unknown NA male mascu~
## 3 Lobot 175 79 none light blue 37 male mascu~
## 4 Jek To~ 180 110 brown fair blue NA male mascu~
## 5 Nute G~ 191 90 none mottled gr~ red NA male mascu~
## 6 Ki-Adi~ 198 82 white pale yellow 92 male mascu~
## 7 Jango ~ 183 79 black tan brown 66 male mascu~
## 8 Han So~ 180 80 brown fair brown 29 male mascu~
## 9 Adi Ga~ 184 50 none dark blue NA fema~ femin~
## 10 Darth ~ 175 80 none red yellow 54 male mascu~
## # ... with 30 more rows, and 6 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>, bmi <dbl>
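Recent versions of dplyr (1.0.0 and later) also provide slice_min(), which expresses the same "lowest value per group" idea a little more directly. The sketch below uses a made-up table (toy) with placeholder names, not the workshop's starwarsBMI object.

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(
  name      = c("A", "B", "C", "D"),
  homeworld = c("Naboo", "Naboo", "Tatooine", "Tatooine"),
  bmi       = c(24.1, 21.5, 26.0, 33.3)
)

lowest <- toy %>%
  group_by(homeworld) %>%
  slice_min(bmi, n = 1, with_ties = FALSE) %>%  # one row per group
  ungroup()

lowest
```

Setting with_ties = FALSE guarantees a single row per group even when two individuals share the lowest value.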
A cheatsheet for frequently used functions from dplyr can be found here.
tidyr: quickly manipulate your data (continued)
Use this package when… you are trying to quickly manipulate, summarize, or combine data.
The tidyr package comes from the same author as dplyr and shares the same pipe-based syntax. There are many functions in this package, but there are two tasks that I think are extremely useful: 1) converting data between long and wide formats and 2) separating a column into multiple columns.
install.packages("tidyr")
library(tidyr)
## Load sample dataset
data(starwars)
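### Select a table with each individual's species and mass, then convert it from long to wide format
## (spread() and gather() still work, but tidyr now recommends pivot_wider() and pivot_longer())
wideMat <- starwars %>%
  select(name, species, mass) %>%
  spread(species, mass)
## convert back to long format (gather every column except name)
longMat <- wideMat %>%
  gather(species, mass, -name)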
## Split colour into multiple columns (fill = "right" pads missing pieces with NA without a warning)
starwars %>%
  separate(skin_color, sep = ", ", into = c("MainColour", "SecondaryColour", "AncillaryColour"), fill = "right") %>%
  select(name, MainColour, SecondaryColour, AncillaryColour)
## # A tibble: 87 x 4
## name MainColour SecondaryColour AncillaryColour
## <chr> <chr> <chr> <chr>
## 1 Luke Skywalker fair <NA> <NA>
## 2 C-3PO gold <NA> <NA>
## 3 R2-D2 white blue <NA>
## 4 Darth Vader white <NA> <NA>
## 5 Leia Organa light <NA> <NA>
## 6 Owen Lars light <NA> <NA>
## 7 Beru Whitesun lars light <NA> <NA>
## 8 R5-D4 white red <NA>
## 9 Biggs Darklighter light <NA> <NA>
## 10 Obi-Wan Kenobi fair <NA> <NA>
## # ... with 77 more rows
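For the long-to-wide round trip, tidyr's documentation now points to pivot_wider() and pivot_longer() as the recommended replacements for spread() and gather(). The sketch below shows the same idea on a tiny made-up table (the object names long, wide, and back are just illustrative).

```r
library(dplyr)
library(tidyr)

# Made-up long-format data for illustration: one row per individual
long <- tibble(
  name    = c("Luke", "C-3PO", "Leia"),
  species = c("Human", "Droid", "Human"),
  mass    = c(77, 75, 49)
)

# Long to wide: one column per species, filled with mass
wide <- long %>%
  pivot_wider(names_from = species, values_from = mass)

# Wide back to long, dropping the empty cells created by widening
back <- wide %>%
  pivot_longer(cols = -name,
               names_to = "species",
               values_to = "mass",
               values_drop_na = TRUE)

wide
back
```

The names_from, values_from, names_to, and values_to arguments spell out which column plays which role, which many people find easier to remember than the positional arguments of spread() and gather().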
annotater: finally remember why you loaded all those packages!
Use this package when… you forget why you loaded packages at the top of your R script/notebook/markdown file OR you want to clarify why you did so for collaborators. (+1 points for reproducible science!)
It happens: you start your R file with a list of packages to be loaded with your library() call. You constantly add to it, listing more packages whose functions you use to complete your analysis. Over time, you figure out that your advisors would prefer if you didn’t use the Wes Anderson color palette, or that you’re better off creating figures with patchwork vs cowplot (sorry, cowplot). So, if you’re anything like me, after learning stats and R simultaneously while doing your first thesis project, you end up with a very impressive list of R packages, half of which you can’t remember why you loaded in the first place.
Have no fear! This is where the genius of the annotater package comes in to save us (and those who try to read our code, bless them)!
install.packages(c("remotes", "see"))
remotes::install_github("luisDVA/annotater")
After you’ve installed remotes and annotater, save your R files, close RStudio, and reopen it.
Click anywhere in the Source pane (aka the one with your R files).
Navigate your cursor to the “Addins” button in the bar below the File-Edit-Code-View etc. bar. Click it and select, “Annotate package calls in active file”. Voila!
Gif by Luis Verde Arregoitia
performance: evaluate your general linear models in a flash
Use this package when… you want to evaluate the fit of your linear models.
install.packages("performance")
library(performance)
We’ll set up a linear model, then run some diagnostics:
model1 <- lm(height ~ mass, starwars)
performance::check_model(model1)
Ouch, looks like there’s a major outlier! Jabba the Hutt strikes again!
Let’s remove Jabba’s data point and refit the model.
model1_withoutJabba <- lm(height ~ mass, starwars %>%
                            filter(name != "Jabba Desilijic Tiure"))
performance::check_model(model1_withoutJabba)
Looks MUCH better. Still not fantastic, but better.
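If you would like to confirm which observation is pulling on the fit rather than spotting it by eye, base R's cooks.distance() is one option. The sketch below is an assumption about how you might do this, not part of the performance package. It refits the first model so that it runs on its own, and cooks_d and most_influential are illustrative names.

```r
library(dplyr)  # only needed here for the starwars dataset

model1 <- lm(height ~ mass, data = dplyr::starwars)

# Cook's distance measures how much each observation influences the fit
cooks_d <- cooks.distance(model1)

# The names of cooks_d are row numbers in the original data
# (rows with missing height or mass are dropped by lm())
most_influential_row <- as.integer(names(which.max(cooks_d)))
most_influential <- dplyr::starwars$name[most_influential_row]
most_influential
```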
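Now we can compare the performance of each model. Because the two models were fit to different data (one includes Jabba, one does not), treat the AIC and BIC columns with caution, as the warning in the output notes; R2 and RMSE are the more informative columns here.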
comparison <- performance::compare_performance(model1, model1_withoutJabba) %>%
  as.data.frame() %T>%
  print()
## Warning: When comparing models, please note that probably not all models were fit from
## same data.
## Name Model AIC AIC_wt BIC BIC_wt
## 1 model1 lm 592.6594 2.357267e-13 598.8920 2.297591e-13
## 2 model1_withoutJabba lm 534.5072 1.000000e+00 540.6885 1.000000e+00
## R2 R2_adjusted RMSE Sigma
## 1 0.01792498 0.0006955919 34.90924 35.51640
## 2 0.57951869 0.5720100931 23.03830 23.44609
Package (and this image) developed by Daniel Lüdecke, Dominique Makowski, Mattan S. Ben-Shachar, Indrajeet Patil, Philip Waggoner, Brenton M. Wiernik.
Please take 5 minutes to check out the “Feedback Page” link below and give us some feedback about today’s workshop. This will help us teach better workshops, not to mention show future employers that we might actually be sort of good at this. Thanks!