The purpose of this segment is to share some of our favorite tools for working with data in R. Feel free to integrate these tools into your own workflows, and ask your peers which tools they can’t live without!

here: no more getting lost in file paths

Use this package when… all the time. Make it a part of your regular coding routine. It’s that good.

As we introduced in the previous module, here is an excellent package that’s worth getting to know because it lets you build file paths relative to your project root instead of hard-coding absolute pathnames. This simplifies importing and exporting files as well as sharing your code with others.

How to install

install.packages("here")

How to use

library(here)
## here() starts at C:/Users/sbrei/Documents/R_Projects/Collabs/BGSS_Retreat_2021
here::here()   # once you call the package with the library(here) call, you can use this function to remind you where your project root begins.
## [1] "C:/Users/sbrei/Documents/R_Projects/Collabs/BGSS_Retreat_2021"
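
As a minimal sketch of how this helps with importing files, here() can also take path components and join them onto the project root. The data folder and the measurements.csv file below are hypothetical names used only for illustration.

```r
library(here)

# Hypothetical layout: <project root>/data/measurements.csv
csv_path <- here("data", "measurements.csv")

# The path is built from the project root, so it resolves the same way
# no matter which folder the script is run from.
csv_path

# Typical use (commented out because the file is hypothetical):
# measurements <- read.csv(csv_path)
```

Because the path never mentions a user-specific folder, a collaborator who opens the same project should not need to edit it.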


Artwork by Allison Horst. setwd(), be gone!

magrittr: these pipes will make your work flow

Use this package when… you want to make your code more readable. The magrittr package has a collection of operators called pipes that, as stated on its webpage, work by:

  • "structuring sequences of data operations left-to-right (as opposed to from the inside and out),
  • avoiding nested function calls,
  • minimizing the need for local variables and function definitions, and
  • making it easy to add steps anywhere in the sequence of operations."

How to install

install.packages("magrittr")

How to use

library(magrittr)
library(dplyr) # we'll use some functions from this package to highlight the magrittr pipes

Also load a sample dataset:

starwars <- dplyr::starwars

magrittr has several types of pipes which can all be useful in different contexts. The main pipe is the most common, but the others can be super handy too.

Basic pipe

The basic pipe looks like this: %>%.

We’ll frequently set up our code like this:

new_object <- old_object %>% some_function()

Add a pipe after each function to connect them into one cohesive operation. In other words, you take your initial object (the one you want to apply functions to) and apply a function that changes it. You then “pipe” that changed object into the next function, where it becomes the new object to be modified, apply another function to it, and so on.
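
To make the left-to-right idea concrete, here is a small sketch on a made-up vector of heights (the values are illustrative, not taken from the dataset). It computes the same result twice, once with nested calls and once with pipes.

```r
library(magrittr)

heights <- c(172, 167, 96, 202)

# Nested calls: read from the inside out
nested <- round(mean(sqrt(heights)), 1)

# Piped: read from left to right, one step at a time
piped <- heights %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)  # TRUE
```

Both versions do the same work; the piped one simply lists the steps in the order they happen.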

Old way, without pipes

We end up having to define three objects.

# select only the first 4 columns of dataset
character_traits <- starwars[, 1:4] 

# select only characters that are at least 100 cm tall
# (which() drops rows with a missing height, matching what dplyr::filter() does)
tall_characters <- character_traits[which(character_traits$height >= 100), ]

# order characters by height
tall_characters_sorted <- tall_characters[order(tall_characters$height), ]

New way, with pipes

We only have to define one object. If we kept on adding functions with pipes, we wouldn’t have to add any more objects, either.

tall_characters_sorted <- starwars %>%
  # select only the first 4 columns of dataset
  dplyr::select(1:4) %>%
  # select only characters that are at least 100 cm tall
  dplyr::filter(height >= 100) %>%
  # order characters by height
  dplyr::arrange(height)

Assignment pipe

This pipe (%<>%) is helpful for situations where you’d rather overwrite your original object. Be careful!

Instead of writing old_object <- old_object %>% some_function(), like this:

character_traits <- character_traits %>%
  # keep only characters with black hair
  dplyr::filter(hair_color == "black")

Do this instead: old_object %<>% some_function() (notice that the pipe has two arrows inside it now)

character_traits %<>% 
  # keep only characters with black hair
  dplyr::filter(hair_color == "black")
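
Since the assignment pipe overwrites its input, one cautious habit is to keep a copy of the original first. A minimal sketch with a made-up vector (scores and scores_original are illustrative names):

```r
library(magrittr)

scores <- c(4, 9, 16)
scores_original <- scores  # keep a copy, because %<>% overwrites scores

# Equivalent to: scores <- scores %>% sqrt()
scores %<>% sqrt()

scores           # 2 3 4
scores_original  # 4 9 16
```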

Tee pipe

Use this pipe (%T>%) when you want to print a “side effect” of an expression. For example, use this when you want to see what your plot looks like without writing a separate command on another line, like this:

plot1 <- ggplot2::qplot(character_traits$height, character_traits$mass)

plot1

Write it this way, instead:

plot1 <- ggplot2::qplot(character_traits$height, character_traits$mass) %T>%
  plot()
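
The tee pipe is easiest to see with a function that returns something other than its input. In this sketch on a made-up vector, str() prints a summary but returns NULL, so an ordinary pipe would hand NULL to the next step; the tee pipe hands over the original object instead.

```r
library(magrittr)

heights <- c(172, 167, 96)

# str() is run only for its side effect (printing);
# %T>% then passes heights, not str()'s return value, on to sum().
total <- heights %T>%
  str() %>%
  sum()

total  # 435
```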

patchwork: make your figures nice and cozy

Use this package when… you have multiple figures you’d like to present in a grid, or at least side-by-side. This package is great for positioning figures into neat, orderly positions.

How to install

install.packages("patchwork")

How to use

library(patchwork)
library(ggplot2) # using this to make plots.

Let’s make a few figures first.

plot1 <- ggplot(starwars %>%
                  filter(eye_color %in% c("blue", "black", "brown", "yellow")),
                aes(x = eye_color,
                    y = height,
                    fill = eye_color)) +
  geom_boxplot(alpha = 0.7) +
  theme_classic() +
  theme(legend.position = "none")

plot2 <- ggplot(starwars %>%
                  filter(!is.na(gender) & mass < 1000),
                aes(x = mass, y = height)) +
  geom_point(aes(shape = gender,
                 color = gender),
             size = 2) +
  theme_classic()


plot3 <- starwars %>%
  # drop characters whose sex is missing or recorded as "none"
  filter(!is.na(sex) & sex != "none") %>%
  ggplot(aes(x = birth_year,
             y = mass,
             color = sex,
             shape = sex)) +
  geom_point(alpha = 0.5,
             size = 3) +
  facet_grid(sex ~ .)

Now, use patchwork to arrange them:

# two plots, side by side:
(plot1 | plot2)

# three plots: two in first row, one in second row 
(plot1 | plot2) /
plot3

# three plots: two in first row, one in second row PLUS title, subtitle, caption!
(plot1 | plot2) /
plot3 + plot_annotation(
  title = 'Hardcore Star Wars fan? Look at these plots!',
  subtitle = "Who could've predicted that eye color and height aren't related?",
  caption = "Masculine characters are of many heights and masses; not so for feminine characters" )
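
Beyond the | and / operators, patchwork also provides plot_layout() for finer control over the grid. The sketch below uses the built-in mtcars data so that it runs on its own; p_a and p_b are illustrative stand-ins for the plots above.

```r
library(ggplot2)
library(patchwork)

# Two simple plots from a built-in dataset
p_a <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
p_b <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# Side by side, with the first plot twice as wide as the second
combined <- p_a + p_b + plot_layout(widths = c(2, 1))
combined
```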

dplyr: Quickly manipulate your data

Use this package when… you are trying to quickly manipulate, summarize, or combine data.

Often we need to summarize data by certain categories, groups, or treatments. In other instances, we want to create a new column that contains a metric specific to our analysis. The dplyr package is designed to work with pipes, which makes it easy to build a workflow step by step. Additionally, dplyr handles basic data management, including selecting certain columns, renaming them, or sorting data. Many of its verbs mirror SQL operations (selecting, filtering, grouping, joining), which provides some familiarity for those with a data or computer science background.

How to install

install.packages("dplyr")

How to use

library(dplyr)

## Load sample dataset
starwars <- dplyr::starwars

### Calculate summary statistics for a group within the dataset.
summarizedData <- starwars %>% 
  group_by(homeworld) %>%  ## select the variable(s) to group by
  summarize(avgHeight = mean(height), nObs = n()) ## choose the columns to summarize and the functions to apply
summarizedData
## # A tibble: 49 x 3
##    homeworld      avgHeight  nObs
##    <chr>              <dbl> <int>
##  1 Alderaan            176.     3
##  2 Aleen Minor          79      1
##  3 Bespin              175      1
##  4 Bestine IV          180      1
##  5 Cato Neimoidia      191      1
##  6 Cerea               198      1
##  7 Champala            196      1
##  8 Chandrila           150      1
##  9 Concord Dawn        183      1
## 10 Corellia            175      2
## # ... with 39 more rows

### Calculate summary statistics for multiple groups within the dataset.
summarizedData <- starwars %>% 
  group_by(homeworld, species) %>%  ## select the variables to group by
  summarize(avgHeight = mean(height), nObs = n()) ## choose the columns to summarize and the functions to apply
summarizedData
## # A tibble: 58 x 4
## # Groups:   homeworld [49]
##    homeworld      species   avgHeight  nObs
##    <chr>          <chr>         <dbl> <int>
##  1 Alderaan       Human          176.     3
##  2 Aleen Minor    Aleena          79      1
##  3 Bespin         Human          175      1
##  4 Bestine IV     Human          180      1
##  5 Cato Neimoidia Neimodian      191      1
##  6 Cerea          Cerean         198      1
##  7 Champala       Chagrian       196      1
##  8 Chandrila      Human          150      1
##  9 Concord Dawn   Human          183      1
## 10 Corellia       Human          175      2
## # ... with 48 more rows

## Create a new column of estimated BMI for each character
starwarsBMI <- starwars %>% 
  mutate(bmi = mass / (height / 100)^2) ## mass in kg; height converted from cm to m
starwarsBMI %>% select(name, BMI = bmi) ## show the calculated data, renaming the column to capitalize BMI
## # A tibble: 87 x 2
##    name                 BMI
##    <chr>              <dbl>
##  1 Luke Skywalker      26.0
##  2 C-3PO               26.9
##  3 R2-D2               34.7
##  4 Darth Vader         33.3
##  5 Leia Organa         21.8
##  6 Owen Lars           37.9
##  7 Beru Whitesun lars  27.5
##  8 R5-D4               34.0
##  9 Biggs Darklighter   25.1
## 10 Obi-Wan Kenobi      23.2
## # ... with 77 more rows

## Find the individual with the lowest BMI on each planet
lowBMI <- starwarsBMI %>% 
  group_by(homeworld) %>% 
  slice(which.min(bmi)) ## which.min() skips NAs; groups with no BMI values are dropped
lowBMI
## # A tibble: 40 x 15
## # Groups:   homeworld [40]
##    name    height  mass hair_color skin_color  eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr> 
##  1 Leia O~    150    49 brown      light       brown             19 fema~ femin~
##  2 Ratts ~     79    15 none       grey, blue  unknown           NA male  mascu~
##  3 Lobot      175    79 none       light       blue              37 male  mascu~
##  4 Jek To~    180   110 brown      fair        blue              NA male  mascu~
##  5 Nute G~    191    90 none       mottled gr~ red               NA male  mascu~
##  6 Ki-Adi~    198    82 white      pale        yellow            92 male  mascu~
##  7 Jango ~    183    79 black      tan         brown             66 male  mascu~
##  8 Han So~    180    80 brown      fair        brown             29 male  mascu~
##  9 Adi Ga~    184    50 none       dark        blue              NA fema~ femin~
## 10 Darth ~    175    80 none       red         yellow            54 male  mascu~
## # ... with 30 more rows, and 6 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>, bmi <dbl>
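
One thing to watch in the summaries above: mean(height) returns NA for any group that contains a missing height. A minimal sketch of one alternative, assuming you would rather ignore missing values, is to pass na.rm = TRUE. The column names avg_height and n_obs are illustrative.

```r
library(dplyr)

height_by_homeworld <- dplyr::starwars %>%
  group_by(homeworld) %>%
  summarize(
    avg_height = mean(height, na.rm = TRUE),  # ignore missing heights
    n_obs = n(),                              # number of rows in each group
    .groups = "drop"                          # return an ungrouped tibble
  ) %>%
  arrange(desc(n_obs))                        # largest groups first

height_by_homeworld
```

Whether dropping missing values is appropriate depends on the analysis, so treat this as one option rather than a default.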

A cheatsheet for frequently used functions from dplyr can be found here.

tidyr: Quickly manipulate your data (continued)

Use this package when… you need to reshape data between long and wide formats or split one column into several.

The tidyr package comes from the same author as dplyr and shares the same pipe-friendly syntax. There are many functions in this package, but two jobs I find extremely useful are: 1) converting data between long and wide formats and 2) separating a column into multiple columns.

How to install

install.packages("tidyr")

How to use

library(tidyr)

## Load sample dataset
data(starwars)

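### Select each character's name, species, and mass, then convert from long to wide format
### (one column per species). spread() still works but has been superseded by pivot_wider().
wideMat <- starwars %>% 
            select(name, species, mass) %>% 
            spread(species, mass)

## convert back to long format (gather() has been superseded by pivot_longer())
longMat <- wideMat %>% 
            gather(species, mass, -name)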


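## Split skin colour into multiple columns
## (fill = "right" pads missing pieces with NA without emitting a warning)
starwars %>% 
  separate(skin_color, sep = ", ", into = c("MainColour", "SecondaryColour", "AncillaryColour"), fill = "right") %>% 
  select(name, MainColour, SecondaryColour, AncillaryColour)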
## # A tibble: 87 x 4
##    name               MainColour SecondaryColour AncillaryColour
##    <chr>              <chr>      <chr>           <chr>          
##  1 Luke Skywalker     fair       <NA>            <NA>           
##  2 C-3PO              gold       <NA>            <NA>           
##  3 R2-D2              white      blue            <NA>           
##  4 Darth Vader        white      <NA>            <NA>           
##  5 Leia Organa        light      <NA>            <NA>           
##  6 Owen Lars          light      <NA>            <NA>           
##  7 Beru Whitesun lars light      <NA>            <NA>           
##  8 R5-D4              white      red             <NA>           
##  9 Biggs Darklighter  light      <NA>            <NA>           
## 10 Obi-Wan Kenobi     fair       <NA>            <NA>           
## # ... with 77 more rows
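
The long/wide conversion shown earlier with spread() and gather() can also be written with tidyr's newer pivot_wider() and pivot_longer() functions. The sketch below uses a tiny made-up data frame (the characters "A" and "B" and their values are hypothetical) so that the reshaping is easy to follow.

```r
library(tidyr)

# Long format: one row per character-trait combination
long <- data.frame(
  name  = c("A", "A", "B", "B"),
  trait = c("height", "mass", "height", "mass"),
  value = c(172, 77, 96, 32)
)

# Long to wide: one row per character, one column per trait
wide <- pivot_wider(long, names_from = trait, values_from = value)
wide

# Wide back to long
back_to_long <- pivot_longer(wide,
                             cols = c(height, mass),
                             names_to = "trait",
                             values_to = "value")
back_to_long
```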

annotater: finally remember why you loaded all those packages!

Use this package when… you forget why you loaded packages at the top of your R script/notebook/markdown file OR you want to clarify why you did so for collaborators. (+1 points for reproducible science!)

It happens: you start your R file with a list of packages to be loaded with library() calls. You constantly add to it, listing more packages whose functions you use to complete your analysis. Over time, you figure out that your advisors would prefer if you didn’t use the Wes Anderson color palette, or that you’re better off creating figures with patchwork vs cowplot (sorry, cowplot). So, if you’re anything like me, after learning stats and R simultaneously while doing your first thesis project, you end up with a very impressive list of R packages, half of which you can’t remember why you loaded in the first place.

Have no fear! This is where the genius of the annotater package comes in to save us (and those who try to read our code, bless them)!

How to install

install.packages(c("remotes", "see"))
remotes::install_github("luisDVA/annotater")

How to use

After you’ve installed remotes and annotater, save your R files, close RStudio, and reopen it.

Click anywhere in the Source pane (aka the one with your R files).

Navigate your cursor to the “Addins” button in the bar below the File-Edit-Code-View etc. bar. Click it and select, “Annotate package calls in active file”. Voila!

Gif by Luis Verde Arregoitia

performance: evaluate your general linear models in a flash

Use this package when… you want to evaluate the fit of your linear models.

How to install

install.packages("performance")

How to use

library(performance)

We’ll set up a linear model, then run some diagnostics:

model1 <- lm(height ~ mass, starwars)


performance::check_model(model1)

Ouch, looks like there’s a major outlier! Jabba the Hutt strikes again!
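
One quick way to confirm which character is behind that outlier is to look up the heaviest row in the dataset. A small sketch using dplyr's slice_max() (heaviest is an illustrative name):

```r
library(dplyr)

heaviest <- dplyr::starwars %>%
  slice_max(mass, n = 1) %>%   # row with the largest recorded mass
  select(name, mass, height)

heaviest
```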

Let’s remove Jabba’s data point and refit the model.

model1_withoutJabba <- lm(height ~ mass,
                          data = starwars %>%
                            filter(name != "Jabba Desilijic Tiure"))

performance::check_model(model1_withoutJabba)

Looks MUCH better. Still not fantastic, but better.

Now we can compare the performance of each model.

comparison <- performance::compare_performance(model1, model1_withoutJabba) %>%
  as.data.frame() %T>%
  print()
## Warning: When comparing models, please note that probably not all models were fit from
##   same data.
##                  Name Model      AIC       AIC_wt      BIC       BIC_wt
## 1              model1    lm 592.6594 2.357267e-13 598.8920 2.297591e-13
## 2 model1_withoutJabba    lm 534.5072 1.000000e+00 540.6885 1.000000e+00
##           R2  R2_adjusted     RMSE    Sigma
## 1 0.01792498 0.0006955919 34.90924 35.51640
## 2 0.57951869 0.5720100931 23.03830 23.44609
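
The warning above is worth taking seriously: the two models were fit to different sets of rows, so their AIC and BIC values are not directly comparable. As a hedged sketch of inspecting one model at a time, the performance package also offers model_performance() and r2(). The example uses the built-in mtcars data, not the workshop's Star Wars model, so that it runs on its own.

```r
library(performance)

# Illustrative model on a built-in dataset
car_model <- lm(mpg ~ wt, data = mtcars)

# One-row summary of fit indices for this model
fit_indices <- model_performance(car_model)
fit_indices

# R-squared on its own
r2(car_model)
```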

Package (and this image) developed by Daniel Lüdecke, Dominique Makowski, Mattan S. Ben-Shachar, Indrajeet Patil, Philip Waggoner, Brenton M. Wiernik.

Survey

Please take 5 minutes to check out the “Feedback Page” link below and give us some feedback about today’s workshop. This will help us teach better workshops, not to mention show future employers that we might actually be sort of good at this. Thanks!

Home Feedback Page