Estimated lesson teaching time: 20 minutes

Assumed background knowledge: Basic dplyr commands, how to use magrittr pipes (%>%)

Overview

What is SQL?

Many of us are accustomed to producing our own data or downloading it from repositories, saving it to a folder on our local machines, and then analyzing it in a program like RStudio. However, sometimes it’s not possible or practical to store data on a local drive, such as if it is very large or of a sensitive nature. In cases like this, it’s better to store the data on a remote server and obtain the information we need by querying (retrieving specific information from) the dataset, rather than downloading the entire thing and working with it locally. A commonly-used language for querying databases is called SQL, pronounced “S-Q-L” or “sequel”. It stands for “Structured Query Language”.

Why are we including an introduction to SQL?

A few reasons:

  • SQL proficiency is frequently cited as a sought-after skill in many fields, including data science. It’s safe to say that knowing SQL is typically a very helpful asset when entering data science.

  • Though we can’t provide a comprehensive introduction to SQL, we thought it’d be helpful to offer a little motivation to learn more and a small push to help get you started.

  • Many academic ecologists are used to collecting their own data and storing it locally. This lesson will help acquaint you with practices frequently used in industry, where it is much more common to work with data stored in remote databases.

Important aside: Data science is an exceptionally diverse field that is impossible to describe with just a few sentences and includes highly valued workers who do not use SQL regularly.

What’s my secret weapon?

Yes, you! When it comes to learning SQL, your secret weapon is this:

If you are an R user familiar with dplyr, you already know the basics of SQL!

Credit: Ben Baumer and R. Jordan Crouser, from their SDS 192: Introduction to Data Science slides.

SQL has been around for decades (since the 70s). dplyr is an R package launched in 2014 whose grammar is based on SQL syntax.

Basics of SQL for R Users

We have 3 goals for today’s lesson:

  1. Set up coding environment to query SQL databases with R
  2. Learn basic similarities and differences between SQL and dplyr syntax
  3. Practice translating dplyr to SQL and vice versa

By accomplishing these goals, we hope to give you enough confidence to continue learning SQL and better prepare yourself for a career involving data science.

Note: This lesson was structured similarly to the Data Carpentry “SQL and R Databases” lesson but is shorter and designed for our specific attendee group.

Set up workspace

Install packages

install.packages(c("dplyr", "dbplyr", "RSQLite", "magrittr"))

Load libraries

library("dplyr")
library("dbplyr")
library("RSQLite")
library("magrittr")

Connect to database

First, create a new folder in your home directory called “data_raw”. Then, download the SQLite database. Finally, connect with the database and name it “mammals”.

# create raw data folder
dir.create("data_raw",
           showWarnings = FALSE)

# download sqlite database
download.file(url = "https://ndownloader.figshare.com/files/2292171",
              destfile = "data_raw/portal_mammals.sqlite",
              mode = "wb")

# access the database and name it "mammals"
mammals <- DBI::dbConnect(
  RSQLite::SQLite(),
  "data_raw/portal_mammals.sqlite")

Hold up. What is going on in that last line of code?

  • dbConnect is a function from the DBI package. We’re using it to send R commands to our database and can interface with lots of databases regardless of the management system (e.g. MySQL, PostgreSQL, BigQuery, etc.)

  • The SQLite function from the RSQLite package lets us R users interface with SQLite databases specifically. We are telling R to connect to the portal_mammals.sqlite database.

Explore the data

First, let’s see what our data looks like.

src_dbi(mammals)
## src:  sqlite 3.39.1 [C:\Users\sbrei\Documents\R_Projects\Collabs\ESA_CSEE_2022\SQL_tutorial\SQL_intro\data_raw\portal_mammals.sqlite]
## tbls: plots, species, surveys

We see that this database contains three tables: plots, species, and surveys. Let’s take a closer look at the surveys table.

surveys <- tbl(mammals, "surveys")
print(surveys)
## # Source:   table<surveys> [?? x 9]
## # Database: sqlite 3.39.1 [C:\Users\sbrei\Documents\R_Projects\Collabs\ESA_CSEE_2022\SQL_tutorial\SQL_intro\data_raw\portal_mammals.sqlite]
##    record_id month   day  year plot_id species_id sex   hindfoot_length weight
##        <int> <int> <int> <int>   <int> <chr>      <chr>           <int>  <int>
##  1         1     7    16  1977       2 NL         M                  32     NA
##  2         2     7    16  1977       3 NL         M                  33     NA
##  3         3     7    16  1977       2 DM         F                  37     NA
##  4         4     7    16  1977       7 DM         M                  36     NA
##  5         5     7    16  1977       3 DM         M                  35     NA
##  6         6     7    16  1977       1 PF         M                  14     NA
##  7         7     7    16  1977       2 PE         F                  NA     NA
##  8         8     7    16  1977       1 DM         M                  37     NA
##  9         9     7    16  1977       1 DM         F                  34     NA
## 10        10     7    16  1977       6 PF         F                  20     NA
## # ... with more rows
## # i Use `print(n = ...)` to see more rows

The first line of the output imparts an important lesson: R tries to be as lazy as possible when interfacing with databases. For instance, this simple command does not tell us the dimensions of the surveys table. The length is unknown (“??”) but the width = 9 columns. Unless we ask it for the length specifically, it will not tell us because R did not actually view the entire table when fetching the first 10 rows for us.

The table itself looks similar to what we’re used to manipulating in R as data frames. There are at least 10 rows and several columns, including “record_id”, “month”, and “day”. This table contains information about mammals observed in different plots over several years.

Query the data

Let’s start wrangling the data and ask: how many males were observed in 1978 in plot 2?

Challenge: convert this question into a dplyr command. (To see the answer, press the “CODE” button on the right side of the page).

surveys %>%
  filter(sex == "M",
         year == "1978",
         plot_id == "2") %>%
  count() # there are 52 individuals.
## # Source:   SQL [1 x 1]
## # Database: sqlite 3.39.1 [C:\Users\sbrei\Documents\R_Projects\Collabs\ESA_CSEE_2022\SQL_tutorial\SQL_intro\data_raw\portal_mammals.sqlite]
##       n
##   <int>
## 1    52

As you can see, querying the surveys table with a question like this isn’t really any different than working with an R data frame!

Save subset to R environment

If you want to save a subset of a table as a new object, then take note of a few differences when working with SQL databases.

Specifically, we need to add one more line of code to tell our lazy bones of an R interface to finish the command.

Example: ask R to save a subset of the surveys table that 1) includes data from 1998-2000 and 2) counts up the number of species seen per year.

annual_species <- surveys %>%
  filter(year == "1998" |
           year == "1999" |
           year == "2000") %>%
  group_by(year, species_id) %>%
  summarise(n = n()) %>%
  collect() # <-  don't forget this command!
## `summarise()` has grouped output by "year". You can override using the
## `.groups` argument.

Specifically, the collect() command is essential for telling R that we’re done creating a query and are ready to implement it.

Notice the change in data structure if we do not include the collect() command:

annual_species_notcollected <- surveys %>%
  filter(year == "1998" |
           year == "1999" |
           year == "2000") %>%
  group_by(year, species_id) %>%
  summarise(n = n())

class(annual_species_notcollected)
## [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
## [4] "tbl_lazy"             "tbl"
class(annual_species)
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
  • annual_species_notcollected is not a data frame.

  • annual_speciesis a data frame.

Translate dplyr syntax to SQL

A simple way to translate your dplyr syntax to SQL syntax is to pipe in the show_query() function.

Translated R –> SQL syntax

annual_species_notcollected %>%
  show_query()
## <SQL>
## SELECT `year`, `species_id`, COUNT(*) AS `n`
## FROM `surveys`
## WHERE (`year` = '1998' OR `year` = '1999' OR `year` = '2000')
## GROUP BY `year`, `species_id`

Original R syntax

annual_species_notcollected <- surveys %>%
  filter(year == "1998", "1999", "2000") %>%
  group_by(year, species_id) %>%
  summarise(n = n())

Notice: How does the original R code compare to the translated SQL syntax?

This tool is very handy for learning SQL, but practitioners recommend learning at least basic SQL syntax when getting into data science. Another reason why? So that you can read, in addition to write, SQL.

Challenge

SQL to R

Translate this SQL query into R code. Click here for a straightforward SQL cheat sheet and a dplyr cheat sheet to help get you started. Remember to unhide the correct translation by clicking the “CODE” button below the SQL syntax section.

SQL syntax

SELECT *

FROM (SELECT record_id, hindfoot_length

FROM surveys)

WHERE (hindfoot_length > 20.0)


R syntax translated:

surveys %>%
  select(record_id, hindfoot_length) %>%
  filter(hindfoot_length > 20)

R to SQL

Translate this dplyr code into an SQL query by filling in the blanks. Remember to unhide the correct translation by clicking the “CODE” button below the SQL syntax section.

R syntax

surveys %>%
  filter(year > 1977 & year < 2000) %>%
  filter(hindfoot_length == 30) %>%
  arrange(desc(species_id))

SQL syntax

SELECT *

FROM (SELECT *

FROM ____

WHERE (year > 1977.0 ____ year < 2000.0))

____ (hindfoot_length = 30.0)

ORDER BY species_id ____

HINT: For a word bank of possible SQL terms, click the “CODE” button directly below ↓

AND
DESC
`surveys`
WHERE


SQL syntax translated:

Underscrores around words (e.g. __example__) indicate correct fill-in-the-blank answers.


SELECT *
FROM (SELECT *
FROM __`surveys`__
WHERE (`year` > 1977.0 __AND__ `year` < 2000.0))
__WHERE__ (`hindfoot_length` = 30.0)
ORDER BY `species_id` __DESC__

Next steps

Here are some resources to help you continue on your SQL-learning path.

  1. Stretch your skills further with Ethan White and Zachary Brym’s excellent “Integrating R and SQL” assignment.
  2. Work through the Carpentries’ “SQL Databases and R” lesson. This great lesson instructs learners how to join tables- an exceptionally important skill for any SQL user- among other essential tasks.
  3. Dive deeper into the dbplyr package’s documentation to learn more about interfacing with SQL databases using R.

Home