Estimated lesson teaching time: 20 minutes
Assumed background knowledge: Basic dplyr commands, how to use magrittr
pipes (%>%
)
Many of us are accustomed to producing our own data or downloading it from repositories, saving it to a folder on our local machines, and then analyzing it in a program like RStudio. However, sometimes it’s not possible or practical to store data on a local drive, such as if it is very large or of a sensitive nature. In cases like this, it’s better to store the data on a remote server and obtain the information we need by querying (retrieving specific information from) the dataset, rather than downloading the entire thing and working with it locally. A commonly-used language for querying databases is called SQL, pronounced “S-Q-L” or “sequel”. It stands for “Structured Query Language”.
A few reasons:
SQL proficiency is frequently cited as a sought-after skill in many fields, including data science. It’s safe to say that knowing SQL is typically a very helpful asset when entering data science.
Though we can’t provide a comprehensive introduction to SQL, we thought it’d be helpful to offer a little motivation to learn more and a small push to help get you started.
Many academic ecologists are used to collecting their own data and storing it locally. This lesson will help acquaint you with practices frequently used in industry, where it is much more common to work with data stored in remote databases.
Important aside: Data science is an exceptionally diverse field that is impossible to describe with just a few sentences and includes highly valued workers who do not use SQL regularly.
We have 3 goals for today’s lesson:
dplyr
syntaxdplyr
to SQL and vice versaBy accomplishing these goals, we hope to give you enough confidence to continue learning SQL and better prepare yourself for a career involving data science.
Note: This lesson was structured similarly to the Data Carpentry “SQL and R Databases” lesson but is shorter and designed for our specific attendee group.
install.packages(c("dplyr", "dbplyr", "RSQLite", "magrittr"))
library("dplyr")
library("dbplyr")
library("RSQLite")
library("magrittr")
First, create a new folder in your home directory called “data_raw”. Then, download the SQLite database. Finally, connect with the database and name it “mammals”.
# create raw data folder
dir.create("data_raw",
showWarnings = FALSE)
# download sqlite database
download.file(url = "https://ndownloader.figshare.com/files/2292171",
destfile = "data_raw/portal_mammals.sqlite",
mode = "wb")
# access the database and name it "mammals"
mammals <- DBI::dbConnect(
RSQLite::SQLite(),
"data_raw/portal_mammals.sqlite")
Hold up. What is going on in that last line of code?
dbConnect
is a function from the DBI
package. We’re using it to send R commands to our database and can interface with lots of databases regardless of the management system (e.g. MySQL, PostgreSQL, BigQuery, etc.)
The SQLite
function from the RSQLite
package lets us R users interface with SQLite databases specifically. We are telling R to connect to the portal_mammals.sqlite
database.
First, let’s see what our data looks like.
src_dbi(mammals)
## src: sqlite 3.39.1 [C:\Users\sbrei\Documents\R_Projects\Collabs\ESA_CSEE_2022\SQL_tutorial\SQL_intro\data_raw\portal_mammals.sqlite]
## tbls: plots, species, surveys
We see that this database contains three tables: plots, species, and surveys. Let’s take a closer look at the surveys table.
surveys <- tbl(mammals, "surveys")
print(surveys)
## # Source: table<surveys> [?? x 9]
## # Database: sqlite 3.39.1 [C:\Users\sbrei\Documents\R_Projects\Collabs\ESA_CSEE_2022\SQL_tutorial\SQL_intro\data_raw\portal_mammals.sqlite]
## record_id month day year plot_id species_id sex hindfoot_length weight
## <int> <int> <int> <int> <int> <chr> <chr> <int> <int>
## 1 1 7 16 1977 2 NL M 32 NA
## 2 2 7 16 1977 3 NL M 33 NA
## 3 3 7 16 1977 2 DM F 37 NA
## 4 4 7 16 1977 7 DM M 36 NA
## 5 5 7 16 1977 3 DM M 35 NA
## 6 6 7 16 1977 1 PF M 14 NA
## 7 7 7 16 1977 2 PE F NA NA
## 8 8 7 16 1977 1 DM M 37 NA
## 9 9 7 16 1977 1 DM F 34 NA
## 10 10 7 16 1977 6 PF F 20 NA
## # ... with more rows
## # i Use `print(n = ...)` to see more rows
The first line of the output imparts an important lesson: R tries to be as lazy as possible when interfacing with databases. For instance, this simple command does not tell us the dimensions of the surveys table. The length is unknown (“??”) but the width = 9 columns. Unless we ask it for the length specifically, it will not tell us because R did not actually view the entire table when fetching the first 10 rows for us.
The table itself looks similar to what we’re used to manipulating in R as data frames. There are at least 10 rows and several columns, including “record_id”, “month”, and “day”. This table contains information about mammals observed in different plots over several years.
Let’s start wrangling the data and ask: how many males were observed in 1978 in plot 2?
Challenge: convert this question into a dplyr command. (To see the answer, press the “CODE” button on the right side of the page).
surveys %>%
filter(sex == "M",
year == "1978",
plot_id == "2") %>%
count() # there are 52 individuals.
## # Source: SQL [1 x 1]
## # Database: sqlite 3.39.1 [C:\Users\sbrei\Documents\R_Projects\Collabs\ESA_CSEE_2022\SQL_tutorial\SQL_intro\data_raw\portal_mammals.sqlite]
## n
## <int>
## 1 52
As you can see, querying the surveys table with a question like this isn’t really any different than working with an R data frame!
If you want to save a subset of a table as a new object, then take note of a few differences when working with SQL databases.
Specifically, we need to add one more line of code to tell our lazy bones of an R interface to finish the command.
Example: ask R to save a subset of the surveys table that 1) includes data from 1998-2000 and 2) counts up the number of species seen per year.
annual_species <- surveys %>%
filter(year == "1998" |
year == "1999" |
year == "2000") %>%
group_by(year, species_id) %>%
summarise(n = n()) %>%
collect() # <- don't forget this command!
## `summarise()` has grouped output by "year". You can override using the
## `.groups` argument.
Specifically, the collect()
command is essential for telling R that we’re done creating a query and are ready to implement it.
Notice the change in data structure if we do not include the collect()
command:
annual_species_notcollected <- surveys %>%
filter(year == "1998" |
year == "1999" |
year == "2000") %>%
group_by(year, species_id) %>%
summarise(n = n())
class(annual_species_notcollected)
## [1] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql"
## [4] "tbl_lazy" "tbl"
class(annual_species)
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
annual_species_notcollected
is not a data frame.
annual_species
is a data frame.
A simple way to translate your dplyr
syntax to SQL syntax is to pipe in the show_query()
function.
Translated R –> SQL syntax
annual_species_notcollected %>%
show_query()
## <SQL>
## SELECT `year`, `species_id`, COUNT(*) AS `n`
## FROM `surveys`
## WHERE (`year` = '1998' OR `year` = '1999' OR `year` = '2000')
## GROUP BY `year`, `species_id`
Original R syntax
annual_species_notcollected <- surveys %>%
filter(year == "1998", "1999", "2000") %>%
group_by(year, species_id) %>%
summarise(n = n())
Notice: How does the original R code compare to the translated SQL syntax?
This tool is very handy for learning SQL, but practitioners recommend learning at least basic SQL syntax when getting into data science. Another reason why? So that you can read, in addition to write, SQL.
Translate this SQL query into R code. Click here for a straightforward SQL cheat sheet and a dplyr cheat sheet to help get you started. Remember to unhide the correct translation by clicking the “CODE” button below the SQL syntax section.
SQL syntax
SELECT *
FROM (SELECT record_id
, hindfoot_length
FROM surveys
)
WHERE (hindfoot_length
> 20.0)
R syntax translated:
surveys %>%
select(record_id, hindfoot_length) %>%
filter(hindfoot_length > 20)
Translate this dplyr
code into an SQL query by filling in the blanks. Remember to unhide the correct translation by clicking the “CODE” button below the SQL syntax section.
R syntax
surveys %>%
filter(year > 1977 & year < 2000) %>%
filter(hindfoot_length == 30) %>%
arrange(desc(species_id))
SQL syntax
SELECT *
FROM (SELECT *
FROM ____
WHERE (year
> 1977.0 ____ year
< 2000.0))
____ (hindfoot_length
= 30.0)
ORDER BY species_id
____
HINT: For a word bank of possible SQL terms, click the “CODE” button directly below ↓
AND
DESC
`surveys`
WHERE
SQL syntax translated:
Underscrores around words (e.g. __example__) indicate correct fill-in-the-blank answers.
SELECT *
FROM (SELECT *
FROM __`surveys`__
WHERE (`year` > 1977.0 __AND__ `year` < 2000.0))
__WHERE__ (`hindfoot_length` = 30.0)
ORDER BY `species_id` __DESC__
Here are some resources to help you continue on your SQL-learning path.
dbplyr
package’s documentation to learn more about interfacing with SQL databases using R.