The post below is a collection of useful (but somewhat random) thoughts about R and SQL. It should serve more as a starting point than an exhaustive discussion of the presented topics. Note that it mostly describes SQLite - many concepts will be similar in other databases, but they might not be identical. You’ve been warned!
R, SQL, dbplyr and ORM
When we are talking about accessing SQL from a programming language, it is good to know a little bit about the ORM concept:
- https://en.wikipedia.org/wiki/Object%E2%80%93relational_mapping
- https://www.sqlalchemy.org/
- https://kurapov.ee/eng/tech/ORM-is-harfmul-pattern/
However, when we are in R-land and we talk about dbplyr, we can think about it more as a transpiler than an ORM. It takes valid dplyr code and translates it into SQL. We do not need to perform any object mapping or similar activity. Check the code below to see dbplyr in action - note that the code is nearly identical to standard dplyr. The only differences are that references to tables need to be created using the tbl function and that, at the end, you need to call collect to get the data into an R tibble.
library(RSQLite)
library(dbplyr)
library(DBI)
library(dplyr)
library(sqldf)
con <- dbplyr::nycflights13_sqlite()
res <- tbl(con, "flights") %>% group_by(origin, dest, year) %>% count() %>% ungroup() %>%
inner_join(tbl(con, "airports") %>% select(faa, origin_name = name), by = c("origin" = "faa")) %>%
inner_join(tbl(con, "airports") %>% select(faa, dest_name = name), by = c("dest" = "faa")) %>%
select(origin_name, dest_name, year, n) %>% arrange(desc(n))
final_table <- head(res) %>% collect()
final_table
## # A tibble: 6 × 4
## origin_name dest_name year n
## <chr> <chr> <int> <int>
## 1 John F Kennedy Intl Los Angeles Intl 2013 11262
## 2 La Guardia Hartsfield Jackson Atlanta Intl 2013 10263
## 3 La Guardia Chicago Ohare Intl 2013 8857
## 4 John F Kennedy Intl San Francisco Intl 2013 8204
## 5 La Guardia Charlotte Douglas Intl 2013 6168
## 6 Newark Liberty Intl Chicago Ohare Intl 2013 6100
res %>% show_query()
## <SQL>
## SELECT *
## FROM (
## SELECT
## `airports...2`.`name` AS `origin_name`,
## `airports...3`.`name` AS `dest_name`,
## `year`,
## `n`
## FROM (
## SELECT `origin`, `dest`, `year`, COUNT(*) AS `n`
## FROM `flights`
## GROUP BY `origin`, `dest`, `year`
## ) AS `...1`
## INNER JOIN `airports` AS `airports...2`
## ON (`...1`.`origin` = `airports...2`.`faa`)
## INNER JOIN `airports` AS `airports...3`
## ON (`...1`.`dest` = `airports...3`.`faa`)
## )
## ORDER BY `n` DESC
Important! dbplyr cannot translate everything. E.g. for the sqlite back-end it cannot translate the median function (!!!).
# It won't work!
# tbl(con, "flights") %>% mutate(median(dep_delay))
# Caused by error in `median()`:
#! Window function `median()` is not supported by this database.
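If you really need a median from a SQLite-backed table, one workaround (a minimal sketch - it assumes the column fits comfortably in memory) is to pull just the needed column into R and compute the median locally:
# Pull a single column into R and compute the median locally,
# since dbplyr cannot translate median() for SQLite.
tbl(con, "flights") %>% select(dep_delay) %>% collect() %>%
  summarise(median_dep_delay = median(dep_delay, na.rm = TRUE))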
For more information please refer to https://dbplyr.tidyverse.org/articles/translation-function.html. To check how something is translated, you can use the translate_sql function. I also strongly advise checking the generated queries from time to time using show_query, just to make sure that dbplyr really generates what you want (besides, it is a very good way of learning SQL itself - if dbplyr emits some SQL code that you don’t understand, check the documentation).
translate_sql(lag(x))
## <SQL> LAG(`x`, 1, NULL) OVER ()
Inserting and updating data from R
There are at least a few ways to insert data into a SQL database:
- DBI::dbWriteTable
- dplyr::copy_to
- a normal insert query using DBI::dbExecute
- dplyr’s rows_insert, rows_append, rows_update, rows_patch, rows_upsert, rows_delete - here, you really need to read the documentation! (A minimal sketch follows at the end of this section.)
db_name <- tempfile()
con <- DBI::dbConnect(RSQLite::SQLite(), db_name)
dbWriteTable(con, name = "iris_table", iris)
df <- copy_to(con, iris)
DBI::dbListTables(con)
## [1] "iris" "iris_table" "sqlite_stat1" "sqlite_stat4"
DBI::dbDisconnect(con)
con <- DBI::dbConnect(RSQLite::SQLite(), db_name)
DBI::dbListTables(con) # note that `iris` table is gone. To make it persistent use temporary = FALSE
## [1] "iris_table"
df <- copy_to(con, iris, temporary = FALSE)
DBI::dbDisconnect(con)
con <- DBI::dbConnect(RSQLite::SQLite(), db_name)
DBI::dbListTables(con)
## [1] "iris" "iris_table" "sqlite_stat1" "sqlite_stat4"
# To create new tables or views you can use dbExecute.
# If you don't know what a SQL view is - google it or read further;
# I am tackling that topic in the next section.
sql_view <- "
CREATE VIEW iris_group_by AS
SELECT species, AVG(`Sepal.Length`) AS SEPAL_LEN_AVG FROM iris GROUP BY species
"
DBI::dbExecute(con, sql_view)
## [1] 0
DBI::dbListTables(con)
## [1] "iris" "iris_group_by" "iris_table" "sqlite_stat1"
## [5] "sqlite_stat4"
tbl(con, "iris_group_by")
## # Source: table<iris_group_by> [3 x 2]
## # Database: sqlite 3.40.1 [/tmp/RtmpZrU1d6/file23f83e4b00cac3]
## Species SEPAL_LEN_AVG
## <chr> <dbl>
## 1 setosa 5.01
## 2 versicolor 5.94
## 3 virginica 6.59
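The rows_* family deserves a quick illustration. Below is a minimal sketch of rows_upsert (assumptions: a fresh in-memory connection con2 and a primary-key column, since SQLite's upsert requires a uniqueness constraint on the conflict target):
con2 <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbExecute(con2, "CREATE TABLE targets (id INTEGER PRIMARY KEY, value TEXT)")
DBI::dbExecute(con2, "INSERT INTO targets VALUES (1, 'a'), (2, 'b')")
new_rows <- tibble(id = 2:3, value = c("B", "c"))
# copy = TRUE first copies `new_rows` into the database;
# in_place = TRUE modifies the table instead of returning a lazy query.
rows_upsert(tbl(con2, "targets"), new_rows, by = "id", copy = TRUE, in_place = TRUE)
tbl(con2, "targets") # id = 2 updated, id = 3 inserted
DBI::dbDisconnect(con2)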
Some useful SQL terminology
Views
Views are a very handy concept for simplifying your life. If you need to use a specific sub-query over and over again, you can just create a “virtual” table using a view. After that, you can refer to it just like you refer to a standard table.
For more information please refer to Wikipedia - it also lists different use cases for views (e.g. access restriction):
- https://en.wikipedia.org/wiki/Materialized_view
- https://en.wikipedia.org/wiki/View_(SQL)
db_name <- tempfile()
con <- DBI::dbConnect(RSQLite::SQLite(), db_name)
letter_column <- sample(LETTERS, replace = TRUE, size = 1e6)
value <- rnorm(n = length(letter_column), mean = as.integer(charToRaw(paste(letter_column, collapse = ""))))
letter_table <- tibble(letter = letter_column, value = value)
copy_to(con, letter_table, temporary = FALSE)
sql_view <- "
CREATE VIEW
letter_avg AS
SELECT
letter, AVG(value) AS AVG_VALUE
FROM
letter_table
GROUP BY
letter
"
DBI::dbExecute(con, sql_view)
## [1] 0
DBI::dbListTables(con)
## [1] "letter_avg" "letter_table" "sqlite_stat1" "sqlite_stat4"
tbl(con, "letter_avg")
## # Source: table<letter_avg> [?? x 2]
## # Database: sqlite 3.40.1 [/tmp/RtmpZrU1d6/file23f83eca2a6e7]
## letter AVG_VALUE
## <chr> <dbl>
## 1 A 65.0
## 2 B 66.0
## 3 C 67.0
## 4 D 68.0
## 5 E 69.0
## 6 F 70.0
## 7 G 71.0
## 8 H 72.0
## 9 I 73.0
## 10 J 74.0
## # … with more rows
# Just for fun - I added a new table with an index to check how much time can be saved by using one in this example.
# !!! Indexes are something that you need to know if you work with SQL !!!
# In many cases when query performance is waaay too slow, the problem is somehow related to indexes (or the lack of them).
copy_to(con, letter_table, temporary = FALSE, name = "letter_table_with_index")
sql_view <- "
CREATE VIEW letter_avg_with_index AS SELECT letter, AVG(value) AS AVG_VALUE
FROM letter_table_with_index GROUP BY letter
"
DBI::dbExecute(con, sql_view)
## [1] 0
DBI::dbExecute(con, "CREATE INDEX IF NOT EXISTS letters_index ON letter_table_with_index (letter)")
## [1] 0
microbenchmark::microbenchmark(times = 5,
tbl(con, "letter_avg") %>% collect(),
tbl(con, "letter_table_with_index") %>% collect(),
letter_table %>% group_by(letter) %>% summarise(avg = mean(value))
)
## Unit: milliseconds
## expr min
## tbl(con, "letter_avg") %>% collect() 463.28748
## tbl(con, "letter_table_with_index") %>% collect() 405.55080
## letter_table %>% group_by(letter) %>% summarise(avg = mean(value)) 27.88042
## lq mean median uq max neval cld
## 464.53191 482.88718 489.75732 492.96234 503.89684 5 c
## 410.16418 422.87106 425.01228 429.75398 443.87406 5 b
## 30.49285 34.06109 34.11852 37.92238 39.89127 5 a
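To verify that the index is actually used, you can ask SQLite for the query plan (more on this in the query explainer section below); the plan should mention letters_index:
# A sketch - the output should contain something like
# "SCAN letter_table_with_index USING INDEX letters_index".
DBI::dbGetQuery(con, "EXPLAIN QUERY PLAN SELECT letter, AVG(value) FROM letter_table_with_index GROUP BY letter")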
CTE
If you need to write some SQL by hand, consider using CTEs rather than nested expressions. Compare the code generated by dplyr with the code written using a CTE - for me, the CTE version is much easier to digest: it reads from top to bottom, while dplyr’s query needs to be read from the inside out.
# install.packages("nycflights13")
con <- dbplyr::nycflights13_sqlite()
tbl(con, "airports")
## # Source: table<airports> [?? x 8]
## # Database: sqlite 3.40.1 [/tmp/RtmpZrU1d6/nycflights13.sqlite]
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
## 8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
## 9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
## # … with more rows
airlines <- tbl(con, "airlines") %>% filter(name %LIKE% "%America%")
flights <- tbl(con, "flights") %>% select(carrier, tailnum) %>% distinct()
result <- inner_join(airlines, flights, by = c("carrier")) %>% group_by(carrier, name) %>% count()
result %>% show_query()
## <SQL>
## SELECT `carrier`, `name`, COUNT(*) AS `n`
## FROM (
## SELECT `LHS`.*, `tailnum`
## FROM (
## SELECT *
## FROM `airlines`
## WHERE (`name` LIKE '%America%')
## ) AS `LHS`
## INNER JOIN (
## SELECT DISTINCT `carrier`, `tailnum`
## FROM `flights`
## ) AS `RHS`
## ON (`LHS`.`carrier` = `RHS`.`carrier`)
## )
## GROUP BY `carrier`, `name`
cte_query <- "
WITH america_airlines AS (
SELECT *
FROM airlines
WHERE (name LIKE '%America%')
),
distinct_flights AS (
SELECT DISTINCT carrier, tailnum FROM flights
),
joined_data AS (
SELECT * FROM america_airlines a INNER JOIN distinct_flights f
ON a.carrier = f.carrier
)
SELECT carrier, name, COUNT(*) AS `n`
FROM joined_data
GROUP BY carrier, name
"
DBI::dbGetQuery(con, cte_query)
## carrier name n
## 1 AA American Airlines Inc. 601
## 2 VX Virgin America 53
Window functions
Window functions in SQL are a very broad topic. They can be used to calculate cumulative sums or running averages, or to add summary values alongside the original data (similar to iris %>% group_by(Species) %>% mutate(Sep_len_avg = mean(Sepal.Length))). Without knowing them, some tasks might become unnecessarily complicated - e.g. adding a column with group averages would require creating a summary table and joining it back to the original table.
con <- dbplyr::nycflights13_sqlite()
# Calculate the ratio of the number of flights on a given day to the number of flights in that month.
# A window function is used to create the `monthly` column - see the generated SQL.
tbl_res <- tbl(con, "flights") %>% group_by(year, month, day) %>% count(name = "daily") %>%
group_by(year, month) %>% mutate(monthly = sum(daily)) %>% mutate(daily_ratio = daily/as.numeric(monthly))
tbl_res %>% head %>% collect()
## Warning: Missing values are always removed in SQL aggregation functions.
## Use `na.rm = TRUE` to silence this warning
## This warning is displayed once every 8 hours.
## # A tibble: 6 × 6
## # Groups: year, month [1]
## year month day daily monthly daily_ratio
## <int> <int> <int> <int> <int> <dbl>
## 1 2013 1 1 842 27004 0.0312
## 2 2013 1 2 943 27004 0.0349
## 3 2013 1 3 914 27004 0.0338
## 4 2013 1 4 915 27004 0.0339
## 5 2013 1 5 720 27004 0.0267
## 6 2013 1 6 832 27004 0.0308
tbl_res %>% show_query()
## <SQL>
## SELECT *, `daily` / CAST(`monthly` AS REAL) AS `daily_ratio`
## FROM (
## SELECT *, SUM(`daily`) OVER (PARTITION BY `year`, `month`) AS `monthly`
## FROM (
## SELECT `year`, `month`, `day`, COUNT(*) AS `daily`
## FROM `flights`
## GROUP BY `year`, `month`, `day`
## )
## )
# Adding a column with the mean Sepal.Width to the original data.
# sqldf is a cool package that allows you to write SQL on R tables.
# In the background it copies the data into SQLite, executes the query,
# and copies the result back into R. I used it more in the days before dplyr
# and dbplyr, but I still think it's a cool package :)
iris2 <- iris[c(1:3,51:53,101:103),]
sqldf("SELECT *, AVG(\"Sepal.Width\") OVER (PARTITION BY \"Species\") from iris2")
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 7.0 3.2 4.7 1.4 versicolor
## 5 6.4 3.2 4.5 1.5 versicolor
## 6 6.9 3.1 4.9 1.5 versicolor
## 7 6.3 3.3 6.0 2.5 virginica
## 8 5.8 2.7 5.1 1.9 virginica
## 9 7.1 3.0 5.9 2.1 virginica
## AVG("Sepal.Width") OVER (PARTITION BY "Species")
## 1 3.233333
## 2 3.233333
## 3 3.233333
## 4 3.166667
## 5 3.166667
## 6 3.166667
## 7 3.000000
## 8 3.000000
## 9 3.000000
sqldf(
"SELECT
species,
\"Sepal.Width\",
SUM(\"Sepal.Width\") OVER (PARTITION BY species ORDER BY species ROWS UNBOUNDED PRECEDING) AS cusum,
row_number() OVER (PARTITION BY species)
FROM
iris2")
## Species Sepal.Width cusum row_number() OVER (PARTITION BY species)
## 1 setosa 3.5 3.5 1
## 2 setosa 3.0 6.5 2
## 3 setosa 3.2 9.7 3
## 4 versicolor 3.2 3.2 1
## 5 versicolor 3.2 6.4 2
## 6 versicolor 3.1 9.5 3
## 7 virginica 3.3 3.3 1
## 8 virginica 2.7 6.0 2
## 9 virginica 3.0 9.0 3
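The same cumulative sum can be generated from the dplyr side. A minimal sketch (it copies the iris2 subset from above into an in-memory SQLite database; window_order sets the ORDER BY part of the OVER() clause):
con_w <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con_w, "iris2", iris2)
# cumsum() is translated into SUM(...) OVER (... ROWS UNBOUNDED PRECEDING)
tbl(con_w, "iris2") %>%
  group_by(Species) %>%
  window_order(Sepal.Length) %>%
  mutate(cusum = cumsum(Sepal.Width)) %>%
  show_query()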
Recursive CTE
Recursive CTEs are one of those things that come in handy so rarely that by the time you have to use them, you will have forgotten the syntax. From my experience, whenever you hear “We need to use a graph database for this!”, think “recursive CTE”. It might not be the “perfect” tool for the job, but it might be performant enough.
db <- dbConnect(RSQLite::SQLite(), tempfile())
sql <- "
CREATE TABLE org(
name TEXT PRIMARY KEY,
boss TEXT REFERENCES org
) WITHOUT ROWID;
"
dbExecute(db, sql)
## [1] 0
values <- "INSERT INTO org VALUES('Alice',NULL);
INSERT INTO org VALUES('Bob','Alice');
INSERT INTO org VALUES('Cindy','Alice');
INSERT INTO org VALUES('Dave','Bob');
INSERT INTO org VALUES('Emma','Bob');
INSERT INTO org VALUES('Fred','Cindy');
INSERT INTO org VALUES('Gail','Cindy');
INSERT INTO org VALUES('Lindy',NULL);
INSERT INTO org VALUES('John','Lindy');
INSERT INTO org VALUES('Luke','John');
INSERT INTO org VALUES('Megan','Lindy');"
values <- stringr::str_split(values, pattern = "\n", simplify = TRUE) %>% as.character()
purrr::map(values, function(x) dbExecute(db, x)) %>% invisible()
dbGetQuery(conn = db, "
WITH RECURSIVE
under_alice(name,level) AS (
VALUES('Alice',0)
UNION ALL
SELECT org.name, under_alice.level+1
FROM org JOIN under_alice ON org.boss=under_alice.name
ORDER BY 2 DESC
)
SELECT substr('..........',1,level*3) || name AS structure FROM under_alice;
")
## structure
## 1 Alice
## 2 ...Bob
## 3 ......Dave
## 4 ......Emma
## 5 ...Cindy
## 6 ......Fred
## 7 ......Gail
dbGetQuery(conn = db, "
WITH RECURSIVE
under_alice(name,level) AS (
SELECT org.name, 0 as level from org where org.boss IS NULL
UNION ALL
SELECT org.name, under_alice.level+1
FROM org JOIN under_alice ON org.boss=under_alice.name
ORDER BY 2 DESC
)
SELECT name || substr('..........',1,level*3) AS structure FROM under_alice;
")
## structure
## 1 Alice
## 2 Bob...
## 3 Dave......
## 4 Emma......
## 5 Cindy...
## 6 Fred......
## 7 Gail......
## 8 Lindy
## 9 John...
## 10 Luke......
## 11 Megan...
Query explainer and query optimizer
The query explainer and the query optimizer are two additional topics that are worthwhile to be aware of. In a nutshell, before executing a query the SQL engine prepares an execution plan (e.g. which indexes are going to be used, in which order tables are going to be joined, etc.). If the query is too slow, you can check the prepared plan to see what’s going on - e.g. you might notice that the SQL engine performs a full table scan (reads all rows) rather than using an index.
As always, the query plan output is database specific, so the knowledge gathered for one engine might not transfer to another.
A good place to start is, as always, the SQLite site:
- https://www.sqlite.org/eqp.html - query explainer.
- https://www.sqlite.org/optoverview.html - query optimizer.
con <- dbplyr::nycflights13_sqlite()
tbl_res <- tbl(con, "flights") %>% group_by(year, month, day) %>% count(name = "daily") %>%
group_by(year, month) %>% mutate(monthly = sum(daily)) %>% mutate(daily_ratio = daily/as.numeric(monthly))
query <- capture.output(tbl_res %>% show_query())
dbGetQuery(con, paste("EXPLAIN QUERY PLAN", paste(query[-1], collapse = "\n")))
## id parent notused detail
## 1 2 0 0 CO-ROUTINE (subquery-2)
## 2 5 2 0 CO-ROUTINE (subquery-4)
## 3 7 5 0 CO-ROUTINE (subquery-1)
## 4 13 7 0 SCAN flights USING COVERING INDEX flights_year_month_day
## 5 46 5 0 SCAN (subquery-1)
## 6 59 5 0 USE TEMP B-TREE FOR ORDER BY
## 7 79 2 0 SCAN (subquery-4)
## 8 139 0 0 SCAN (subquery-2)
Row vs columnar databases
There are many different kinds of databases - relational, SQL, noSQL, etc. One possible distinction comes from how the data is organized in a table. In a row-oriented database, all data associated with the same record is stored together, next to each other. This layout makes adding new records very fast. In contrast, in a columnar layout, all data belonging to the same column is stored next to each other. This approach makes the data much more friendly for the processor cache, which makes columnar databases more performant for analytical workloads. Usually, for analytics you should use a columnar database.
The example below compares the execution time of a simple aggregation query between a row-oriented database (SQLite) and a columnar database (duckdb). No surprises - the columnar database is faster.
con <- DBI::dbConnect(RSQLite::SQLite(), tempfile())
letter_column <- sample(LETTERS, replace = TRUE, size = 1e6)
value <- rnorm(n = length(letter_column), mean = as.integer(charToRaw(paste(letter_column, collapse = ""))))
letter_table <- tibble(letter = letter_column, value = value) %>% arrange(letter)
dbWriteTable(con, "letter", letter_table)
system.time(tbl(con, "letter") %>% group_by(letter) %>% summarise(mean(value)) %>% collect())
## user system elapsed
## 0.436 0.025 0.461
library(duckdb)
con_duck <- dbConnect(duckdb(tempfile()))
dbWriteTable(con_duck, "letter", letter_table)
system.time(tbl(con_duck, "letter") %>% group_by(letter) %>% summarise(mean(value)) %>% collect())
## user system elapsed
## 0.131 0.012 0.113
microbenchmark::microbenchmark(times = 10,
tbl(con, "letter") %>% group_by(letter) %>% summarise(mean(value)) %>% collect(),
tbl(con_duck, "letter") %>% group_by(letter) %>% summarise(mean(value)) %>% collect()
)
## Unit: milliseconds
## expr
## tbl(con, "letter") %>% group_by(letter) %>% summarise(mean(value)) %>% collect()
## tbl(con_duck, "letter") %>% group_by(letter) %>% summarise(mean(value)) %>% collect()
## min lq mean median uq max neval cld
## 410.55408 415.80349 423.26792 420.89233 428.5054 444.0424 10 b
## 88.92357 93.88219 99.39049 98.74699 104.6958 109.3912 10 a
DBI::dbExecute(con, "CREATE INDEX IF NOT EXISTS letters_index ON letter (letter)")
## [1] 0
DBI::dbExecute(con_duck, "CREATE INDEX IF NOT EXISTS letters_index ON letter (letter)")
## [1] 0
microbenchmark::microbenchmark(times = 10,
tbl(con, "letter") %>% group_by(letter) %>% summarise(mean(value)) %>% collect(),
tbl(con_duck, "letter") %>% group_by(letter) %>% summarise(mean(value)) %>% collect()
)
## Unit: milliseconds
## expr
## tbl(con, "letter") %>% group_by(letter) %>% summarise(mean(value)) %>% collect()
## tbl(con_duck, "letter") %>% group_by(letter) %>% summarise(mean(value)) %>% collect()
## min lq mean median uq max neval cld
## 204.44771 207.6635 214.65076 211.21876 218.63256 231.55214 10 b
## 86.25933 87.6635 90.16922 89.43304 92.84429 96.74997 10 a
Rendering SQL using R code
When you programmatically create SQL in R, DO NOT USE paste or glue to fill in parameters. Use glue_sql instead. It sanitizes the input, making it safe from different kinds of attacks (SQL injection) that allow an unauthorized person to access or destroy the data in your tables.
important_table <- tibble(id = 1:5, value = LETTERS[1:5])
id <- 1
(sql <- glue::glue("SELECT * FROM important_table WHERE id = '{id}'"))
## SELECT * FROM important_table WHERE id = '1'
sqldf(sql)
## id value
## 1 1 A
id <- "1' OR 1=1 --;"
(sql <- glue::glue("SELECT * FROM important_table WHERE id = '{id}'"))
## SELECT * FROM important_table WHERE id = '1' OR 1=1 --;'
sqldf(sql)
## id value
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
Now let’s see how to destroy a table:
db <- dbConnect(RSQLite::SQLite(), tempfile())
dbWriteTable(db, "important_table", important_table)
tbl(db, "important_table")
## # Source: table<important_table> [5 x 2]
## # Database: sqlite 3.40.1 [/tmp/RtmpZrU1d6/file23f83e3ac1feea]
## id value
## <int> <chr>
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
id <- 1
(sql <- glue::glue("UPDATE important_table SET value = 'XD' WHERE id = '{id}'"))
## UPDATE important_table SET value = 'XD' WHERE id = '1'
dbExecute(db, sql)
## [1] 1
tbl(db, "important_table")
## # Source: table<important_table> [5 x 2]
## # Database: sqlite 3.40.1 [/tmp/RtmpZrU1d6/file23f83e3ac1feea]
## id value
## <int> <chr>
## 1 1 XD
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
# Destroy the whole table!
# https://xkcd.com/327/
id <- "1' OR 1=1 --;"
(sql <- glue::glue("UPDATE important_table SET value = 'XD' WHERE id = '{id}'"))
## UPDATE important_table SET value = 'XD' WHERE id = '1' OR 1=1 --;'
dbExecute(db, sql)
## [1] 5
tbl(db, "important_table")
## # Source: table<important_table> [5 x 2]
## # Database: sqlite 3.40.1 [/tmp/RtmpZrU1d6/file23f83e3ac1feea]
## id value
## <int> <chr>
## 1 1 XD
## 2 2 XD
## 3 3 XD
## 4 4 XD
## 5 5 XD
## Now with glue_sql
db <- dbConnect(RSQLite::SQLite(), tempfile())
dbWriteTable(db, "important_table", important_table)
tbl(db, "important_table")
## # Source: table<important_table> [5 x 2]
## # Database: sqlite 3.40.1 [/tmp/RtmpZrU1d6/file23f83e53a7f619]
## id value
## <int> <chr>
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
id <- "1"
(sql <- glue::glue_sql("UPDATE important_table SET value = 'XD' WHERE id = {id}", .con = db))
## <SQL> UPDATE important_table SET value = 'XD' WHERE id = '1'
dbExecute(db, sql)
## [1] 1
tbl(db, "important_table")
## # Source: table<important_table> [5 x 2]
## # Database: sqlite 3.40.1 [/tmp/RtmpZrU1d6/file23f83e53a7f619]
## id value
## <int> <chr>
## 1 1 XD
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
# Table is safe!
id <- "1' OR 1=1 --;"
(sql <- glue::glue_sql("UPDATE important_table SET value = 'XD' WHERE id = {id}", .con = db))
## <SQL> UPDATE important_table SET value = 'XD' WHERE id = '1'' OR 1=1 --;'
dbExecute(db, sql)
## [1] 0
tbl(db, "important_table")
## # Source: table<important_table> [5 x 2]
## # Database: sqlite 3.40.1 [/tmp/RtmpZrU1d6/file23f83e53a7f619]
## id value
## <int> <chr>
## 1 1 XD
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
The moral is short - use glue_sql or dbplyr to create SQL queries.
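Another safe option is a parameterized query through plain DBI - a quick sketch (reusing the db connection and important_table from above); the driver binds the parameter separately from the SQL text, so the injection attempt is treated as a literal, non-matching id:
id <- "1' OR 1=1 --;"
# The parameter is bound by the driver, never spliced into the SQL string.
dbGetQuery(db, "SELECT * FROM important_table WHERE id = ?", params = list(id))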
Conclusion
The list of topics presented above is not exhaustive, and probably not even in the correct order of importance. But it should be a good start.