The post below is a collection of useful (but somewhat random) thoughts about R and SQL. It should serve more as a starting point than an exhaustive discussion of the presented topics. Note that it mostly describes SQLite - many concepts will be similar in other databases, but they might not be identical. You’ve been warned!
R, SQL, dbplyr and ORM
When we are talking about accessing SQL from a programming language, it is good to know a little bit about the ORM concept:
- https://en.wikipedia.org/wiki/Object%E2%80%93relational_mapping
- https://www.sqlalchemy.org/
- https://kurapov.ee/eng/tech/ORM-is-harfmul-pattern/
However, when we are in R-land and we talk about dbplyr, we can think about it more as a transpiler than an ORM. It takes valid dplyr code and translates it into SQL. We do not need to perform any object mapping or similar activity. Check the code below to see dbplyr in action - note that the code is nearly identical to standard dplyr. The only differences are that references to tables need to be created using the tbl function and that, at the end, you need to call collect to get the data into an R tibble.
library(RSQLite)
library(dbplyr)
library(DBI)
library(dplyr)
library(sqldf)
con <- dbplyr::nycflights13_sqlite()
res <- tbl(con, "flights") %>% group_by(origin, dest, year) %>% count() %>% ungroup() %>%
inner_join(tbl(con, "airports") %>% select(faa, origin_name = name), by = c("origin" = "faa")) %>%
inner_join(tbl(con, "airports") %>% select(faa, dest_name = name), by = c("dest" = "faa")) %>%
select(origin_name, dest_name, year, n) %>% arrange(desc(n))
final_table <- head(res) %>% collect()
final_table
## # A tibble: 6 × 4
## origin_name dest_name year n
## <chr> <chr> <int> <int>
## 1 John F Kennedy Intl Los Angeles Intl 2013 11262
## 2 La Guardia Hartsfield Jackson Atlanta Intl 2013 10263
## 3 La Guardia Chicago Ohare Intl 2013 8857
## 4 John F Kennedy Intl San Francisco Intl 2013 8204
## 5 La Guardia Charlotte Douglas Intl 2013 6168
## 6 Newark Liberty Intl Chicago Ohare Intl 2013 6100
res %>% show_query()
## <SQL>
## SELECT *
## FROM (
## SELECT
## `airports...2`.`name` AS `origin_name`,
## `airports...3`.`name` AS `dest_name`,
## `year`,
## `n`
## FROM (
## SELECT `origin`, `dest`, `year`, COUNT(*) AS `n`
## FROM `flights`
## GROUP BY `origin`, `dest`, `year`
## ) AS `...1`
## INNER JOIN `airports` AS `airports...2`
## ON (`...1`.`origin` = `airports...2`.`faa`)
## INNER JOIN `airports` AS `airports...3`
## ON (`...1`.`dest` = `airports...3`.`faa`)
## )
## ORDER BY `n` DESC
Important! dbplyr cannot translate everything. E.g. for the sqlite back-end it cannot translate the median function (!!!).
# It won't work!
# tbl(con, "flights") %>% mutate(median(dep_delay))
# Caused by error in `median()`:
#! Window function `median()` is not supported by this database.
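If you really need a median from a SQLite-backed table, one workaround (a minimal sketch - it assumes the column fits comfortably in memory) is to pull just the needed column into R and compute the median locally:
# Pull a single column into R and compute the median locally,
# since dbplyr cannot translate median() for SQLite.
tbl(con, "flights") %>% select(dep_delay) %>% collect() %>%
  summarise(median_dep_delay = median(dep_delay, na.rm = TRUE))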
For more information please refer to https://dbplyr.tidyverse.org/articles/translation-function.html. To check how something is translated, you can use the translate_sql function. I also strongly advise checking the generated queries from time to time using show_query, just to make sure that dbplyr really generates what you want (besides, it is a very good way of learning SQL itself - if dbplyr emits some SQL code that you don’t understand, check the documentation).
translate_sql(lag(x))
## <SQL> LAG(`x`, 1, NULL) OVER ()
Inserting and updating data from R
There are at least a few ways to insert data into a SQL database:
- DBI::dbWriteTable
- dplyr::copy_to
- a normal insert query using DBI::dbExecute
- dplyr’s rows_insert, rows_append, rows_update, rows_patch, rows_upsert, rows_delete - here, you really need to read the documentation! (A minimal sketch follows at the end of this section.)
db_name <- tempfile()
con <- DBI::dbConnect(RSQLite::SQLite(), db_name)
dbWriteTable(con, name = "iris_table", iris)
df <- copy_to(con, iris)
DBI::dbListTables(con)
## [1] "iris" "iris_table" "sqlite_stat1" "sqlite_stat4"
DBI::dbDisconnect(con)
con <- DBI::dbConnect(RSQLite::SQLite(), db_name)
DBI::dbListTables(con) # note that `iris` table is gone. To make it persistent use temporary = FALSE
## [1] "iris_table"
df <- copy_to(con, iris, temporary = FALSE)
DBI::dbDisconnect(con)
con <- DBI::dbConnect(RSQLite::SQLite(), db_name)
DBI::dbListTables(con)
## [1] "iris" "iris_table" "sqlite_stat1" "sqlite_stat4"
# To create new tables or views you can use dbExecute.
# If you don't know what a SQL view is - google it or read further;
# I am tackling that topic in the next section.
sql_view <- "
CREATE VIEW iris_group_by AS
SELECT species, AVG(`Sepal.Length`) AS SEPAL_LEN_AVG FROM iris GROUP BY species
"
DBI::dbExecute(con, sql_view)
## [1] 0
DBI::dbListTables(con)
## [1] "iris" "iris_group_by" "iris_table" "sqlite_stat1"
## [5] "sqlite_stat4"
tbl(con, "iris_group_by")
## # Source: table<iris_group_by> [3 x 2]
## # Database: sqlite 3.40.1 [/tmp/RtmpZrU1d6/file23f83e4b00cac3]
## Species SEPAL_LEN_AVG
## <chr> <dbl>
## 1 setosa 5.01
## 2 versicolor 5.94
## 3 virginica 6.59
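The rows_* family deserves a quick illustration. Below is a minimal sketch of rows_upsert (assumptions: a fresh in-memory connection con2 and a primary-key column, since SQLite's upsert requires a uniqueness constraint on the conflict target):
con2 <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbExecute(con2, "CREATE TABLE targets (id INTEGER PRIMARY KEY, value TEXT)")
DBI::dbExecute(con2, "INSERT INTO targets VALUES (1, 'a'), (2, 'b')")
new_rows <- tibble(id = 2:3, value = c("B", "c"))
# copy = TRUE first copies `new_rows` into the database;
# in_place = TRUE modifies the table instead of returning a lazy query.
rows_upsert(tbl(con2, "targets"), new_rows, by = "id", copy = TRUE, in_place = TRUE)
tbl(con2, "targets") # id = 2 updated, id = 3 inserted
DBI::dbDisconnect(con2)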
Some useful SQL terminology
Views
Views are a very handy concept for simplifying your life. If you need to use a specific sub-query over and over again, you can just create a “virtual” table using a view. After that, you can refer to it just like you refer to a standard table.
For more information please refer to Wikipedia - it also lists different use cases for views (e.g. access restriction):
- https://en.wikipedia.org/wiki/Materialized_view
- https://en.wikipedia.org/wiki/View_(SQL)
db_name <- tempfile()
con <- DBI::dbConnect(RSQLite::SQLite(), db_name)
letter_column <- sample(LETTERS, replace = TRUE, size = 1e6)
value <- rnorm(n = length(letter_column), mean = as.integer(charToRaw(paste(letter_column, collapse = ""))))
letter_table <- tibble(letter = letter_column, value = value)
copy_to(con, letter_table, temporary = FALSE)
sql_view <- "
CREATE VIEW
letter_avg AS
SELECT
letter, AVG(value) AS AVG_VALUE
FROM
letter_table
GROUP BY
letter
"
DBI::dbExecute(con, sql_view)
## [1] 0
DBI::dbListTables(con)
## [1] "letter_avg" "letter_table" "sqlite_stat1" "sqlite_stat4"
tbl(con, "letter_avg")
## # Source: table<letter_avg> [?? x 2]
## # Database: sqlite 3.40.1 [/tmp/RtmpZrU1d6/file23f83eca2a6e7]
## letter AVG_VALUE
## <chr> <dbl>
## 1 A 65.0
## 2 B 66.0
## 3 C 67.0
## 4 D 68.0
## 5 E 69.0
## 6 F 70.0
## 7 G 71.0
## 8 H 72.0
## 9 I 73.0
## 10 J 74.0
## # … with more rows
# Just for fun - I added a new table with an index to check how much time can be saved by using one in this example.
# !!! Indexes are something that you need to know if you work with SQL !!!
# In many cases when query performance is waaay too slow, the problem is somehow related to indexes (or the lack of them).
copy_to(con, letter_table, temporary = FALSE, name = "letter_table_with_index")
sql_view <- "
CREATE VIEW letter_avg_with_index AS SELECT letter, AVG(value) AS AVG_VALUE
FROM letter_table_with_index GROUP BY letter
"
DBI::dbExecute(con, sql_view)
## [1] 0
DBI::dbExecute(con, "CREATE INDEX IF NOT EXISTS letters_index ON letter_table_with_index (letter)")
## [1] 0
microbenchmark::microbenchmark(times = 5,
tbl(con, "letter_avg") %>% collect(),
tbl(con, "letter_table_with_index") %>% collect(),
letter_table %>% group_by(letter) %>% summarise(avg = mean(value))
)
## Unit: milliseconds
## expr min
## tbl(con, "letter_avg") %>% collect() 463.28748
## tbl(con, "letter_table_with_index") %>% collect() 405.55080
## letter_table %>% group_by(letter) %>% summarise(avg = mean(value)) 27.88042
## lq mean median uq max neval cld
## 464.53191 482.88718 489.75732 492.96234 503.89684 5 c
## 410.16418 422.87106 425.01228 429.75398 443.87406 5 b
## 30.49285 34.06109 34.11852 37.92238 39.89127 5 a
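To verify that the index is actually used, you can ask SQLite for the query plan (more on this in the query explainer section below); the plan should mention letters_index:
# A sketch - the output should contain something like
# "SCAN letter_table_with_index USING INDEX letters_index".
DBI::dbGetQuery(con, "EXPLAIN QUERY PLAN SELECT letter, AVG(value) FROM letter_table_with_index GROUP BY letter")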
CTE
If you need to write some SQL by hand, consider using CTEs rather than nested expressions. Compare the code generated by dplyr with the code written using a CTE - for me, the CTE version is much easier to digest: it reads from top to bottom, while dplyr’s query needs to be read from the inside out.
# install.packages("nycflights13")
con <- dbplyr::nycflights13_sqlite()
tbl(con, "airports")
## # Source: table<airports> [?? x 8]
## # Database: sqlite 3.40.1 [/tmp/RtmpZrU1d6/nycflights13.sqlite]
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
## 8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
## 9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
## # … with more rows
airlines <- tbl(con, "airlines") %>% filter(name %LIKE% "%America%")
flights <- tbl(con, "flights") %>% select(carrier, tailnum) %>% distinct()
result <- inner_join(airlines, flights, by = c("carrier")) %>% group_by(carrier, name) %>% count()
result %>% show_query()
## <SQL>
## SELECT `carrier`, `name`, COUNT(*) AS `n`
## FROM (
## SELECT `LHS`.*, `tailnum`
## FROM (
## SELECT *
## FROM `airlines`
## WHERE (`name` LIKE '%America%')
## ) AS `LHS`
## INNER JOIN (
## SELECT DISTINCT `carrier`, `tailnum`
## FROM `flights`
## ) AS `RHS`
## ON (`LHS`.`carrier` = `RHS`.`carrier`)
## )
## GROUP BY `carrier`, `name`
cte_query <- "
WITH america_airlines AS (
SELECT *
FROM airlines
WHERE (name LIKE '%America%')
),
distinct_flights AS (
SELECT DISTINCT carrier, tailnum FROM flights
),
joined_data AS (
SELECT * FROM america_airlines a INNER JOIN distinct_flights f
ON a.carrier = f.carrier
)
SELECT carrier, name, COUNT(*) AS `n`
FROM joined_data
GROUP BY carrier, name
"
DBI::dbGetQuery(con, cte_query)
## carrier name n
## 1 AA American Airlines Inc. 601
## 2 VX Virgin America 53
Window functions
Window functions in SQL are a very broad topic. They can be used to calculate cumulative sums or running averages, or to add summary values alongside the original data (similar to iris %>% group_by(Species) %>% mutate(Sep_len_avg = mean(Sepal.Length))). Without knowing them, some tasks might become unnecessarily complicated - e.g. adding a column with group averages would require creating a summary table and joining it back to the original table.
con <- dbplyr::nycflights13_sqlite()
# Calculate the ratio of the number of flights on a given day to the number of flights in that month.
# A window function is used to create the `monthly` column - see the generated SQL.
tbl_res <- tbl(con, "flights") %>% group_by(year, month, day) %>% count(name = "daily") %>%
group_by(year, month) %>% mutate(monthly = sum(daily)) %>% mutate(daily_ratio = daily/as.numeric(monthly))
tbl_res %>% head %>% collect()
## Warning: Missing values are always removed in SQL aggregation functions.
## Use `na.rm = TRUE` to silence this warning
## This warning is displayed once every 8 hours.
## # A tibble: 6 × 6
## # Groups: year, month [1]
## year month day daily monthly daily_ratio
## <int> <int> <int> <int> <int> <dbl>
## 1 2013 1 1 842 27004 0.0312
## 2 2013 1 2 943 27004 0.0349
## 3 2013 1 3 914 27004 0.0338
## 4 2013 1 4 915 27004 0.0339
## 5 2013 1 5 720 27004 0.0267
## 6 2013 1 6 832 27004 0.0308
tbl_res %>% show_query()
## <SQL>
## SELECT *, `daily` / CAST(`monthly` AS REAL) AS `daily_ratio`
## FROM (
## SELECT *, SUM(`daily`) OVER (PARTITION BY `year`, `month`) AS `monthly`
## FROM (
## SELECT `year`, `month`, `day`, COUNT(*) AS `daily`
## FROM `flights`
## GROUP BY `year`, `month`, `day`
## )
## )
# Adding a column with the mean Sepal.Width to the original data.
# sqldf is a cool package that allows you to write SQL on R tables.
# In the background it copies the data into SQLite, executes the query,
# and copies the result back into R. I used it more in the days before dplyr
# and dbplyr, but I still think it's a cool package :)
iris2 <- iris[c(1:3,51:53,101:103),]
sqldf("SELECT *, AVG(\"Sepal.Width\") OVER (PARTITION BY \"Species\") from iris2")
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 7.0 3.2 4.7 1.4 versicolor
## 5 6.4 3.2 4.5 1.5 versicolor
## 6 6.9 3.1 4.9 1.5 versicolor
## 7 6.3 3.3 6.0 2.5 virginica
## 8 5.8 2.7 5.1 1.9 virginica
## 9 7.1 3.0 5.9 2.1 virginica
## AVG("Sepal.Width") OVER (PARTITION BY "Species")
## 1 3.233333
## 2 3.233333
## 3 3.233333
## 4 3.166667
## 5 3.166667
## 6 3.166667
## 7 3.000000
## 8 3.000000
## 9 3.000000
sqldf(
"SELECT
species,
\"Sepal.Width\",
SUM(\"Sepal.Width\") OVER (PARTITION BY species ORDER BY species ROWS UNBOUNDED PRECEDING) AS cusum,
row_number() OVER (PARTITION BY species)
FROM
iris2")
## Species Sepal.Width cusum row_number() OVER (PARTITION BY species)
## 1 setosa 3.5 3.5 1
## 2 setosa 3.0 6.5 2
## 3 setosa 3.2 9.7 3
## 4 versicolor 3.2 3.2 1
## 5 versicolor 3.2 6.4 2
## 6 versicolor 3.1 9.5 3
## 7 virginica 3.3 3.3 1
## 8 virginica 2.7 6.0 2
## 9 virginica 3.0 9.0 3
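The same cumulative sum can be generated from the dplyr side. A minimal sketch (it copies the iris2 subset from above into an in-memory SQLite database; window_order sets the ORDER BY part of the OVER() clause):
con_w <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con_w, "iris2", iris2)
# cumsum() is translated into SUM(...) OVER (... ROWS UNBOUNDED PRECEDING)
tbl(con_w, "iris2") %>%
  group_by(Species) %>%
  window_order(Sepal.Length) %>%
  mutate(cusum = cumsum(Sepal.Width)) %>%
  show_query()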
Recursive CTE
Recursive CTEs are one of those things that come in handy so rarely that by the time you have to use them, you will have forgotten the syntax. From my experience, whenever you hear “We need to use a graph database for this!”, think “recursive CTE”. It might not be the “perfect” tool for the job, but it might be performant enough.
db <- dbConnect(RSQLite::SQLite(), tempfile())
sql <- "
CREATE TABLE org(
name TEXT PRIMARY KEY,
boss TEXT REFERENCES org
) WITHOUT ROWID;
"
dbExecute(db, sql)
## [1] 0
values <- "INSERT INTO org VALUES('Alice',NULL);
INSERT INTO org VALUES('Bob','Alice');
INSERT INTO org VALUES('Cindy','Alice');
INSERT INTO org VALUES('Dave','Bob');
INSERT INTO org VALUES('Emma','Bob');
INSERT INTO org VALUES('Fred','Cindy');
INSERT INTO org VALUES('Gail','Cindy');
INSERT INTO org VALUES('Lindy',NULL);
INSERT INTO org VALUES('John','Lindy');
INSERT INTO org VALUES('Luke','John');
INSERT INTO org VALUES('Megan','Lindy');"
values <- stringr::str_split(values, pattern = "\n", simplify = TRUE) %>% as.character()
purrr::map(values, function(x) dbExecute(db, x)) %>% invisible()
dbGetQuery(conn = db, "
WITH RECURSIVE
under_alice(name,level) AS (
VALUES('Alice',0)
UNION ALL
SELECT org.name, under_alice.level+1
FROM org JOIN under_alice ON org.boss=under_alice.name
ORDER BY 2 DESC
)
SELECT substr('..........',1,level*3) || name AS structure FROM under_alice;
")
## structure
## 1 Alice
## 2 ...Bob
## 3 ......Dave
## 4 ......Emma
## 5 ...Cindy
## 6 ......Fred
## 7 ......Gail
dbGetQuery(conn = db, "
WITH RECURSIVE
under_alice(name,level) AS (
SELECT org.name, 0 as level from org where org.boss IS NULL
UNION ALL
SELECT org.name, under_alice.level+1
FROM org JOIN under_alice ON org.boss=under_alice.name
ORDER BY 2 DESC
)
SELECT name || substr('..........',1,level*3) AS structure FROM under_alice;
")
## structure
## 1 Alice
## 2 Bob...
## 3 Dave......
## 4 Emma......
## 5 Cindy...
## 6 Fred......
## 7 Gail......
## 8 Lindy
## 9 John...
## 10 Luke......
## 11 Megan...
Query explainer and query optimizer
The query explainer and the query optimizer are two additional topics that are worthwhile to be aware of. In a nutshell, before executing a query the SQL engine prepares an execution plan (e.g. which indexes are going to be used, in which order tables are going to be joined, etc.). If the query is too slow, you can check the prepared plan to see what’s going on - e.g. you might notice that the SQL engine performs a full table scan (reads all rows) rather than using an index.
As always, the query plan output is database specific, so the knowledge gathered for one engine might not transfer to another.
A good place to start is, as always, the SQLite site:
- https://www.sqlite.org/eqp.html - query explainer.
- https://www.sqlite.org/optoverview.html - query optimizer.
con <- dbplyr::nycflights13_sqlite()
tbl_res <- tbl(con, "flights") %>% group_by(year, month, day) %>% count(name = "daily") %>%
group_by(year, month) %>% mutate(monthly = sum(daily)) %>% mutate(daily_ratio = daily/as.numeric(monthly))
query <- capture.output(tbl_res %>% show_query())
dbGetQuery(con, paste("EXPLAIN QUERY PLAN", paste(query[-1], collapse = "\n")))
## id parent notused detail
## 1 2 0 0 CO-ROUTINE (subquery-2)
## 2 5 2 0 CO-ROUTINE (subquery-4)
## 3 7 5 0 CO-ROUTINE (subquery-1)
## 4 13 7 0 SCAN flights USING COVERING INDEX flights_year_month_day
## 5 46 5 0 SCAN (subquery-1)
## 6 59 5 0 USE TEMP B-TREE FOR ORDER BY
## 7 79 2 0 SCAN (subquery-4)
## 8 139 0 0 SCAN (subquery-2)
Row vs columnar databases
There are many different kinds of databases - relational, SQL, noSQL, etc. One possible distinction comes from how the data is organized in a table. In a row-oriented database, all data associated with the same record is stored together, next to each other. This layout makes adding new records very fast. In contrast, in a columnar layout, all data belonging to the same column is stored next to each other. This approach makes the data much more friendly for the processor cache, which makes columnar databases more performant for analytical workloads. Usually, for analytics you should use a columnar database.
The example below compares the execution time of a simple aggregation query between a row-oriented database (SQLite) and a columnar database (duckdb). No surprises - the columnar database is faster.
con <- DBI::dbConnect(RSQLite::SQLite(), tempfile())
letter_column <- sample(LETTERS, replace = TRUE, size = 1e6)
value <- rnorm(n = length(letter_column), mean = as.integer(charToRaw(paste(letter_column, collapse = ""))))
letter_table <- tibble(letter = letter_column, value = value) %>% arrange(letter)
dbWriteTable(con, "letter", letter_table)
system.time(tbl(con, "letter") %>% group_by(letter) %>% summarise(mean(value)) %>% collect())
## user system elapsed
## 0.436 0.025 0.461
library(duckdb)
con_duck <- dbConnect(duckdb(tempfile()))
dbWriteTable(con_duck, "letter", letter_table)
system.time(tbl(con_duck, "letter") %>% group_by(letter) %>% summarise(mean(value)) %>% collect())
## user system elapsed
## 0.131 0.012 0.113
microbenchmark::microbenchmark(times = 10,
tbl(con, "letter") %>% group_by(letter) %>% summarise(mean(value)) %>% collect(),
tbl(con_duck, "letter") %>% group_by(letter) %>% summarise(mean(value)) %>% collect()
)
## Unit: milliseconds
## expr
## tbl(con, "letter") %>% group_by(letter) %>% summarise(mean(value)) %>% collect()
## tbl(con_duck, "letter") %>% group_by(letter) %>% summarise(mean(value)) %>% collect()
## min lq mean median uq max neval cld
## 410.55408 415.80349 423.26792 420.89233 428.5054 444.0424 10 b
## 88.92357 93.88219 99.39049 98.74699 104.6958 109.3912 10 a
DBI::dbExecute(con, "CREATE INDEX IF NOT EXISTS letters_index ON letter (letter)")
## [1] 0
DBI::dbExecute(con_duck, "CREATE INDEX IF NOT EXISTS letters_index ON letter (letter)")
## [1] 0
microbenchmark::microbenchmark(times = 10,
tbl(con, "letter") %>% group_by(letter) %>% summarise(mean(value)) %>% collect(),
tbl(con_duck, "letter") %>% group_by(letter) %>% summarise(mean(value)) %>% collect()
)
## Unit: milliseconds
## expr
## tbl(con, "letter") %>% group_by(letter) %>% summarise(mean(value)) %>% collect()
## tbl(con_duck, "letter") %>% group_by(letter) %>% summarise(mean(value)) %>% collect()
## min lq mean median uq max neval cld
## 204.44771 207.6635 214.65076 211.21876 218.63256 231.55214 10 b
## 86.25933 87.6635 90.16922 89.43304 92.84429 96.74997 10 a
Rendering SQL using R code
When you programmatically create SQL in R, DO NOT USE paste or glue to fill in parameters. Use glue_sql instead. It sanitizes the input, making it safe from different kinds of attacks (SQL injection) that allow an unauthorized person to access or destroy the data in your tables.
important_table <- tibble(id = 1:5, value = LETTERS[1:5])
id <- 1
(sql <- glue::glue("SELECT * FROM important_table WHERE id = '{id}'"))
## SELECT * FROM important_table WHERE id = '1'
sqldf(sql)
## id value
## 1 1 A
id <- "1' OR 1=1 --;"
(sql <- glue::glue("SELECT * FROM important_table WHERE id = '{id}'"))
## SELECT * FROM important_table WHERE id = '1' OR 1=1 --;'
sqldf(sql)
## id value
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
Now let’s see how to destroy a table:
db <- dbConnect(RSQLite::SQLite(), tempfile())
dbWriteTable(db, "important_table", important_table)
tbl(db, "important_table")
## # Source: table<important_table> [5 x 2]
## # Database: sqlite 3.40.1 [/tmp/RtmpZrU1d6/file23f83e3ac1feea]
## id value
## <int> <chr>
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
id <- 1
(sql <- glue::glue("UPDATE important_table SET value = 'XD' WHERE id = '{id}'"))
## UPDATE important_table SET value = 'XD' WHERE id = '1'
dbExecute(db, sql)
## [1] 1
tbl(db, "important_table")
## # Source: table<important_table> [5 x 2]
## # Database: sqlite 3.40.1 [/tmp/RtmpZrU1d6/file23f83e3ac1feea]
## id value
## <int> <chr>
## 1 1 XD
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
# Destroy the whole table!
# https://xkcd.com/327/
id <- "1' OR 1=1 --;"
(sql <- glue::glue("UPDATE important_table SET value = 'XD' WHERE id = '{id}'"))
## UPDATE important_table SET value = 'XD' WHERE id = '1' OR 1=1 --;'
dbExecute(db, sql)
## [1] 5
tbl(db, "important_table")
## # Source: table<important_table> [5 x 2]
## # Database: sqlite 3.40.1 [/tmp/RtmpZrU1d6/file23f83e3ac1feea]
## id value
## <int> <chr>
## 1 1 XD
## 2 2 XD
## 3 3 XD
## 4 4 XD
## 5 5 XD
## Now with glue_sql
db <- dbConnect(RSQLite::SQLite(), tempfile())
dbWriteTable(db, "important_table", important_table)
tbl(db, "important_table")
## # Source: table<important_table> [5 x 2]
## # Database: sqlite 3.40.1 [/tmp/RtmpZrU1d6/file23f83e53a7f619]
## id value
## <int> <chr>
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
id <- "1"
(sql <- glue::glue_sql("UPDATE important_table SET value = 'XD' WHERE id = {id}", .con = db))
## <SQL> UPDATE important_table SET value = 'XD' WHERE id = '1'
dbExecute(db, sql)
## [1] 1
tbl(db, "important_table")
## # Source: table<important_table> [5 x 2]
## # Database: sqlite 3.40.1 [/tmp/RtmpZrU1d6/file23f83e53a7f619]
## id value
## <int> <chr>
## 1 1 XD
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
# Table is safe!
id <- "1' OR 1=1 --;"
(sql <- glue::glue_sql("UPDATE important_table SET value = 'XD' WHERE id = {id}", .con = db))
## <SQL> UPDATE important_table SET value = 'XD' WHERE id = '1'' OR 1=1 --;'
dbExecute(db, sql)
## [1] 0
tbl(db, "important_table")
## # Source: table<important_table> [5 x 2]
## # Database: sqlite 3.40.1 [/tmp/RtmpZrU1d6/file23f83e53a7f619]
## id value
## <int> <chr>
## 1 1 XD
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
The moral is short - use glue_sql or dbplyr to create SQL queries.
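Another safe option is a parameterized query through plain DBI - a quick sketch (reusing the db connection and important_table from above); the driver binds the parameter separately from the SQL text, so the injection attempt is treated as a literal, non-matching id:
id <- "1' OR 1=1 --;"
# The parameter is bound by the driver, never spliced into the SQL string.
dbGetQuery(db, "SELECT * FROM important_table WHERE id = ?", params = list(id))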
Conclusion
The list of topics presented above is not exhaustive, and probably not even in the correct order of importance. But it should be a good start.