PRQL (Pipelined Relational Query Language, pronounced “Prequel”) has attracted high attention, attracting more than 6,000 GitHub stars in the year since its arrival in early 2022.
PRQL can easily be tried in the playground, and it can also be run on R by prqlr, which was recently registered on CRAN.
And with the wonderful flexibility of knitr, the development version of prqlr allows us to run PRQL code blocks on Quarto documents!1
This document demonstrates the use of PRQL code blocks on Quarto and compares them to SQL code blocks2 and dplyr (and dbplyr)3.
Caution
PRQL is under active development and will likely undergo frequent breaking changes before reaching version 1.0.
This document is based on PRQL 0.4.1, the latest version at this time, which will probably be outdated in a few mounths!
Setup
As with the SQL code blocks that exists as a default feature of knitr, a connection to the database must be set up in R in order to use PRQL code blocks. In this document, I use an in-memory database of DuckDB that can be queried directly to Data Frames in R.
R
library(DBI)library(duckdb)library(prqlr)library(dplyr, warn.conflicts =FALSE)library(dbplyr, warn.conflicts =FALSE)# Generate some Data Frames from sample csv files.base_url<-"https://raw.githubusercontent.com/PRQL/prql/0.4.1/prql-compiler/tests/integration/data/chinook/{name}.csv"options(readr.show_col_types =FALSE)albums<-readr::read_csv(glue::glue(base_url, name ="albums"))artists<-readr::read_csv(glue::glue(base_url, name ="artists"))tracks<-readr::read_csv(glue::glue(base_url, name ="tracks"))# Connect to DuckDB in-memory database.con<-dbConnect(duckdb(), ":memory:")# Register Data Frames to the database.duckdb_register(con, "mtcars", mtcars)duckdb_register(con, "albums", albums)duckdb_register(con, "artists", artists)duckdb_register(con, "tracks", tracks)
Basic example
Note that only dplyr does the computation without duckdb, so the row names of the mtcars DataFrame are retained and the results are different.
WITH table_1 AS (SELECTSUM(unit_price) AS _expr_0,COUNT(*) AS _expr_1, album_idFROM tracksGROUPBY album_id),table_2 AS (SELECTSUM(table_1._expr_0) AS artist_price,SUM(table_1._expr_1) AS track_count, albums.artist_idFROM table_1JOIN albums ON table_1.album_id = albums.album_idGROUPBY albums.artist_id)SELECT artists.name, table_2.artist_price, table_2.track_count, table_2.artist_price / table_2.track_count AS avg_track_priceFROM table_2JOIN artists ON table_2.artist_id = artists.artist_idORDERBY table_2.artist_price DESCLIMIT3
pyprql allows PRQL code to be executed interactively on Jupyter by Jupyter Magic, so PRQL can also be executed on Quarto when using Jupyter engine.↩︎
The SQL queries written in the SQL code blocks are generated from PRQL queries by prql-compiler (on the playground).↩︎
I’m a big fan of dplyr, so I have high hopes for PRQL that can be written like dplyr!↩︎
Source Code
---title: Use PRQL executable code blocks on Quartodate: 2023-01-25date-modified: todayauthor: "@eitsupi"engine: knitrecho: trueformat: html: self-contained: true toc: true code-tools: true code-link: true df-print: kable page-layout: full---[PRQL](https://prql-lang.org/) (Pipelined Relational Query Language, pronounced "Prequel") has attracted high attention,attracting more than 6,000 GitHub stars in the year since its arrival in early 2022.PRQL can easily be tried in [the playground](https://prql-lang.org/playground/),and it can also be run on R by `{prqlr}`, which was recently registered on CRAN.And with the wonderful flexibility of `{knitr}`,the development version of `{prqlr}` allows us to run PRQL code blocks on Quarto documents![^pyprql][^pyprql]: [pyprql](https://github.com/prql/PyPrql) allows PRQL code to be executed interactivelyon Jupyter by Jupyter Magic,so PRQL can also be executed on Quarto when using Jupyter engine.This document demonstrates the use of PRQL code blocks on Quartoand compares them to SQL code blocks[^sql] and `{dplyr}` (and `{dbplyr}`)[^dplyr].[^sql]: The SQL queries written in the SQL code blocks are generated from PRQL queries by prql-compiler (on the playground).[^dplyr]: I'm a big fan of dplyr, so I have high hopes for PRQL that can be written like dplyr!:::{.callout-caution}## CautionPRQL is under active development and will likely undergo frequent breaking changesbefore reaching version 1.0.This document is based on PRQL 0.4.1, the latest version at this time,which will probably be outdated in a few mounths!:::## SetupAs with the SQL code blocks that exists as a default feature of `{knitr}`,a connection to the database must be set up in R in order to use PRQL code blocks.In this document, I use an in-memory database of DuckDB that can be queried directly to Data Frames in R.```{r filename="R"}library(DBI)library(duckdb)library(prqlr)library(dplyr, warn.conflicts =FALSE)library(dbplyr, warn.conflicts =FALSE)# Generate some Data Frames from sample csv files.base_url <-"https://raw.githubusercontent.com/PRQL/prql/0.4.1/prql-compiler/tests/integration/data/chinook/{name}.csv"options(readr.show_col_types =FALSE)albums <- readr::read_csv(glue::glue(base_url, name ="albums"))artists <- readr::read_csv(glue::glue(base_url, name ="artists"))tracks <- readr::read_csv(glue::glue(base_url, name ="tracks"))# Connect to DuckDB in-memory database.con <-dbConnect(duckdb(), ":memory:")# Register Data Frames to the database.duckdb_register(con, "mtcars", mtcars)duckdb_register(con, "albums", albums)duckdb_register(con, "artists", artists)duckdb_register(con, "tracks", tracks)```## Basic exampleNote that only `{dplyr}` does the computation without `{duckdb}`,so the row names of the `mtcars` DataFrame are retained and the results are different.:::{.panel-tabset}### PRQL```{prql filename="PRQL"}#| connection: confrom mtcarsfilter cyl > 6select [cyl, mpg]sort [-cyl]derive [mpg_int = round 0 mpg]take 3```### SQL```{sql filename="SQL"}--| connection: conSELECT cyl, mpg,ROUND(mpg, 0) AS mpg_intFROM mtcarsWHERE cyl >6ORDERBY cyl DESCLIMIT3```### dbplyr```{r filename="R"}tbl(con, "mtcars") |>filter(cyl >6) |>select(cyl, mpg) |>arrange(desc(cyl)) |>mutate(mpg_int =round(mpg, 0)) |>head(3) |>collect()```### dplyr```{r filename="R"}mtcars |>filter(cyl >6) |>select(cyl, mpg) |>arrange(desc(cyl)) |>mutate(mpg_int =round(mpg, 0)) |>head(3)```:::## Join multiple tablesWe can clearly see how similar PRQL and `{dplyr}` are for long queries involving joins.:::{.panel-tabset}### PRQL```{prql filename="PRQL"}#| connection: confrom tracksselect [album_id, name, unit_price]sort [-unit_price, name]group album_id ( aggregate [ track_count = count, album_price = sum unit_price ])join albums [==album_id]group artist_id ( aggregate [ track_count = sum track_count, artist_price = sum album_price ])join artists [==artist_id]select [artists.name, artist_price, track_count]sort [-artist_price]derive avg_track_price = artist_price / track_counttake 3```### SQL```{sql filename="SQL"}#| connection: conWITH table_1 AS (SELECTSUM(unit_price) AS _expr_0,COUNT(*) AS _expr_1, album_idFROM tracksGROUPBY album_id),table_2 AS (SELECTSUM(table_1._expr_0) AS artist_price,SUM(table_1._expr_1) AS track_count, albums.artist_idFROM table_1JOIN albums ON table_1.album_id = albums.album_idGROUPBY albums.artist_id)SELECT artists.name, table_2.artist_price, table_2.track_count, table_2.artist_price / table_2.track_count AS avg_track_priceFROM table_2JOIN artists ON table_2.artist_id = artists.artist_idORDERBY table_2.artist_price DESCLIMIT3```### dbplyr```{r filename="R"}tbl(con, "tracks") |>select(album_id, name, unit_price) |>window_order(desc(unit_price), name) |>group_by(album_id) |>summarise(track_count =n(),album_price =sum(unit_price, na.rm =TRUE),.groups ="drop" ) |>inner_join(tbl(con, "albums"), by ="album_id") |>group_by(artist_id) |>summarise(track_count =sum(track_count, na.rm =TRUE),artist_price =sum(album_price, na.rm =TRUE),.groups ="drop" ) |>inner_join(tbl(con, "artists"), by ="artist_id") |>select(name, artist_price, track_count) |>arrange(desc(artist_price)) |>mutate(avg_track_price = artist_price / track_count) |>head(3)```### dplyr```{r filename="R"}tracks |>select(album_id, name, unit_price) |>arrange(desc(unit_price), name) |>group_by(album_id) |>summarise(track_count =n(),album_price =sum(unit_price, na.rm =TRUE),.groups ="drop" ) |>inner_join(albums, by ="album_id") |>group_by(artist_id) |>summarise(track_count =sum(track_count, na.rm =TRUE),artist_price =sum(album_price, na.rm =TRUE),.groups ="drop" ) |>inner_join(artists, by ="artist_id") |>select(name, artist_price, track_count) |>arrange(desc(artist_price)) |>mutate(avg_track_price = artist_price / track_count) |>head(3)```:::```{r}#| echo: falsedbDisconnect(con, shutdown =TRUE)```## Session infoR packages used in this document are these.```{r filename="R"}sessioninfo::package_info() |>print()```