The R code in this post is also available in this GitHub Gist.
chromote is an R package that allows one to automate tasks driven by web browsers. It works by providing an API to communicate with Chromium-based browsers via the Chrome DevTools Protocol (CDP). For example, CDP can help us load HTML pages and print them to PDF files programmatically, much as one would do with mouse clicks in the browser GUI.
Programming with CDP potentially involves asynchronous programming, something I personally find really hard to write! To the rescue, the chromote README gives some great examples. They demonstrate how to write principled async code using the promise construct from {promises} and how to chain promises together, which substantially improves code readability.
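To get a feel for the chaining pattern before involving a browser, here is a minimal sketch that uses only {promises} and {later}. The values and the `result` variable are made up for illustration. The loop at the end drains the event loop by hand, which an interactive R session would normally do for us.

```r
library("promises")

result <- NULL

# Each step receives the resolved value of the previous step as `.`
p <- promise_resolve(2) %...>%
  {
    . * 10
  } %...>%
  {
    result <<- .
  }

# Promise callbacks are scheduled via {later}; run them until none are left
while (!later::loop_empty()) {
  later::run_now()
}

print(result)
```

Nothing in the chain runs at the moment it is defined. The callbacks only fire once the event loop gets a turn, which is why the steps read top to bottom even though they execute later.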
As an exercise, I wrote a function that implements a tiny end-to-end workflow for printing a URL to a PDF. The function calls the low-level CDP API via chromote, is easy to customize, and is relatively easy to reason about.
library("promises")
library("chromote")
#' Print HTML to PDF using chromote
#'
#' @param url Input URL
#' @param filename Output file name
#' @param wait_ If TRUE, run in synchronous mode,
#' otherwise, run in asynchronous mode.
#' @param ... Additional parameters for Page.printToPDF, see
#' <https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-printToPDF>
#' for possible options.
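print_to_pdf <- function(url, filename = NULL, wait_ = FALSE, ...) {
  # Derive a file name from the URL when none is given
  if (is.null(filename)) {
    filename <- url |>
      gsub("^.*://", "", x = _) |>
      gsub("/$", "", x = _) |>
      fs::path_sanitize(replacement = "_") |>
      paste0(".pdf")
  }

  b <- ChromoteSession$new()

  # Chain the CDP calls as promises; each step runs after the previous one resolves
  p <-
    {
      b$Page$navigate(url, wait_ = FALSE)
    } %...>%
    {
      # Wait up to 10 seconds for the page to finish loading
      b$Page$loadEventFired(wait_ = FALSE, timeout_ = 10)
    } %...>%
    {
      b$Page$printToPDF(..., wait_ = FALSE)
    } %...>%
    {
      # The PDF comes back as a base64-encoded string
      .$data
    } %...>%
    {
      outfile <- file(filename, "wb")
      base64enc::base64decode(., output = outfile)
      close(outfile)
    } %...>%
    {
      message(filename)
    } %>%
    # Close the browser session whether the chain succeeded or failed
    finally(~ b$close())

  # In synchronous mode, block until the whole chain has finished
  if (wait_) {
    b$wait_for(p)
  } else {
    p
  }

  invisible(filename)
}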
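The default file name logic at the top of the function can be tried on its own. The sketch below pulls it out into a helper called `url_to_filename()`, a name made up here for illustration. Like the function above, it assumes R 4.2 or later for the `_` placeholder and that {fs} is installed.

```r
# Hypothetical helper that mirrors the default file name logic in print_to_pdf()
url_to_filename <- function(url) {
  url |>
    gsub("^.*://", "", x = _) |> # drop the scheme, e.g. "https://"
    gsub("/$", "", x = _) |> # drop a trailing slash
    fs::path_sanitize(replacement = "_") |> # replace characters unsafe in file names
    paste0(".pdf")
}

url_to_filename("https://pagedown.rbind.io/")
#> [1] "pagedown.rbind.io.pdf"
```

Any slashes left in the path part of a URL are among the characters that get replaced by "_", so the output always lands in the working directory rather than in a nested folder.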
Note that there is already a screenshot_pdf() method defined in chromote. It is a well-crafted wrapper for the Page.printToPDF method in CDP and is used to produce the PDF “screenshot” in webshot2.
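For comparison, a rough sketch of the same task using that built-in method could look like the following. The wrapper name `screenshot_url_pdf()` and its arguments are made up for illustration, and the load-waiting pattern follows what I understand from the chromote README, so treat it as a starting point rather than a reference implementation.

```r
library("chromote")

# Hypothetical convenience wrapper around the built-in screenshot_pdf() method
screenshot_url_pdf <- function(url, filename) {
  b <- ChromoteSession$new()
  on.exit(b$close(), add = TRUE)

  # Start listening for the load event, navigate, then block until it fires
  p <- b$Page$loadEventFired(wait_ = FALSE)
  b$Page$navigate(url, wait_ = FALSE)
  b$wait_for(p)

  b$screenshot_pdf(filename = filename, print_background = TRUE)
}

# Needs a local Chromium-based browser, so only run it interactively
if (interactive()) {
  screenshot_url_pdf("https://pagedown.rbind.io/", "pagedown.pdf")
}
```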
Printing paged HTML to PDF
Since PDF is page-based, the function works best when printing HTML documents with intentionally “paged” layouts. For example, we can run it on a customized ioslides presentation and a pagedown book.
urls <- c(
"https://nanx.me/talks/reimagine-rpkgs/",
"https://pagedown.rbind.io/"
)
fn <- lapply(urls, print_to_pdf, printBackground = TRUE)
fn[[1]] |>
pdftools::pdf_info() |>
str()
#> List of 11
#> $ version : chr "1.4"
#> $ pages : int 11
#> $ encrypted : logi FALSE
#> $ linearized : logi FALSE
#> $ keys :List of 2
#> ..$ Creator : chr "Chromium"
#> ..$ Producer: chr "Skia/PDF m104"
#> $ created : POSIXct[1:1], format: "2022-08-18 23:40:57"
#> $ modified : POSIXct[1:1], format: "2022-08-18 23:40:57"
#> $ metadata : chr ""
#> $ locked : logi FALSE
#> $ attachments: logi FALSE
#> $ layout : chr "no_layout"
In the output, “Skia/PDF m104” means the PDF was produced using the Skia PDF backend in the Chromium-based browser (major version 104).
Notably, the function pagedown::chrome_print() serves a similar purpose: printing HTML to PDF using headless Chrome.
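As a rough sketch of what that could look like for one of the URLs above, assuming {pagedown} and a Chromium-based browser are installed (the wrapper name `chrome_print_url()` is made up for illustration):

```r
# Hypothetical wrapper; `url` and `filename` are illustrative argument names
chrome_print_url <- function(url, filename) {
  pagedown::chrome_print(
    input = url,
    output = filename,
    format = "pdf",
    # Page options, passed on to Page.printToPDF
    options = list(printBackground = TRUE)
  )
}

# Needs a local Chromium-based browser, so only run it interactively
if (interactive()) {
  chrome_print_url("https://pagedown.rbind.io/", "pagedown.pdf")
}
```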
File URL support
In principle, our function should also support file URLs. However, for reasons I have not figured out, printing a local page can throw a timeout error, regardless of whether asynchronous mode is used, and even when the HTML is served by a local HTTP server (e.g., servr):
Unhandled promise error: Chromote: timed out waiting for event Page.loadEventFired
It works better when run in synchronous mode and after having loaded remote URLs such as the two above.
f <- "https://nanx.me/blog/post/r-readability-parser/example.html" |>
curl::curl_download(tempfile(fileext = ".html"))
print_to_pdf(
paste0("file://", normalizePath(f, winslash = "/")),
filename = "example.pdf",
wait_ = TRUE
)
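One possibility worth testing, and this is only a guess on my part, is a race: a local page may finish loading before the Page.loadEventFired listener is registered, so the event is missed and the wait times out. The chromote README describes registering the listener before navigating to avoid this. A minimal sketch of that ordering, using a hypothetical helper `navigate_and_wait()`:

```r
library("promises")
library("chromote")

# Hypothetical helper: returns a promise that resolves once the page at `url`
# has fired its load event in the ChromoteSession `b`
navigate_and_wait <- function(b, url, timeout_ = 10) {
  # Start listening for the load event before navigation begins ...
  loaded <- b$Page$loadEventFired(wait_ = FALSE, timeout_ = timeout_)
  # ... so that the event cannot fire while nobody is listening
  b$Page$navigate(url, wait_ = FALSE) %...>%
    {
      loaded
    }
}
```

If the guess is right, using `navigate_and_wait(b, url)` in place of the first two steps of the promise chain in print_to_pdf() should make local pages print reliably. I have not verified this.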
A good mystery to solve! Please comment below if you have any ideas.