Parsing Human-Readable Text Data from the Web with Readability.js and R

Nan Xiao · 2022-08-02

The R and JavaScript code to reproduce the results in this post is available from https://github.com/nanxstats/r-readability-parser.


Readability.js

Maybe you have used tools like rvest to harvest text data from web pages. Naturally, this often requires considerable human effort up front to understand the structure of the target website.

The picture looks quite different at web scale. To parse the content of many more sites and many more types of pages, our tool needs to be adaptive enough to extract the most relevant text, rather than relying purely on manually crafted logic. We might tolerate missing some useful text or including some irrelevant text; such errors probably will not matter once the text data we collect is large enough.

Fortunately, Readability.js offers exactly such a tool for parsing human-readable text from any web page. It was built for the Reader View feature in Firefox but is also usable as an open source, standalone JavaScript library.

In this post, I will create an R wrapper for Readability.js using the R package V8.

Packing JS dependencies

Before we write the wrapper, the first step is identifying and packing the JavaScript dependencies to run in the V8 engine. The three key dependencies are @mozilla/readability, jsdom, and dompurify.

Following the V8 vignette on using NPM packages, we pack them as follows. First, install Node.js and browserify:

brew install node
npm install -g browserify

Pack Readability.js:

npm install @mozilla/readability
echo "window.Readability = require('@mozilla/readability');" > in.js
browserify in.js -o readability.js

Pack jsdom for converting HTML into operable DOM document objects:

npm install jsdom
echo "window.jsdom = require('jsdom');" > in.js
browserify in.js -o jsdom.js

Pack DOMPurify, mentioned in the Readability.js security recommendation, for sanitizing the output to avoid script injection:

npm install dompurify
echo "window.dompurify = require('dompurify');" > in.js
browserify in.js -o dompurify.js

Writing an R binding

We will write some wrapper JavaScript functions to implement the workflow that uses all three JS libraries above.

function readabilityParser(html, url, candidates, threshold) {
  // Build a DOM from the raw HTML with jsdom; passing the page URL
  // lets relative links in the document resolve correctly
  let doc = new jsdom.JSDOM(
    html,
    { url: url }
  );
  let reader = new Readability.Readability(
    doc.window.document,
    { nbTopCandidates: candidates, charThreshold: threshold }
  );
  // parse() returns null when no article content can be extracted
  let res = reader.parse();
  if (res === null) return null;

  // Sanitize results to avoid script injection
  const purifyWindow = new jsdom.JSDOM('').window;
  const DOMPurify = dompurify(purifyWindow);

  let clean = DOMPurify.sanitize(res.content);
  res.content = clean;

  return res;
}

function isReadable(html, min_content_length, min_score) {
  // Quick heuristic check: is this page likely suitable for Readability?
  let doc = new jsdom.JSDOM(html);
  return Readability.isProbablyReaderable(
    doc.window.document,
    { minContentLength: min_content_length, minScore: min_score }
  );
}

The R wrapper is quite straightforward if you follow the V8 introduction vignette. As it suggests, the interactive JavaScript console available via ct$console() is both fun and useful when debugging.

readability <- function(html, url, candidates = 5L, threshold = 500L) {
  ct <- V8::v8(global = "window")

  # Stub out web APIs that jsdom calls but V8 does not provide
  ct$eval("function setTimeout(){}")
  ct$eval("function clearInterval(){}")
  # encoding.min.js must come before jsdom.js (see "Common issues" below)
  ct$source("js/encoding.min.js")
  ct$source("js/jsdom.js")
  ct$source("js/dompurify.js")
  ct$source("js/readability.js")
  ct$eval(readLines("js/readability-parser.js"))

  # ct$get(V8::JS("Object.keys(window)"))
  ct$call("readabilityParser", html, url, candidates, threshold)
}

is_readable <- function(html, min_content_length = 140, min_score = 20) {
  ct <- V8::v8(global = "window")

  # Same setup as readability(), minus dompurify.js,
  # which the readability check does not need
  ct$eval("function setTimeout(){}")
  ct$eval("function clearInterval(){}")
  ct$source("js/encoding.min.js")
  ct$source("js/jsdom.js")
  ct$source("js/readability.js")
  ct$eval(readLines("js/readability-parser.js"))

  # ct$get(V8::JS("Object.keys(window)"))
  ct$call("isReadable", html, min_content_length, min_score)
}
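
As a concrete debugging aid, we can open that console on a context with the bundles loaded and inspect the window object directly. A minimal sketch, mirroring the setup in the wrappers above:

ct <- V8::v8(global = "window")
ct$eval("function setTimeout(){}")
ct$eval("function clearInterval(){}")
ct$source("js/encoding.min.js")
ct$source("js/jsdom.js")

# Start an interactive JavaScript console inside the R session;
# press ESC to return to R
ct$console()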

Example

Let’s parse a recipe page (pasta with caramelized peppers, anchovies, and ricotta) from NYT Cooking.

Check whether the page is likely to be suitable for readability parsing:

url <- "https://cooking.nytimes.com/recipes/1021246-pasta-with-caramelized-peppers-anchovies-and-ricotta"

html <- url |>
  rvest::read_html() |>
  as.character()

html |> is_readable()
#> [1] TRUE
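
In a batch setting, this check works as a cheap filter in front of the heavier parsing step. A minimal sketch using the two wrappers defined above (the helper name parse_if_readable is hypothetical):

# Parse a page only when it passes the readability check
parse_if_readable <- function(url) {
  html <- url |>
    rvest::read_html() |>
    as.character()
  if (!is_readable(html)) {
    return(NULL)
  }
  readability(html, url = url)
}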

We can get the title and the clean, plain text corpus, usable for downstream text data modeling:

lst <- html |> readability(url = url)
cat(lst$title)
#> Pasta With Caramelized Peppers, Anchovies and Ricotta Recipe
lst$textContent |>
  gsub("\\n", " ", x = _, perl = TRUE) |>
  gsub("^\\s+|\\s+$|\\s+(?=\\s)", "", x = _, perl = TRUE) |>
  stringr::str_wrap(width = 74) |>
  cat()
#> Ingredients Kosher salt 12 ounces short pasta, such as radiatori,
#> fusilli or campanelle 3 tablespoons extra-virgin olive oil, plus more
#> for drizzling 8 to 10 anchovy fillets, chopped, or use a dash or two of
#> soy sauce 2 large rosemary sprigs 6 garlic cloves, smashed and peeled
#> Large pinch of red-pepper flakes 2 sweet bell peppers (red, orange
#> or yellow), thinly sliced 2 tablespoons dry red, white or rosé wine,
#> or use dry vermouth or water 1 tablespoon unsalted butter Fresh lemon
#> juice ½ cup fresh ricotta 2 scallions, thinly sliced, or 1/4 cup sliced
#> red onion Freshly ground black pepper ¼ cup finely chopped fresh mint,
#> basil or thyme, plus torn mint or basil leaves and tender sprigs, for
#> garnish Freshly grated Parmesan (optional) Preparation Bring a large pot
#> of heavily salted water to a boil. Add the pasta and cook, according to
#> package instructions, until the pasta is just al dente. As pasta cooks,
#> heat a large sauté pan over medium-high, and add 3 tablespoons olive oil.
#> When the oil is hot, add the anchovies and rosemary, and sauté until the
#> anchovies start to dissolve, about 1 minute. Add the garlic and red-pepper
#> flakes, and sauté until the garlic turns pale golden in spots, about 1 to
#> 2 minutes. Add the bell peppers and a large pinch of salt to the pan, and
#> sauté until the bell peppers are very soft and well caramelized, 10 to 15
#> minutes, lowering the heat if the peppers start becoming too dark. Add the
#> wine (or water) and the butter, and sauté, scraping up the browned bits on
#> the bottom of the pan. Taste and season with lemon juice and more salt as
#> needed. Put 1/4 cup ricotta and the scallions in a large serving bowl, and
#> season aggressively with black pepper. Use a coffee mug or measuring cup
#> to scoop about 1/2 cup pasta water from the pot. Drain the pasta, then add
#> it to the bowl with the ricotta and scallions, tossing well. Add the bell
#> pepper mixture and the herbs, and toss well, adding a splash or two of
#> pasta water if the mixture looks dry. Taste and season with more salt if
#> needed. Spoon pasta into bowls, and top with dollops of the remaining 1/4
#> cup ricotta, a drizzle of oil and a little Parmesan, if you like. Shower
#> torn herb leaves over all.

We also get the clean HTML, which preserves more structural information than the plain text. We can process it further, for example, with xml2 or pandoc.

lst$content |>
  htmltools::HTML() |>
  htmltools::browsable()

You can preview the clean HTML here.
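
As one possible next step, here is a sketch that reads the clean HTML back with xml2 and collects the links that survived sanitization (assuming lst from the example above):

# Parse the sanitized HTML and extract link text and targets
doc <- xml2::read_html(lst$content)
links <- xml2::xml_find_all(doc, ".//a")
data.frame(
  text = xml2::xml_text(links),
  href = xml2::xml_attr(links, "href")
)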

Common issues

I encountered and resolved two common issues when using the JS libraries.

TextEncoder is not defined

Following the hints here, I saved text-encoding explicitly as another dependency. Doing this eliminates the error ReferenceError: TextEncoder is not defined when sourcing jsdom.js with ct$source().
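
This is why the wrappers above source encoding.min.js before jsdom.js, so that TextEncoder exists by the time jsdom is loaded:

ct <- V8::v8(global = "window")
# text-encoding polyfill must come before jsdom
ct$source("js/encoding.min.js")
ct$source("js/jsdom.js")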

setTimeout/clearInterval is not defined

Some web APIs, such as setTimeout(), are not part of the standard JavaScript library that the V8 engine provides. I followed the suggestions here and defined stubs for setTimeout() and clearInterval() to avoid errors like ReferenceError: setTimeout is not defined when running jsdom.
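
The stubs only need to exist; empty function bodies are enough to satisfy jsdom here:

# Empty stubs for web APIs missing from the V8 engine
ct$eval("function setTimeout(){}")
ct$eval("function clearInterval(){}")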