The R code to reproduce the results is available from the GitHub repo nanxstats/cran-file-exts.
When applied correctly, file extensions can be informative. They are the very first clue about how to handle a file, available without parsing its content.
To properly capture and classify files in source R packages, I want to know which file extensions R packages use most often.
We can find out by downloading every source package available on CRAN, one at a time, and collecting the file extensions inside each tarball:
library("curl")
library("tools")
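# CRAN mirror and the index of all source packages it currently offers
repo <- "https://cran.rstudio.com/"
db <- as.data.frame(available.packages(paste0(repo, "src/contrib/")), stringsAsFactors = FALSE)
# Build the tarball file name and download URL for every package
pkgs <- db$Package
files <- paste0(pkgs, "_", db$Version, ".tar.gz")
links <- paste0(repo, "src/contrib/", files)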
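# List the archive contents without extracting them, then keep the
# unique, non-empty file extensions
find_ext <- function(path) {
  x <- unique(file_ext(untar(path, list = TRUE)))
  x[nzchar(x)]
}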
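# Download each source tarball, record its extensions as one
# tab-separated line in exts.txt, then delete the tarball
for (i in seq_along(pkgs)) {
  cat("Downloading", i, "/", length(pkgs), "\n")
  curl_download(links[i], destfile = files[i])
  x <- find_ext(files[i])
  write(paste0(x, collapse = "\t"), file = "exts.txt", append = TRUE)
  unlink(files[i])
}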
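To see what find_ext() keeps and what it drops, here is a small sketch that applies the same logic to a made-up file listing instead of a real tarball. The paths are hypothetical and only stand in for the output of untar(path, list = TRUE); the functions are the same ones used above.

```r
library("tools")

# Hypothetical file listing, standing in for untar(path, list = TRUE)
paths <- c(
  "pkg/DESCRIPTION",
  "pkg/R/foo.R",
  "pkg/R/bar.R",
  "pkg/src/foo.cpp",
  "pkg/inst/extdata/data.tar.gz",
  "pkg/man/foo.Rd"
)

# file_ext() returns only the last extension, and "" when there is none
exts <- file_ext(paths)

# Same post-processing as find_ext(): deduplicate, drop empty strings
exts <- unique(exts)
exts <- exts[nzchar(exts)]
print(exts)
```

Two things are worth noticing: a file such as data.tar.gz is recorded as gz, and files without an extension, such as DESCRIPTION, are dropped entirely.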
Each line of exts.txt now holds the extensions found in one package. Since this data is one-dimensional, the natural first summary is a frequency table:
x <- readLines("exts.txt")
x <- tolower(unlist(strsplit(x, split = "\t")))
y <- sort(table(x), decreasing = TRUE)
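It looks like we have 1,529 unique file extensions after lowercasing. The distribution is also heavy-tailed: the roughly 5% of extensions that occur more than 50 times account for about 96% of all occurrences. Note that find_ext() deduplicates extensions within each package, so the counts essentially measure how many packages use an extension, not how many files carry it.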
length(y)
#> [1] 1529
z <- y[y > 50L]
length(z) / length(y)
#> [1] 0.04905167
sum(z) / sum(y)
#> [1] 0.9611313
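The single cutoff at 50 can be generalized to a cumulative coverage curve. The sketch below runs the same split, lowercase, and tabulate steps on a few made-up lines (hypothetical stand-ins for exts.txt) and then computes the share of all occurrences covered by the top-k extensions.

```r
# Hypothetical contents of exts.txt: one line per package, tab-separated
lines <- c("R\tRd\tcpp", "R\tRd", "r\tRd\tmd", "R\tRd\th")

# Same steps as above: split on tabs, lowercase, tabulate, sort
x <- tolower(unlist(strsplit(lines, split = "\t")))
y <- sort(table(x), decreasing = TRUE)

# Cumulative share of all occurrences covered by the top-k extensions
coverage <- cumsum(as.vector(y)) / sum(y)
names(coverage) <- names(y)
print(round(coverage, 2))
```

Applied to the real y, the first index at which coverage exceeds a chosen threshold tells you how many extensions are needed to reach it.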
We can also cluster the frequencies with any one-dimensional clustering algorithm, such as maximum homogeneity clustering, which is implemented in my R package oneclust. Say we are interested in file extensions that appear at least 5 times:
library("oneclust")
eoi <- y[y > 4L]
cl <- oneclust(eoi, 4)
cl$cluster
#> [1] 4 4 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [33] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [65] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [97] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [129] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [161] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [193] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [225] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [257] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
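The raw label vector is hard to read on its own. Here is a minimal sketch of one way to summarize it, using made-up counts and made-up labels as stand-ins for eoi and cl$cluster (they are not real oneclust output). With the real objects, the same split() and table() calls should apply.

```r
# Hypothetical counts, sorted in decreasing order like eoi
eoi_toy <- c(r = 900L, rd = 850L, md = 300L, cpp = 120L, h = 110L,
             txt = 20L, csv = 8L)

# Hypothetical labels standing in for cl$cluster (not real oneclust output)
cluster_toy <- c(3L, 3L, 2L, 2L, 2L, 1L, 1L)

# Extensions grouped by cluster label
members <- split(names(eoi_toy), cluster_toy)
print(members)

# Number of extensions in each cluster
print(table(cluster_toy))

# Smallest and largest count within each cluster
print(tapply(eoi_toy, cluster_toy, range))
```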
Combine the extensions, guessed MIME types, counts, and cluster labels into one table and display it with the awesome DT:
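df <- data.frame(
  "ext" = names(eoi),
  # Guess the MIME type from a dummy file name such as ".csv"
  "mime" = mime::guess_type(paste0(".", names(eoi))),
  "count" = as.vector(eoi),
  # Reverse the labels so that cluster 1 holds the most frequent extensions
  "cluster" = dplyr::recode(cl$cluster, `1` = 4, `2` = 3, `3` = 2, `4` = 1)
)
# Render the data frame as an interactive table
DT::datatable(df)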
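The recode() call only reverses the cluster labels, so that the most frequent extensions end up in cluster 1. As a sketch, and assuming the labels are the integers 1 to k, the same reversal can be written with plain arithmetic. The label vector below is a hypothetical stand-in for cl$cluster.

```r
# Hypothetical labels standing in for cl$cluster, with k = 4 clusters
cluster <- c(4, 4, 3, 3, 2, 2, 1, 1, 1)
k <- 4

# Reverse the label order: 1 <-> 4, 2 <-> 3
reversed <- k + 1 - cluster
print(reversed)
```

This avoids the dplyr dependency and keeps working if the number of clusters changes, at the cost of being less explicit than the mapping spelled out in recode().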
After looking through the table, what interesting discoveries can you make?