From a dissertation appendix to a collaboration tool.

Kaiaulu is an open-source R package for mining software repositories. This page describes, informally, some of the design motivations behind the tool. A more formal exposition was published in the ECSA 2021 proceedings (LNCS). The pre-print is also available.

R Glue Code

Kaiaulu began as what is better described as “R glue code”. Given a path to a project’s git log, Kaiaulu would make a system call to Perceval’s git interface, which transforms a .git folder into a .json file. Kaiaulu would then read the .json file into memory and convert it to a table. At first glance, this may look like the beginning of yet another horror story: soon, Kaiaulu would become an amalgamation of scripts that no poor soul would dare use to collect data:

Data Pipeline by XKCD

Instead of meeting this terrible end, I defined this very simple task in Kaiaulu as an R package. R packages follow a more structured convention, which means the code organization fits a common abstraction and requires less explanation.

In this example, a parse_gitlog() function was placed in R/parser.R, and a short Notebook, vignettes/gitlog_showcase.Rmd, explained the function and showcased the data. But isn’t this obvious? Surprisingly, no. Pimentel et al. (MSR 2019) found that in Python projects containing Jupyter Notebooks, only 10.3% defined local imports (i.e., imports of modules defined in the repository directory).
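The glue step described at the start of this section (call an external tool, read its JSON output, flatten it into a table) can be sketched as follows. This is a minimal illustration in Python, not Kaiaulu’s actual R code. The exporter command, the one-JSON-object-per-line output, and the field names (commit, author, files) are all assumptions made for the example, and a small stand-in command takes the place of Perceval so the sketch runs without it installed.

```python
import json
import subprocess
import sys


def run_exporter(command):
    """Run an external exporter command and return its stdout as text."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def records_to_table(json_lines):
    """Flatten one-JSON-object-per-line text into one row per (commit, file).

    The field names used here are hypothetical, chosen for illustration.
    """
    rows = []
    for line in json_lines.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        for changed in record.get("files", []):
            rows.append({
                "commit": record["commit"],
                "author": record["author"],
                "file": changed["file"],
            })
    return rows


# Hypothetical sample record; a real exporter would emit many of these.
SAMPLE = {
    "commit": "abc123",
    "author": "Ada",
    "files": [{"file": "R/parser.R"}, {"file": "README.md"}],
}

# Stand-in for the real exporter: a child Python process that prints SAMPLE
# as a single JSON line. Replace with the actual tool invocation.
FAKE_EXPORTER = [sys.executable, "-c", "print({!r})".format(json.dumps(SAMPLE))]
```

Calling `records_to_table(run_exporter(FAKE_EXPORTER))` returns a list of dictionaries, one per changed file, which is the table-like shape the rest of a pipeline would consume. Keeping the two functions separate mirrors the point of this section: each step lives in a named, documented function rather than in an ad hoc script.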

Whitebox Pipelines and Data Exploration

Research is naturally an iterative process of revisions.

These revisions may result not only from a research advisor’s feedback. They may also come from a domain expert, a collaborator, a stakeholder, a publication’s peer review, or even a student you are mentoring. In most cases, the interest is in understanding the pipeline in plain English, not in reading code. Yet, surprisingly, MSR tools are not written in a manner that enables literate programming. The result is tables and explanations lost in e-mails or in file-name versioning:

Documents by XKCD

This is unfortunate, as many assumptions, which do not carry over from project to project, are only visible in code and overlooked during revisions. Because R packages rely on an API + Notebook architecture, Kaiaulu naturally supports literate programming. Kaiaulu takes this one step further with project configuration files, distilling project-specific information used in the analysis.
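To illustrate the idea of a project configuration file, here is a minimal sketch in Python. It is not Kaiaulu’s actual configuration format: the keys (project, git_repo_path, keep_file_extensions) and the JSON encoding are hypothetical, chosen only to show how project-specific assumptions can live in one reviewable place instead of being buried in analysis code.

```python
import json

# Hypothetical project configuration. In practice this would be a file kept
# next to the analysis, one per project under study.
PROJECT_CONFIG = """
{
  "project": "example-project",
  "git_repo_path": "path/to/example-project/.git",
  "keep_file_extensions": ["R", "py"]
}
"""

REQUIRED_KEYS = {"project", "git_repo_path", "keep_file_extensions"}


def load_config(text):
    """Parse the configuration and fail early if a required key is missing."""
    config = json.loads(text)
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise KeyError(f"missing configuration keys: {sorted(missing)}")
    return config


def filter_by_extension(paths, config):
    """Keep only the file paths whose extension the project config lists."""
    keep = tuple("." + ext for ext in config["keep_file_extensions"])
    return [path for path in paths if path.endswith(keep)]
```

With this split, a reviewer who wants to know which files were analyzed for a given project reads the configuration, not the pipeline code, and the same `filter_by_extension` step can be reused unchanged across projects.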

Notebooks also provide a more transparent pipeline: anyone can understand the pipeline one step at a time and explore the data transformations at every step along the way. Indeed, discussion of and collaboration on an analysis now take place in a GitHub pull request, rather than in e-mail threads, until the analysis is ready to be merged.

More than R Glue Code

While Kaiaulu started as a means to facilitate obtaining data sources and using them with third-party tools, it has since grown to have its own data representations and analysis pipelines. Refer to one of the Notebooks to get started.