Kaiaulu
From a dissertation appendix to a collaboration tool.
Kaiaulu is an open source R package for mining software repositories. This page describes some of the design motivations behind the tool in a more informal way. A more formal exposition was published in ECSA 2021 LNCS. The pre-print is also available.
R Glue Code
Kaiaulu began as what is better described as “R glue code”. Given a path to a project’s git log, Kaiaulu would make a system call that passed the data to Perceval’s git interface, which transforms a .git folder into a .json file. Kaiaulu would then read the .json file into memory and convert it into a table. At first glance, this may look like the beginning of yet another horror story: soon, Kaiaulu would become an amalgamation of scripts that no poor soul would dare be tasked with using to collect data.
Instead of meeting this terrible end, I defined this very simple task in Kaiaulu as an R package. The definition of R packages follows a more structured convention, which means the code organization fits a common abstraction, and requires less explanation.
In this example, a parse_gitlog() function was placed in R/parser.R, and a short Notebook, vignettes/gitlog_showcase.Rmd, explained the function and showcased the data. But isn’t this obvious? Surprisingly, not so. Pimentel et al. (MSR 2019) found that in Python projects containing Jupyter Notebooks, only 10.3% defined local imports (i.e. imports of modules defined in the repository directory).
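The glue step described above can be sketched in a few lines of R. This is only an illustration under stated assumptions, not Kaiaulu’s actual parse_gitlog(): the function names run_perceval_git() and json_lines_to_table() are hypothetical, the Perceval command-line arguments shown (--git-log, --json-line) are assumed and should be checked against the installed Perceval version, and the jsonlite package is assumed to be installed.

```r
# Hypothetical sketch of the "R glue code" idea; not Kaiaulu's real code.
library(jsonlite)

# Step 1: system call to Perceval's git interface.
# ASSUMPTION: argument order and the --git-log / --json-line flags may
# differ between Perceval versions; verify before relying on them.
run_perceval_git <- function(perceval_path, git_uri, gitlog_path) {
  system2(
    perceval_path,
    args = c("git", "--git-log", shQuote(gitlog_path),
             shQuote(git_uri), "--json-line"),
    stdout = TRUE  # capture the output as a character vector, one line each
  )
}

# Step 2: read the JSON (one object per line) and convert it into a table.
json_lines_to_table <- function(json_lines) {
  json_lines <- json_lines[nzchar(trimws(json_lines))]  # drop blank lines
  con <- textConnection(json_lines)
  on.exit(close(con))
  parsed <- jsonlite::stream_in(con, verbose = FALSE)
  jsonlite::flatten(parsed)  # nested fields become ordinary columns
}

# Made-up sample lines, used only so the sketch runs without Perceval.
sample_lines <- c(
  '{"commit": "a1b2", "author": "Ada"}',
  '{"commit": "c3d4", "author": "Bob"}'
)
commits <- json_lines_to_table(sample_lines)
print(commits)
```

Keeping the system call and the JSON-to-table conversion in separate functions means the second step can be exercised in a Notebook with a handful of sample lines, without requiring Perceval to be installed.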
Whitebox Pipelines and Data Exploration
Research is naturally an iterative process of revisions, as the PHD Comics strip “Revisions” (@PHDcomics, August 29, 2020) illustrates.
These revisions may result not only from a research advisor’s feedback. They may also come from a domain expert, a collaborator, a stakeholder, a publication peer review, or even a student you are mentoring. In most cases, the interest is in understanding the pipeline in plain English, not by reading code. Yet, surprisingly, MSR tools are not written in a manner that enables literate programming. The result is tables and explanations lost in e-mails or in file name revisioning.
This is unfortunate, as many assumptions, which do not carry over from project to project, are only visible in code and overlooked during revisions. Because R packages rely on an API + Notebook architecture, Kaiaulu naturally supports literate programming. Kaiaulu takes this one step further with project configuration files, distilling project-specific information used in the analysis.
Notebooks also provide a more transparent pipeline, where anyone can understand one step of the pipeline at a time and explore the data transformations at every step of the way. Indeed, analysis discussion and collaboration now begin as a pull request on GitHub, rather than in e-mail threads, and continue there until the work is ready to be merged.
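To make the idea of a project configuration file more concrete, here is a minimal sketch of how a Notebook could read project-specific settings instead of hard-coding them. Everything in it is an assumption made for illustration: the YAML field names (project, version_control, file_extensions) are hypothetical and are not Kaiaulu’s actual configuration schema, and the yaml package is assumed to be installed.

```r
# Hypothetical sketch: project-specific assumptions live in a config,
# not in the analysis code. Field names below are invented for illustration.
library(yaml)

# In practice this text would live in its own file and be loaded with
# yaml::read_yaml("path/to/config.yml"); it is inlined here so the
# sketch is self-contained.
conf_text <- "
project:
  name: example-project
version_control:
  log: path/to/project/.git
file_extensions:
  - R
  - cpp
"

conf <- yaml::yaml.load(conf_text)

# The Notebook then refers only to the configuration, so switching to
# another project means editing the config rather than the pipeline.
gitlog_path <- conf$version_control$log
extensions  <- conf$file_extensions

print(gitlog_path)
print(extensions)
```

With this separation, a reviewer can see the project-specific assumptions in one place, and the same Notebook can be rerun against a different project by pointing it at a different configuration.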
More than R Glue Code
While Kaiaulu started as a means to facilitate obtaining data sources and using them with third-party tools, it has since grown to have its own data representations and analysis pipelines. Refer to one of the Notebooks to get started.