PS 43-148 - Using data provenance tools to create reliable R scripts

Wednesday, August 9, 2017
Exhibit Hall, Oregon Convention Center
Emery R. Boose (1), Aaron M. Ellison (1), Elizabeth Fong (2), Matthew K. Lau (1), Barbara S. Lerner (2), Thomas Pasquier (3) and Margo Seltzer (3). (1) Harvard Forest, Harvard University, Petersham, MA; (2) Computer Science, Mt. Holyoke College, South Hadley, MA; (3) Computer Science, Harvard University, Cambridge, MA
Background/Question/Methods

Understanding and replicating a data analysis requires access to the original data and software used. But that may not be enough. For example, for an analysis done in R, it may not be possible to understand or execute the original script (perhaps because of poor documentation or code rot) or to replicate inputs that were generated at runtime (e.g. data downloaded from the web). A solution to this problem lies in data provenance, the precise history of a digital artifact from the point of its creation to its present state. Since few (if any) workflow or scripting environments today capture data provenance, it has had little impact to date on improving the transparency, reliability, and reproducibility of scientific results.

In this project we are developing software tools to make data provenance available to users of the R statistical language, which is widely used by ecologists and environmental scientists for data analysis and visualization. Our tools include RDataTracker, an R package that collects data provenance in the form of a Data Derivation Graph (or DDG) as an R script executes, and DDG Explorer, a separate tool used to visualize and query the resulting DDG. The ultimate goal of our project is to provide an end-to-end system that integrates provenance tools for R, Python, and operating systems in a common framework with common tools.
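To make the idea of a Data Derivation Graph concrete, the following is a minimal conceptual sketch in Python, not RDataTracker's implementation or API. It assumes a simplified model in which each executed step becomes a procedure node, each value becomes a data node, and edges record which data a step read and wrote. The class name ToyDDG, the method record_step, and the temperature data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDDG:
    """Toy provenance graph: data nodes, procedure nodes, and edges between them."""
    data_nodes: dict = field(default_factory=dict)  # data name -> recorded value
    proc_nodes: list = field(default_factory=list)  # step labels, in execution order
    edges: list = field(default_factory=list)       # (source node, target node) pairs

    def record_step(self, label, func, inputs, output):
        """Run func on the named inputs and record what was read and written."""
        args = [self.data_nodes[name] for name in inputs]
        value = func(*args)
        self.proc_nodes.append(label)
        for name in inputs:
            self.edges.append((name, label))   # data -> procedure (input edge)
        self.data_nodes[output] = value
        self.edges.append((label, output))     # procedure -> data (output edge)
        return value


ddg = ToyDDG()
ddg.data_nodes["temps_f"] = [50.0, 68.0, 86.0]  # made-up input data
ddg.record_step("to_celsius", lambda xs: [(x - 32) * 5 / 9 for x in xs],
                ["temps_f"], "temps_c")
ddg.record_step("mean", lambda xs: sum(xs) / len(xs), ["temps_c"], "mean_c")
```

In this toy version the caller must wrap each step explicitly; the tools described above instead collect this information automatically as the script runs.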

Results/Conclusions

Our tools automatically capture data provenance as an R script executes and allow the user to query the resulting DDG in various ways, for example to see how a particular data value was derived or used, or which lines of source code correspond to a particular step. In the DDG one can see at a glance which branch of an if-else statement was executed, how many times a loop was run, and the inputs and outputs of a particular function call, as well as the execution time for each step. Information about the hardware and software environment, as well as installed R packages and sourced scripts, is also collected.
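The question "how was this data value derived?" amounts to walking the provenance graph backward from the value of interest. The sketch below illustrates that idea in Python under stated assumptions: the edge list, the node names, and the function lineage are hypothetical, and the code is not how DDG Explorer implements its queries.

```python
# Hypothetical provenance edges: each pair reads "source feeds into target".
edges = [
    ("raw.csv", "read_data"),
    ("read_data", "temps"),
    ("temps", "filter_missing"),
    ("filter_missing", "clean_temps"),
    ("clean_temps", "compute_mean"),
    ("compute_mean", "mean_temp"),
    ("temps", "count_rows"),
    ("count_rows", "n_rows"),
]


def lineage(target, edges):
    """Return every node the target was derived from, nearest first."""
    parents = {}
    for source, dest in edges:
        parents.setdefault(dest, []).append(source)
    seen, queue = [], [target]
    while queue:
        node = queue.pop(0)
        for parent in parents.get(node, []):
            if parent not in seen:  # skip nodes already visited
                seen.append(parent)
                queue.append(parent)
    return seen


ancestors = lineage("mean_temp", edges)
```

Here the unrelated branch (count_rows and n_rows) is left out of the result, because it played no part in producing mean_temp. Reversing the edge direction in the same traversal would answer the complementary question of where a value was used.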

One striking benefit of these tools is support for script debugging. By setting a breakpoint, one can see in the DDG all the data values and steps up to the breakpoint. This feature removes the need, for example, to insert print statements and rerun the script, and it ensures that the user sees the correct intermediate values for that particular execution of the script.
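The debugging benefit can be illustrated with a small Python sketch, assuming a script modeled as an ordered list of named steps. The function run_with_provenance, its break_at parameter, and the sample steps are hypothetical; they mimic the idea of stopping at a breakpoint and inspecting recorded values, not the actual breakpoint mechanism of the tools.

```python
def run_with_provenance(steps, initial, break_at=None):
    """Run steps in order, recording each intermediate value; stop before break_at."""
    record = [("input", initial)]
    value = initial
    for label, func in steps:
        if label == break_at:
            break  # halt before running the step named as the breakpoint
        value = func(value)
        record.append((label, value))
    return record


steps = [
    ("drop_negative", lambda xs: [x for x in xs if x >= 0]),
    ("square", lambda xs: [x * x for x in xs]),
    ("total", lambda xs: sum(xs)),
]

# Stop before the "total" step and inspect everything recorded so far.
record = run_with_provenance(steps, [3, -1, 4], break_at="total")
for label, value in record:
    print(label, value)
```

Because the intermediate values come from the recorded run itself, no print statements need to be added to the steps and no rerun is required to inspect them.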