Retracing our steps in the analysis of data
The ability to replicate results is fundamental to all empirical sciences. It should be possible, for example, to repeat an experiment in the field or laboratory, collect new data, and obtain results comparable to the original results. It should also be possible to apply the same analysis to the original data and obtain results identical to the originals (or, more precisely, “nearly identical” for analyses performed with a computer, since computers are physical devices subject to empirical limitations and for various reasons may not generate exactly the same answer). Replicating a data analysis requires a detailed derivation history of exactly what happened, including a complete record of all data artifacts and processes used along the way. Few if any workflow or scripting environments available today capture all of this information (known as provenance); most software used for data analysis is optimized for performance and ease of use, with limited ability to track what was done. Solving this challenging problem requires addressing both theoretical issues (what information must be collected to fully describe a data analysis and ensure its repeatability) and practical issues (how to build a system that works for scientists without unduly degrading performance).
Our research has focused on developing methods for creating provenance and storing it in a database as a data analysis executes. The provenance can be represented mathematically as a directed acyclic graph (DAG) in which the nodes are data or process instances (e.g., a particular set of numerical values or a particular R script) and the directed edges record the derivation history (e.g., the lineage of data artifacts or of control and execution steps). The resulting structure, which we call a Data Derivation Graph (DDG), can then be queried to answer questions a scientist may have about a particular analysis (e.g., which input values were used to calculate a certain output value, or which output values were possibly corrupted by a bad input value). The DDG offers a promising solution to the problem of replicating data analyses, but because of its size and complexity, many challenges remain in how best to store, query, and visualize it.
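To make the idea concrete, the following is a minimal sketch (not the authors' actual implementation) of a DDG as a directed acyclic graph, with the two lineage queries mentioned above: tracing backward to find the inputs that contributed to an output, and tracing forward to find outputs possibly corrupted by a bad input. All node names and the `DDG` class itself are hypothetical illustrations.

```python
# A minimal sketch of a Data Derivation Graph (DDG): a DAG whose nodes are
# data or process instances and whose edges record derivation history.
# This is an illustrative toy, not the system described in the text.
from collections import defaultdict

class DDG:
    def __init__(self):
        self.derived_from = defaultdict(set)  # node -> nodes it was derived from
        self.derives = defaultdict(set)       # inverse edges, for forward tracing

    def add_edge(self, node, source):
        """Record that `node` was derived from `source`."""
        self.derived_from[node].add(source)
        self.derives[source].add(node)

    def _reach(self, start, edges):
        """All nodes reachable from `start` by following `edges`."""
        seen, stack = set(), [start]
        while stack:
            for m in edges[stack.pop()]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        return seen

    def lineage_of(self, output):
        """Backward query: every data/process node in `output`'s derivation history."""
        return self._reach(output, self.derived_from)

    def affected_by(self, node):
        """Forward query: every downstream node possibly corrupted by `node`."""
        return self._reach(node, self.derives)

# Hypothetical analysis: raw.csv -> clean.R run -> clean_data -> model.R run -> fit
g = DDG()
g.add_edge("clean.R run", "raw.csv")
g.add_edge("clean_data", "clean.R run")
g.add_edge("model.R run", "clean_data")
g.add_edge("fit", "model.R run")

assert "raw.csv" in g.lineage_of("fit")      # raw.csv was used to compute fit
assert "fit" in g.affected_by("raw.csv")     # a bad raw.csv could corrupt fit
```

Because the graph is acyclic, both queries terminate and run in time linear in the number of edges visited; a database-backed DDG would answer the same questions with recursive queries over stored edges rather than an in-memory traversal.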