Reproducible science via semantics and provenance for ecological data
Reproducibility is critical for ecology and allied disciplines, both because of the inherent complexity of the inference chain needed to advance ecology and because of the importance of ecological results to challenges facing society. At a minimum, scientific reproducibility requires the ability to locate, understand, and access the data and visualization products of analysis and modeling that form the basis for scientific conclusions. However, the majority of methods sections in journal publications are inadequate for reproducibility, and current data repositories generally support only minimal descriptions of data provenance and semantics. Through usability studies and design sessions, we designed and implemented new software for ecological scientists that enables tracking the data inputs and outputs of analyses, storing and documenting software, and showing the derivation history of new data produced in synthesis and analysis. In addition, we extended the KNB Data Repository and the DataONE systems to support data annotations that clarify the semantics of measurements to aid in data discovery and interpretation.
Tracking provenance and semantics for heterogeneous data improves the accessibility of data, because it allows scientists to link data and analytical products to the manuscripts that present these results and to the computational processes that produced them. The DataONE provenance and semantics system now provides the ability to search for source and derived data products, and to label those products with appropriate semantics. Within this system, each data product is linked to the source data from which it was derived and to the computational processes that were used to transform it. In addition, each data object within the system can be labeled with an annotation that explicitly defines the semantics of the measurements in that object. We show that these software improvements provide a rich environment for understanding the context of ecological research conclusions, and that our new search and browse capabilities are an effective means to archive the complete computational record for ecological research studies.
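The linkage described above, in which each data product points to its source data and to the process that produced it, can be illustrated with a small sketch. This is not the DataONE or KNB API: the `DataObject` class, the `trace_sources` and `find_derived` helpers, the file names, and the annotation term are all hypothetical, chosen only to show how searching for source and derived products might work over such links.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataObject:
    """Hypothetical record for one data product in a provenance catalog."""
    identifier: str
    derived_from: List[str] = field(default_factory=list)  # source data identifiers
    generated_by: Optional[str] = None                      # script or process that produced it
    annotations: Dict[str, str] = field(default_factory=dict)  # measurement -> semantic term


def trace_sources(catalog: Dict[str, DataObject], identifier: str) -> List[str]:
    """Return every upstream source of a product, nearest sources first."""
    found: List[str] = []
    queue = list(catalog[identifier].derived_from)
    while queue:
        current = queue.pop(0)
        if current in found:
            continue  # already visited through another derivation path
        found.append(current)
        queue.extend(catalog[current].derived_from)
    return found


def find_derived(catalog: Dict[str, DataObject], identifier: str) -> List[str]:
    """Return the products derived directly from the given object."""
    return sorted(
        obj.identifier for obj in catalog.values() if identifier in obj.derived_from
    )


# Illustrative catalog: raw data -> cleaned data -> figure (all names are made up).
catalog = {
    "raw_counts.csv": DataObject(
        "raw_counts.csv",
        annotations={"count": "hypothetical-ontology:OrganismCount"},
    ),
    "cleaned_counts.csv": DataObject(
        "cleaned_counts.csv",
        derived_from=["raw_counts.csv"],
        generated_by="clean.R",
    ),
    "summary_plot.png": DataObject(
        "summary_plot.png",
        derived_from=["cleaned_counts.csv"],
        generated_by="plot.R",
    ),
}

upstream = trace_sources(catalog, "summary_plot.png")
downstream = find_derived(catalog, "raw_counts.csv")
print(upstream)
print(downstream)
```

Storing only the direct derivation links keeps each record small; the full lineage of a figure is recovered by walking those links, which is one plausible way a search for source or derived products could be answered.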