IN43D-07
Lowering the Barrier to Reproducible Research by Publishing Provenance from Common Analytical Tools

Thursday, 17 December 2015: 15:10
2020 (Moscone West)
Matthew B. Jones1, Peter Slaughter1, Lauren Walker1, Christopher S. Jones1, Paolo Missier2, Bertram Ludäscher3, Yang Cao3, Timothy McPhillips3 and Mark Schildhauer1, (1)National Center for Ecological Analysis and Synthesis, Santa Barbara, CA, United States, (2)Newcastle University, Newcastle Upon Tyne, United Kingdom, (3)University of Illinois at Urbana Champaign, Urbana, IL, United States
Abstract:
Scientific provenance describes the authenticity, origin, and processing history of research products and promotes scientific transparency by detailing the steps in computational workflows that produce derived products. These products include papers, findings, input data, the software used to perform computations, and derived data and visualizations. The geosciences community values this type of information and, at least in theory, strives to base conclusions on computationally replicable findings. In practice, capturing detailed provenance is laborious and thus has been a low priority; beyond a lab notebook describing methods and results, few researchers capture and preserve detailed records of scientific provenance.

We have built tools for capturing and publishing provenance that integrate into analytical environments that are in widespread use by geoscientists (R and Matlab). These tools lower the barrier to provenance generation by automating capture of critical information as researchers prepare data for analysis, develop, test, and execute models, and create visualizations. The `recordr` library in R and the `matlab-dataone` library in Matlab provide shared functions to capture provenance with minimal changes to normal working procedures. Researchers can capture both scripted and interactive sessions, tag and manage these executions as they iterate over analyses, and then prune and publish provenance metadata and derived products to the DataONE federation of archival repositories. Provenance traces conform to the ProvONE model extension of W3C PROV, enabling interoperability across tools and languages. The capture system supports fine-grained versioning of science products and provenance traces. By assigning global identifiers such as DOIs, researchers can cite the computational processes used to reach findings. Finally, DataONE has built a web portal to search, browse, and clearly display provenance relationships between input data, the software used to execute analyses and models, and derived data and products that arise from these computations. This provenance is vital to interpretation and understanding of science, and provides an audit trail that researchers can use to understand and replicate computational workflows in the geosciences.
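
To make the idea of a provenance trace concrete, the following is a minimal Python sketch of what one captured execution might look like as PROV-style triples. It is an illustration only and is not the `recordr` or `matlab-dataone` API: the function names `content_id` and `record_execution`, the SHA-256 identifier scheme, and the file names are all hypothetical. The predicate names (`prov:used`, `prov:wasGeneratedBy`, `prov:wasDerivedFrom`, `prov:qualifiedAssociation`, `prov:hadPlan`) come from W3C PROV-O, and the classes `provone:Execution`, `provone:Program`, and `provone:Data` come from ProvONE; how the actual tools serialize them is assumed here, not taken from the abstract.

```python
import hashlib
import uuid


def content_id(data: bytes) -> str:
    """Hypothetical identifier scheme: a SHA-256 content hash.

    A production system would more likely mint DOIs or UUIDs.
    """
    return "sha256:" + hashlib.sha256(data).hexdigest()


def record_execution(program: str, inputs: dict, outputs: dict) -> list:
    """Return (subject, predicate, object) triples describing one run.

    `inputs` and `outputs` map file names to their byte content.
    """
    # Each run gets its own identifier so repeated runs stay distinct.
    execution = "urn:uuid:" + str(uuid.uuid4())
    association = execution + "#association"
    triples = [
        (execution, "rdf:type", "provone:Execution"),
        (program, "rdf:type", "provone:Program"),
        # PROV-O links an activity to its plan through a qualified association.
        (execution, "prov:qualifiedAssociation", association),
        (association, "prov:hadPlan", program),
    ]

    input_ids = [content_id(data) for data in inputs.values()]
    for in_id in input_ids:
        triples.append((in_id, "rdf:type", "provone:Data"))
        triples.append((execution, "prov:used", in_id))

    for data in outputs.values():
        out_id = content_id(data)
        triples.append((out_id, "rdf:type", "provone:Data"))
        triples.append((out_id, "prov:wasGeneratedBy", execution))
        # Simplifying assumption: every output derives from every input.
        for in_id in input_ids:
            triples.append((out_id, "prov:wasDerivedFrom", in_id))

    return triples


if __name__ == "__main__":
    trace = record_execution(
        "analysis.R",                          # hypothetical script name
        {"raw.csv": b"site,temp\nA,12.5\n"},   # hypothetical input
        {"summary.csv": b"site,mean\nA,12.5\n"},  # hypothetical output
    )
    for triple in trace:
        print(triple)
```

Because the trace is a flat list of triples, it could be serialized to RDF and exchanged between tools, which is the kind of interoperability a shared model such as ProvONE is meant to provide.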