IN31C-1779
Reproducibility and Knowledge Capture Architecture for the NASA Earth Exchange (NEX)
Wednesday, 16 December 2015
Poster Hall (Moscone South)
Petr Votava, NASA Ames Research Center, Moffett Field, CA, United States; University Corporation at Monterey Bay, Seaside, CA, United States and NASA Earth Exchange (NEX)
Abstract:
NASA Earth Exchange (NEX) is a data, supercomputing and knowledge collaboratory that houses NASA satellite, climate and ancillary data, where a focused community can come together to address large-scale challenges in Earth sciences. As NEX has grown into a platform for analysis, experiments and production of data on the order of petabytes in volume, it has become increasingly important to let users easily retrace their steps, identify which datasets were produced by which process or chain of processes, and readily reproduce their results. This can be a tedious and difficult task even for a small project, and it is almost impossible for large processing pipelines. For example, the NEX Landsat pipeline is deployed to process hundreds of thousands of Landsat scenes in a non-linear production workflow with many-to-many mappings of files between 40 separate processing stages, in which over 100 million processes are executed. At this scale it is nearly impossible to examine the entire provenance of each file, let alone reproduce it. We have developed an initial solution for the NEX system: a transparent knowledge capture and reproducibility architecture that requires no special code instrumentation or other action on the user's part. Users can automatically capture their work through a transparent provenance tracking system, and the information can subsequently be queried and/or converted into workflows. The provenance information is streamed to a MongoDB document store, and a subset is converted to an RDF format and inserted into our triple-store. The triple-store already contains semantic information about other aspects of the NEX system, and adding provenance enhances the ability to relate workflows and data to users, locations, projects and other NEX concepts that can be queried in a standard way.
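To make the document-to-RDF step concrete, the following is a minimal sketch of how one captured provenance record might be reduced to a handful of triples. The abstract does not describe the record schema or the RDF vocabulary used on NEX, so the field names (`process_id`, `user`, `inputs`, `outputs`), the `example.org` namespace and the use of W3C PROV-O terms are all assumptions made for illustration. The record is a plain dictionary of the kind a document store could hold, and the output is serialized as N-Triples using only the Python standard library.

```python
from urllib.parse import quote

# W3C PROV-O namespace (a real vocabulary; its use on NEX is an assumption).
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Hypothetical base namespace for illustration only.
BASE = "http://example.org/nex/"


def iri(kind, ident):
    """Build an IRI term, percent-encoding the identifier."""
    return f"<{BASE}{kind}/{quote(str(ident), safe='')}>"


def record_to_triples(record):
    """Convert one provenance record (hypothetical schema) to triples."""
    proc = iri("process", record["process_id"])
    triples = [
        (proc, f"<{RDF_TYPE}>", f"<{PROV}Activity>"),
        (proc, f"<{PROV}wasAssociatedWith>", iri("user", record["user"])),
    ]
    # Each input file was used by the process.
    for path in record["inputs"]:
        triples.append((proc, f"<{PROV}used>", iri("file", path)))
    # Each output file was generated by the process.
    for path in record["outputs"]:
        triples.append((iri("file", path), f"<{PROV}wasGeneratedBy>", proc))
    return triples


def to_ntriples(triples):
    """Serialize triples, one N-Triples statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Example record with invented values.
example_record = {
    "process_id": "p-0001",
    "user": "alice",
    "inputs": ["/data/scene_a.tif", "/data/scene_b.tif"],
    "outputs": ["/data/mosaic.tif"],
}

if __name__ == "__main__":
    print(to_ntriples(record_to_triples(example_record)))
```

Keeping only the process, user and file relationships in RDF while the full record stays in the document store mirrors the "subset" described above; which fields actually belong in that subset is a design decision the abstract leaves open.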
The provenance system can track data throughout NEX and across a number of executions, and it can recreate and re-execute the entire history to reproduce the results. The information can also be used to automatically create individual workflow components and full workflows that can be visually examined, modified, executed and extended by researchers. This provides a key component for accelerating research through knowledge capture and scientific reproducibility on NEX.
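As an illustration of how a file's history could be recreated from captured provenance, the sketch below walks backwards from a target file to every process that contributed to it and returns an order in which those processes could be re-executed. It is not the NEX implementation: the record fields (`process_id`, `inputs`, `outputs`) and the file names are hypothetical, and the ordering relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def replay_order(records, target):
    """Return process ids needed to rebuild `target`, upstream first.

    `records` is a list of dicts with hypothetical keys
    "process_id", "inputs" and "outputs".
    """
    # Map each output file to the record that produced it.
    # If a file was written more than once, the last record wins.
    producer = {out: rec for rec in records for out in rec["outputs"]}

    deps = {}          # process id -> set of upstream process ids
    stack = [target]   # files whose lineage still has to be traced
    while stack:
        path = stack.pop()
        rec = producer.get(path)
        # Skip raw inputs (no producer) and processes already visited.
        if rec is None or rec["process_id"] in deps:
            continue
        upstream = set()
        for inp in rec["inputs"]:
            src = producer.get(inp)
            if src is not None:
                upstream.add(src["process_id"])
            stack.append(inp)
        deps[rec["process_id"]] = upstream

    # Topological order guarantees every process runs after its inputs exist.
    return list(TopologicalSorter(deps).static_order())


# Invented four-stage example with one unrelated process (p5).
example_records = [
    {"process_id": "p1", "inputs": ["raw.tif"], "outputs": ["toa.tif"]},
    {"process_id": "p2", "inputs": ["toa.tif"], "outputs": ["ndvi.tif"]},
    {"process_id": "p3", "inputs": ["toa.tif"], "outputs": ["mask.tif"]},
    {"process_id": "p4", "inputs": ["ndvi.tif", "mask.tif"],
     "outputs": ["composite.tif"]},
    {"process_id": "p5", "inputs": ["other.tif"], "outputs": ["unrelated.tif"]},
]

if __name__ == "__main__":
    print(replay_order(example_records, "composite.tif"))
```

The same dependency map could serve as the skeleton of a generated workflow, with each process id becoming a workflow component and each upstream set becoming its incoming edges.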