IN33C-3783:
Publishing Heterogeneous Multi-source Research Data with Rich Metadata and Round-Trip Provenance: Opportunities and Challenges as Data Services form an Ecosystem

Wednesday, 17 December 2014
Jim Myers, University of Michigan, Ann Arbor, MI, United States
Abstract:
Modern research, particularly in areas related to coupled systems, big data analyses, and model development/validation, involves heterogeneous data inputs and derived data products: data come from multiple services, processing is complex and variable, and the derived products don’t easily fit back into any of the source repositories. Significant progress is being made in supporting the assembly, management, and publication of such complex collections, thereby increasing data reproducibility. The Sustainable Environment – Actionable Data (SEAD) project within the U.S. National Science Foundation’s DataNet program is one such service. SEAD is typical of emerging best practice in managing individual data items (typically files), which are organized into collections, described using an open set of standard vocabularies, and then published as rich compound research objects. At a minimum, such objects expose basic discovery metadata that can be used to register them with third-party search engines, but they also retain rich descriptive information and provenance at the fine-grained level. In the past year, the SEAD project team has worked to provide mechanisms to ingest data directly from other data services and to interact with applications, thereby maintaining provenance links that would otherwise be lost through a download/upload trip through a file system. Further, in coordination with other DataNet projects, we’ve worked to capture and display metadata obtainable from the source service, and to infer research-object-to-research-object provenance from file-level information to support enhanced discovery. These advances begin to realize the vision of researchers working across an ecosystem of data services, but they have also highlighted a range of practical and theoretical issues that arise when it becomes possible to track research activities across multiple research/publication/data lifecycles.
This presentation will describe the work that has been done in SEAD and between DataNet projects to allow round-trip integration from data sources through subsequent data publications and will highlight the opportunities and challenges involved in creating richer, higher-value services that leverage the ‘round-trip’ provenance and metadata that become available.