Sharing Site-Based Research Data: Standardizing and Packaging for Reuse

Thursday, 18 December 2014
Timothy DiLauro (1), Jacob G. Jett (2), Sean Gordon (2) and Andrea Thomer (2); (1) Johns Hopkins University, Baltimore, MD, United States; (2) University of Illinois at Urbana-Champaign, Urbana, IL, United States
One of the key aims of the Institute of Museum and Library Services-funded Site-Based Data Curation (SBDC) Project[1] is to increase the reuse of data gathered or generated through research at sites like Yellowstone National Park (YNP) by improving the data's usefulness, discoverability, and accessibility. Toward this goal, SBDC worked closely with a geobiologist conducting fieldwork at YNP to explore existing data practices, and held a two-day stakeholder workshop at the park with some of the scientists who study it and the National Park Service (NPS) staff who support research activities there.

The resulting workshop report[2] recommends, among other things, more detailed and more consistent documentation of the data and of the sampling and analysis methods used to produce it. The report proposes a set of core metadata elements and domain-specific extension elements (Appendix 9) to provide a more coherent view into the data.

Armed with these findings, we are pursuing approaches that will reduce the effort, complexity, and risk of adopting these recommendations. During our investigation, we discovered the EarthChem templates[3], into which we began mapping the geobiologist’s data. We find the Vent Fluids template particularly appropriate and adaptable, as many of the high-interest features at YNP are shallow water vents. We are currently building an EarthChem-compatible template that will capture the environmental context of microbes, tracing their identities from water sample through to GenBank entry.

Given the variety of potential targets (e.g., site, institutional, and domain repositories; visualization and presentation tools), we decided to record the data in a structured package, which we can transform for a given target. We are using the Data Conservancy’s Packaging Tool[4], which provides an intuitive file system view, stores file checksums, and serializes a graph of relationships. This permits a researcher to conveniently group desired data products into a single self-documenting TAR or Zip file. Initial target repositories are NPS’s IRMA, Data Conservancy, and SEAD/Medici.
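
The packaging idea described above (checksummed files plus a serialized graph of relationships, bundled into one TAR file) can be illustrated with a minimal sketch. This is not the Data Conservancy Packaging Tool or its package format; the manifest layout, the function names build_package and verify_package, the file names, and the relationship predicate are all hypothetical, chosen only to show the general pattern using the Python standard library.

```python
import hashlib
import io
import json
import tarfile
from typing import Dict, List, Tuple


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_package(files: Dict[str, bytes],
                  relationships: List[Tuple[str, str, str]]) -> bytes:
    """Bundle files, their checksums, and a relationship graph into a TAR.

    The manifest layout is a hypothetical illustration, not the format
    used by any particular packaging tool.
    """
    manifest = {
        "files": {
            name: {"sha256": sha256_hex(data), "size": len(data)}
            for name, data in sorted(files.items())
        },
        # Each relationship is a (subject, predicate, object) triple.
        "relationships": [
            {"subject": s, "predicate": p, "object": o}
            for s, p, o in relationships
        ],
    }
    manifest_bytes = json.dumps(manifest, indent=2, sort_keys=True).encode("utf-8")

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        members = [("manifest.json", manifest_bytes)] + sorted(files.items())
        for name, data in members:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def verify_package(package: bytes) -> bool:
    """Recompute each file's checksum and compare it with the manifest."""
    with tarfile.open(fileobj=io.BytesIO(package), mode="r") as archive:
        manifest = json.load(archive.extractfile("manifest.json"))
        for name, entry in manifest["files"].items():
            data = archive.extractfile(name).read()
            if sha256_hex(data) != entry["sha256"]:
                return False
    return True


if __name__ == "__main__":
    # Hypothetical data products; the names and contents are placeholders.
    example_files = {
        "samples/vent_fluids.csv": b"sample_id,temperature_c\nS1,72.5\n",
        "sequences/isolates.fasta": b">isolate_1\nACGT\n",
    }
    example_relationships = [
        ("sequences/isolates.fasta", "isDerivedFrom", "samples/vent_fluids.csv"),
    ]
    package = build_package(example_files, example_relationships)
    print("package verifies:", verify_package(package))
```

Keeping the checksums and relationships in a manifest inside the archive is what makes such a package self-documenting: a receiving repository can check file integrity and recover how the data products relate to one another without any side channel.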