GeoDataspaces: Simplifying Data Management Tasks with Globus

Wednesday, 17 December 2014: 5:45 PM
Tanu Malik, Kyle Chard, Roselyne B Tchoua and Ian Foster, University of Chicago, Chicago, IL, United States
Data and its management are central to the modern scientific enterprise. Typically, geoscientists rely on observations and model output data from several disparate sources (file systems, RDBMS, spreadsheets, remote data sources). Integrated data management solutions that provide intuitive semantics and uniform interfaces, irrespective of the kind of data source, are, however, lacking. Consequently, geoscientists are left to conduct low-level, time-consuming data management tasks individually and to repeat them for each data source they discover, which often results in data handling errors.

In this talk we will describe how the EarthCube GeoDataspace project is improving this situation for seismologists, hydrologists, and space scientists by simplifying some of the existing data management tasks that arise when developing computational models. We will demonstrate a GeoDataspace, bootstrapped with “geounits”, which are self-contained metadata packages that provide a complete description of all data elements associated with a model run, including input/output and parameter files, the model executable, and any associated libraries. Geounits link raw and derived data and associate provenance information describing how the data were derived. We will discuss challenges in establishing geounits and describe machine learning and human annotation approaches that can be used to extract ad hoc and unstructured scientific metadata hidden in binary formats and associate it with data resources and models. We will show how geounits can improve search and discoverability of data associated with model runs. To support this model, we will describe efforts toward creating a scalable metadata catalog that helps to maintain, search, and discover geounits within the Globus network of accessible endpoints. This talk will focus on the issue of creating comprehensive personal inventories of data assets for computational geoscientists, and will describe a publishing mechanism that can be used to feed national, international, or thematic discovery portals.
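
As a rough illustration of the geounit idea described above, the following Python sketch models a geounit as a small metadata record and searches a list of such records. It is not the project's implementation and does not use any Globus API: the class name `Geounit`, its field names, the `search` helper, and all file names are hypothetical, chosen only to mirror the elements listed in the abstract (input/output and parameter files, executable, libraries, provenance).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Geounit:
    """Hypothetical metadata package describing one model run."""
    name: str
    executable: str
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)
    parameter_files: List[str] = field(default_factory=list)
    libraries: List[str] = field(default_factory=list)
    # Assumed provenance layout: each derived file maps to the files it came from.
    provenance: Dict[str, List[str]] = field(default_factory=dict)
    keywords: List[str] = field(default_factory=list)

    def files(self) -> List[str]:
        """Return every data element referenced by this geounit."""
        return [self.executable, *self.inputs, *self.outputs,
                *self.parameter_files, *self.libraries]


def search(catalog: List[Geounit], term: str) -> List[Geounit]:
    """Return geounits whose name, keywords, or file paths contain the term."""
    term = term.lower()
    return [
        unit for unit in catalog
        if any(term in text.lower()
               for text in [unit.name, *unit.keywords, *unit.files()])
    ]


# Illustrative record; every value here is made up for the example.
run = Geounit(
    name="basin-flow-2014",
    executable="bin/flow_model",
    inputs=["raw/gauge_readings.csv"],
    outputs=["out/discharge.nc"],
    parameter_files=["params.yaml"],
    libraries=["libnetcdf.so"],
    provenance={"out/discharge.nc": ["raw/gauge_readings.csv"]},
    keywords=["hydrology"],
)
catalog = [run]
matches = search(catalog, "hydrology")
```

An in-memory list stands in for the metadata catalog here; a scalable catalog of the kind the abstract mentions would need indexing and persistent storage, which this sketch deliberately leaves out.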