IN13C-1853
Enabling dynamic access to dynamic petascale Earth Systems and Environmental data collections is easy: citing and reproducing the actual data extracts used in research publications is NOT

Monday, 14 December 2015
Poster Hall (Moscone South)
Jingbo Wang (1), Kelsey A. Druken (2), Ben James Kingston Evans (1), Claire Trenham (1) and Lesley A. Wyborn (2); (1) Australian National University, Canberra, Australia; (2) Australian National University, Canberra, ACT, Australia
Abstract:
The National Computational Infrastructure (NCI) at the Australian National University (ANU) has collocated over 10 PB of national and international Earth Systems and Environmental data assets within an HPC facility to create the National Environmental Research Data Interoperability Platform (NERDIP). Data are replicated to, or produced at, NCI, and in many cases are further processed into higher-level data products. Individual data sets within these collections range from multi-petabyte climate models and large-volume raster arrays down to gigabyte-sized, ultra-high-resolution data sets.

All data are quality assured prior to being ‘published’ and made accessible as services. Persistent identifiers are assigned during publishing at both the collection and data set level: the granularity and version control on persistent identifiers depend on the data set.

However, most NERDIP collections are dynamic: new data are being appended, models and derivative products are being revised with new input data, or existing data are changed as processing methods improve. Further, because the data are accessible as services, researchers can log in and dynamically create user-defined subsets for specific research projects, and such extracts inevitably underpin traditional ‘publications’. Reproducing these exact data extracts can be difficult, and for the very largest data sets, preserving a copy of each extract is out of the question.
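For illustration, a service-side extract of this kind might be created as in the following minimal Python/xarray sketch. The endpoint URL, variable name and coordinate ranges are hypothetical, not actual NERDIP paths:

    # Hypothetical sketch: a user-defined subset pulled from a remote
    # data service (e.g. an OPeNDAP endpoint) with xarray. The URL,
    # variable name and coordinate ranges are illustrative only.
    import xarray as xr

    URL = "https://dap.example.org/thredds/dodsC/climate/tasmax.nc"  # hypothetical

    ds = xr.open_dataset(URL)  # opens the remote data set lazily

    # Server-side subset: only the requested slab is transferred.
    extract = ds["tasmax"].sel(
        time=slice("2000-01-01", "2000-12-31"),
        lat=slice(-44.0, -10.0),
        lon=slice(112.0, 154.0),
    )
    extract.to_netcdf("tasmax_subset_2000.nc")  # local copy of the extract

Nothing in the extract itself records which version of the data set was queried or when, which is precisely the reproducibility gap described above.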

A solution is for the researcher to use provenance workflows that, at a minimum, capture the version of the data set used, the query, and the time of extraction. In parallel, the data provider needs to implement version control on the data and deploy tracking systems that time-stamp when new data are appended or existing data are modified, and record what those changes are. Where, when and how persistent identifiers are minted on these large and dynamically changing data sets is still open to debate.
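A minimal sketch of such a researcher-side provenance record, assuming a simple JSON sidecar format with hypothetical field names, might look like:

    # Sketch of the minimum provenance the abstract calls for: the data
    # set version, the exact query, and the time of extraction, stored
    # next to the extract. Field names here are assumptions.
    import hashlib
    import json
    from datetime import datetime, timezone

    def record_extract_provenance(dataset_id, dataset_version, query, extract_path):
        """Write a JSON sidecar describing how an extract was produced."""
        with open(extract_path, "rb") as f:
            checksum = hashlib.sha256(f.read()).hexdigest()
        record = {
            "dataset_id": dataset_id,            # e.g. a collection-level identifier
            "dataset_version": dataset_version,  # version queried, per the provider
            "query": query,                      # the exact subsetting request
            "extracted_at": datetime.now(timezone.utc).isoformat(),
            "extract_sha256": checksum,          # fixity check for the local copy
        }
        with open(extract_path + ".provenance.json", "w") as f:
            json.dump(record, f, indent=2)
        return record

Such a sidecar lets a later reader re-issue the same query against the same data set version, rather than preserving the (possibly very large) extract itself.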