Large-Scale Data Collection Metadata Management at the National Computation Infrastructure

Monday, 15 December 2014: 5:30 PM
Jingbo Wang (1), Ben James Kingston Evans (1), Irina Bastrakova (2), Gerry Ryder (3), Julia Martin (3), Daisy Duursma (4), Kashif Gohar (5), Tim Mackey (6), Matt Paget (7) and Guru Siddeswara (8); (1) Australian National University, Canberra, ACT, Australia, (2) Geoscience Australia, Canberra, ACT, Australia, (3) Australian National Data Service, Canberra, Australia, (4) Macquarie University, Sydney, Australia, (5) Australian National University, Canberra, Australia, (6) Geoscience Australia, Canberra, Australia, (7) Commonwealth Scientific and Industrial Research Organisation - CSIRO, Canberra, Australia, (8) University of Queensland, St Lucia, Australia
Data collection management has become an essential activity at the National Computational Infrastructure (NCI) in Australia. NCI’s partners (CSIRO, Bureau of Meteorology, Australian National University, and Geoscience Australia), supported by the Australian Government and the Research Data Storage Infrastructure (RDSI), have established a national data resource that is co-located with high-performance computing (HPC). This paper addresses the metadata management of these data assets over their lifetime.

NCI manages 36 data collections (10+ PB) spanning earth system sciences, climate and weather model data assets and products, earth and marine observations and products, geosciences, terrestrial ecosystems, water management and hydrology, astronomy, social science and biosciences. The data is largely sourced from NCI partners, who are the custodians of many of the national scientific records, and from major research community organisations. It is made available in an HPC and data-intensive environment: a ~56,000-core supercomputer, virtual labs on a 3,000-core cloud system, and data services. Assembling these large national assets has created new opportunities to harmonise the data collections into a powerful cross-disciplinary resource.

To support the overall management, a Data Management Plan (DMP) has been developed to record the workflows, procedures, key contacts and responsibilities. The DMP has fields that can be exported to the ISO 19115 schema and to the collection-level catalogue in GeoNetwork. The subset-level or file-level metadata catalogues are linked to the collection level through parent-child relationships defined using UUIDs. A number of tools have been developed to support interactive metadata management, bulk loading of data, and computational workflows or data pipelines.
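
The parent-child linkage described above can be illustrated with a minimal sketch. It assumes each catalogue record carries its own UUID and, for subset- or file-level records, the UUID of its parent collection. The class name MetadataRecord, its fields, and the helper children_of are hypothetical names chosen for illustration; they are not NCI's tooling or the GeoNetwork API.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetadataRecord:
    """Hypothetical catalogue record; field names are illustrative only."""
    title: str
    # None marks a collection-level record; otherwise the parent's UUID.
    parent_uuid: Optional[str] = None
    # Each record receives its own randomly generated UUID.
    record_uuid: str = field(default_factory=lambda: str(uuid.uuid4()))


def children_of(parent: MetadataRecord,
                records: List[MetadataRecord]) -> List[MetadataRecord]:
    """Return the records whose parent_uuid points at the given record."""
    return [r for r in records if r.parent_uuid == parent.record_uuid]


# One collection-level record with two file-level children (made-up titles).
collection = MetadataRecord(title="Example collection")
files = [
    MetadataRecord(title="example_file_1.nc", parent_uuid=collection.record_uuid),
    MetadataRecord(title="example_file_2.nc", parent_uuid=collection.record_uuid),
]
catalogue = [collection] + files
linked = children_of(collection, catalogue)
```

Storing only the parent's UUID on each child keeps the file-level catalogue independent of the collection-level one: either can be re-harvested without rewriting the other, provided the UUIDs stay stable.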

NCI creates persistent identifiers for each of the assets. Each data collection is tracked over its lifetime, and the recognition of its data providers, data owners, data generators and data aggregators is kept up to date. Digital Object Identifiers (DOIs) are assigned through the Australian National Data Service (ANDS): once the data has been quality assured, a DOI is minted and the metadata record is updated. NCI’s data citation policy establishes the relationship between research outcomes, data providers, and the data.
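
The ordering described above (quality assurance first, then DOI minting, then a metadata update) can be sketched as a simple guard. This is an illustrative sketch only: CollectionRecord, publish, and placeholder_mint are hypothetical names, the minting step is a caller-supplied function standing in for whatever external service issues the identifier, and the identifier it returns is a placeholder rather than a real DOI.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CollectionRecord:
    """Hypothetical collection-level metadata record."""
    title: str
    quality_assured: bool = False
    doi: Optional[str] = None


def publish(record: CollectionRecord,
            mint: Callable[[str], str]) -> CollectionRecord:
    """Mint a DOI only after quality assurance, then update the record."""
    if not record.quality_assured:
        raise ValueError("DOI minting is deferred until quality assurance is complete")
    if record.doi is None:  # an identifier is persistent, so never mint twice
        record.doi = mint(record.title)
    return record


def placeholder_mint(title: str) -> str:
    """Stand-in for an external minting service; returns a fake identifier."""
    return "10.0000/placeholder-" + title.lower().replace(" ", "-")


record = CollectionRecord(title="Example collection")
record.quality_assured = True  # set once the QA checks have passed
publish(record, placeholder_mint)
```

Passing the minting step in as a function keeps the guard logic testable without contacting any external service.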