Rosetta: Ensuring the Preservation and Usability of ASCII-based Data into the Future

Monday, 14 December 2015
Poster Hall (Moscone South)
Sean Cody Arms and Mohan K Ramamurthy, University Corporation for Atmospheric Research, Unidata, Boulder, CO, United States
Field data obtained from dataloggers often take the form of comma separated value (CSV) ASCII text files. While ASCII based data formats have positive aspects, such as the ease of accessing the data from disk and the wide variety of tools available for data analysis, there are some drawbacks, especially when viewing the situation through the lens of data interoperability and stewardship.

The Unidata data translation tool, Rosetta, is a web-based service that provides an easy, wizard-based interface for data collectors to transform their datalogger generated ASCII output into Climate and Forecast (CF) compliant netCDF files following the CF-1.6 discrete sampling geometries. These files are complete with metadata describing what data are contained in the file, the instruments used to collect the data, and other critical information that otherwise may be lost in one of many README files. The choice of the machine readable netCDF data format and data model, coupled with the CF conventions, ensures long-term preservation and interoperability, and that future users will have enough information to responsibly use the data. However, with the understanding that the observational community appreciates the ease of use of ASCII files, methods for transforming the netCDF back into a CSV or spreadsheet format are also built-in.

One benefit of translating ASCII data into a machine readable format that follows open community-driven standards is that they are instantly able to take advantage of data services provided by the many open-source data server tools, such as the THREDDS Data Server (TDS). While Rosetta is currently a stand-alone service, this talk will also highlight efforts to couple Rosetta with the TDS, thus allowing self-publishing of thoroughly documented datasets by the data producers themselves.