ASDF: A New Adaptable Data Format for Seismology Suitable for Large-Scale Workflows

Thursday, 18 December 2014
Lion Krischer (1), James A. Smith (2), Alessandro Spinuso (3) and Jeroen Tromp (2); (1) Ludwig Maximilian University of Munich, Munich, Germany, (2) Princeton University, Princeton, NJ, United States, (3) Royal Netherlands Meteorological Institute, De Bilt, 3730, Netherlands
Increases in the amount of available data and in computational power open the possibility of tackling ever larger and more complex problems. This comes with a slew of new challenges, two of which are the need for more efficient use of available resources and the need for sensible organization and storage of the data. Both must be addressed in order to scale a problem properly, and both are frequent bottlenecks in large seismic inversions, whether they use ambient noise or more traditional techniques.

We present recent developments and ideas regarding a new data format, named ASDF (Adaptable Seismic Data Format), that is intended for all branches of seismology and helps with the aforementioned problems. The key idea is to store all information necessary to fully understand a set of data in a single file. This enables the construction of self-explaining and exchangeable data sets, facilitating collaboration on large-scale problems. We incorporate the existing metadata standards FDSN StationXML and QuakeML, together with waveform and auxiliary data, into a common container based on the HDF5 standard. A further critical component of the format is the storage of provenance information, meaning information about the history of the data, as an extension of W3C PROV; this assists with the general problem of reproducibility.
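
As a rough illustration of the single-file idea, the sketch below models such a container with nested Python dictionaries standing in for HDF5 groups. The group names (`QuakeML`, `Waveforms`, `AuxiliaryData`, `Provenance`), the station code `XX.STA01`, the tag `raw_recording`, and the helper `list_paths` are assumptions made for this example, not the layout defined by the format.

```python
# Nested dicts stand in for HDF5 groups; leaves stand in for datasets.
# All names and values below are illustrative assumptions.
container = {
    "QuakeML": "<QuakeML document>",  # event metadata (placeholder string)
    "Waveforms": {
        "XX.STA01": {
            "StationXML": "<StationXML document>",  # station metadata (placeholder)
            "XX.STA01..BHZ__raw_recording": [0.0, 0.1, -0.2],  # waveform samples
        },
    },
    "AuxiliaryData": {"CrossCorrelations": {}},
    "Provenance": {},
}


def list_paths(group, prefix=""):
    """Return the slash-separated path of every leaf (or empty group)."""
    paths = []
    for name, item in sorted(group.items()):
        path = f"{prefix}/{name}"
        if isinstance(item, dict) and item:
            # Non-empty group: descend into it.
            paths.extend(list_paths(item, path))
        else:
            paths.append(path)
    return paths


paths = list_paths(container)
```

Walking the structure in this way shows how one file can be inspected without any external metadata: every piece of information is reachable from the root.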

Applications of the proposed format are numerous. In the context of seismic tomography, it enables the full description and storage of synthetic waveforms, including information about the model used, the solver, the parameters, and other variables that influenced the final waveforms. Furthermore, intermediate products such as adjoint sources, cross correlations, and receiver functions can be described and, most importantly, exchanged with others.
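
To make the description of a synthetic waveform more concrete, the sketch below builds a small provenance record from three core W3C PROV concepts (entity, activity, agent) and three of its relations. Every identifier, attribute name, and parameter value is hypothetical and does not reflect the vocabulary defined by the format's PROV extension; `lineage` is a helper invented for this example.

```python
# A minimal provenance record as plain dictionaries.
# "entity", "activity", "agent" and the three relation names are W3C PROV terms;
# every identifier and attribute below is a made-up example.
record = {
    "entity": {
        "ex:earth_model": {"ex:name": "example_model"},
        "ex:synthetic_waveform": {"ex:channel": "XX.STA01..BHZ"},
    },
    "activity": {
        "ex:simulation": {"ex:time_step_s": 0.05, "ex:n_steps": 20000},
    },
    "agent": {
        "ex:solver": {"ex:version": "0.0.0"},
    },
    # Relations stored as (subject, object) pairs.
    "used": [("ex:simulation", "ex:earth_model")],
    "wasGeneratedBy": [("ex:synthetic_waveform", "ex:simulation")],
    "wasAssociatedWith": [("ex:simulation", "ex:solver")],
}


def lineage(record, entity_id):
    """Return the activity that generated an entity, plus its inputs and agents."""
    activity = next(a for e, a in record["wasGeneratedBy"] if e == entity_id)
    inputs = [e for a, e in record["used"] if a == activity]
    agents = [g for a, g in record["wasAssociatedWith"] if a == activity]
    return {"activity": activity, "inputs": inputs, "agents": agents}


history = lineage(record, "ex:synthetic_waveform")
```

Here `history` names the simulation that produced the waveform, the model it used, and the solver responsible, which is the kind of question a recipient of an exchanged data set would want answered.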

Usability and tool support are crucial for any new format to gain acceptance, so we additionally present a fully functional implementation of this format based on Python and ObsPy. It offers a convenient way to discover and analyze data sets and makes it trivial to execute processing functionality on modern high-performance machines using parallel I/O, even for users not familiar with the details. An open-source development and design model, together with a wiki, aims to involve the community.
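
The pattern of applying one processing function to every waveform in a data set can be sketched with the Python standard library alone. This is not the presented implementation: a thread pool stands in for parallel I/O on a cluster, and `demean`, `process_dataset`, and the sample data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def demean(samples):
    """Example processing step: remove the mean from a list of samples."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def process_dataset(waveforms, func, max_workers=4):
    """Apply func to every waveform; return a new mapping with the same keys."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with the keys.
        results = pool.map(func, waveforms.values())
        return dict(zip(waveforms.keys(), results))


# Made-up channel identifiers and samples, for illustration only.
waveforms = {
    "XX.STA01..BHZ": [1.0, 2.0, 3.0],
    "XX.STA02..BHZ": [4.0, 4.0, 4.0],
}
processed = process_dataset(waveforms, demean)
```

The user supplies only the per-waveform function; how the work is distributed stays hidden behind `process_dataset`, which mirrors the goal of shielding users from the details of parallel execution.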