An efficient hierarchical compression scheme for array-oriented climate data

Wednesday, 17 December 2014
Siddhartha S Ghosh (1), Allison H Baker (2) and Haiying Xu (2); (1) NCAR, Boulder, CO, United States; (2) National Center for Atmospheric Research, Boulder, CO, United States
High-resolution climate models generate huge amounts of data. In recent years, with the widespread availability of large computing resources, the amount of data has been growing at an exponential rate. Archiving and curating these data for further research and analysis is a major challenge. Typically, the data are stored as 4-byte real values representing physical field variables on four-dimensional spatio-temporal grids. Many existing lossy and lossless algorithms do not take full advantage of the uniform or nearly uniform variation of these physical variables over spatio-temporal grids. We develop and implement an algorithm that compresses chunks of these physical variables in spatio-temporal domains to exploit this uniformity. One important feature of the implementation is that it allows users to control the extent of the spatio-temporal chunks, which can be tuned to maximize the compression factor. A second important feature is the ability to fine-tune the precision of the stored physical variables according to the requirements of each data type. For example, climatological analysis involves averaging over field variables, which makes some of the least significant bits of the data irrelevant to the final result; in our scheme, users can control the number of bits that are stored. To compress the data further, we pass the resulting stream through a lossless LZMA scheme. The compression achieved by this scheme depends on the uniformity of the data and the required number of significant bits. We will report the compression factors obtained when this package is applied to a wide range of climate variables, under the constraint that the climate analysis results remain statistically indistinguishable, following the analysis of Baker et al. [1].
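The three ingredients named in the abstract (user-sized chunks, a user-chosen number of stored bits, and a final lossless LZMA pass) can be illustrated with a minimal sketch. This is not the authors' package, and it does not reproduce their hierarchical scheme, whose details the abstract does not give. It assumes a flat list of values stored as IEEE 754 single-precision floats, chunks along one axis only, and simple mantissa truncation as a stand-in for precision control. The names `truncate_bits`, `compress_chunks`, `decompress_chunks`, `chunk_len`, and `keep_bits` are hypothetical.

```python
import lzma
import math
import struct


def truncate_bits(values, keep_bits):
    """Pack values as little-endian float32 and zero the low mantissa bits.

    keep_bits is the number of explicit mantissa bits retained (0..23).
    This is an assumed stand-in for the abstract's precision control.
    """
    if not 0 <= keep_bits <= 23:
        raise ValueError("keep_bits must be between 0 and 23")
    drop = 23 - keep_bits
    mask = (0xFFFFFFFF << drop) & 0xFFFFFFFF
    n = len(values)
    # Reinterpret the float32 bytes as unsigned 32-bit words, then mask.
    words = struct.unpack(f"<{n}I", struct.pack(f"<{n}f", *values))
    return struct.pack(f"<{n}I", *(w & mask for w in words))


def compress_chunks(values, chunk_len, keep_bits):
    """Split values into chunks, truncate each, and LZMA-compress it."""
    chunks = []
    for start in range(0, len(values), chunk_len):
        payload = truncate_bits(values[start:start + chunk_len], keep_bits)
        chunks.append(lzma.compress(payload))
    return chunks


def decompress_chunks(chunks):
    """Invert the LZMA pass; the truncated bits are not recovered."""
    out = []
    for blob in chunks:
        raw = lzma.decompress(blob)
        out.extend(struct.unpack(f"<{len(raw) // 4}f", raw))
    return out


if __name__ == "__main__":
    # Synthetic, smoothly varying field (hypothetical data, not climate output).
    field = [280.0 + 5.0 * math.sin(i / 200.0) for i in range(4096)]
    blobs = compress_chunks(field, chunk_len=1024, keep_bits=10)
    stored = sum(len(b) for b in blobs)
    print("compression factor:", 4 * len(field) / stored)
```

In this sketch, keeping k mantissa bits bounds the relative error of each restored value by roughly 2^-k, so `keep_bits` trades accuracy against compressibility, while `chunk_len` trades per-chunk container overhead against how uniform the data inside each chunk are. Whether a given setting preserves climate statistics would still have to be checked with a methodology such as that of [1].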

[1] A.H. Baker, H. Xu, J.M. Dennis, M.N. Levy, D. Nychka, S.A. Mickelson, J. Edwards, M. Vertenstein, A. Wegener, “A Methodology for Evaluating the Impact of Data Compression on Climate Simulation Data,” Proc. of the 23rd International ACM Symposium on High Performance Parallel and Distributed Computing (HPDC '14), Vancouver, Canada, 2014, pp. 203-214.