IN43B-1735
A Columnar Storage Strategy with Spatiotemporal Index for Big Climate Data

Thursday, 17 December 2015
Poster Hall (Moscone South)
Fei Hu1, Michael K. Bowen2, Zhenlong Li1, John L Schnase3, Daniel Duffy4, Tsengdar J Lee5 and Chaowei Phil Yang1, (1)George Mason University Fairfax, Fairfax, VA, United States, (2)NASA Goddard Space Flight Center, Greenbelt, MD, United States, (3)NASA Goddard Space Flight Cent, Greenbelt, MD, United States, (4)NASA Center for Climate Simulation, Greenbelt, MD, United States, (5)NASA Headquarters, Washington, DC, United States
Abstract:
Collections of observational, reanalysis, and climate model output data may grow to as much as 100 PB in the coming years, which places climate datasets in the Big Data domain. Various distributed computing frameworks have been used to address the challenges of big climate data analysis. However, because these data are stored in binary formats (NetCDF, HDF) with high spatial and temporal dimensionality, the computing frameworks in the Apache Hadoop ecosystem were not originally suited to them. To let these frameworks support big climate data directly, we propose a columnar storage format with a spatiotemporal index, intended to support any project in the Apache Hadoop ecosystem (e.g., MapReduce, Spark, Hive, Impala). With this approach, climate data are converted into the binary Parquet format, a columnar storage format, and a spatial and temporal index is built and appended to the end of each Parquet file to enable real-time data queries. Climate data in Parquet format are then available to any computing framework in the Hadoop ecosystem. The proposed approach is evaluated using the NASA Modern-Era Retrospective Analysis for Research and Applications (MERRA) climate reanalysis dataset. Experimental results show that the approach efficiently bridges the gap between big climate data and distributed computing frameworks, and that the spatiotemporal index significantly accelerates data querying and processing.
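
The idea of attaching a spatiotemporal index to the end of a columnar file can be illustrated with a minimal sketch. This is not the authors' implementation and it does not read or write real Parquet files. It assumes a toy binary layout in which each chunk of rows is written column by column and a JSON footer records each chunk's byte offset and its time, latitude, and longitude bounds. The function names (write_indexed, read_index, query) and the footer layout are hypothetical, chosen only for illustration.

```python
import io
import json
import struct


def write_indexed(buf, chunks):
    """Write chunks of (time, lat, lon, value) rows in columnar order,
    then append a footer index (toy layout, not the Parquet format)."""
    index = []
    for rows in chunks:
        times, lats, lons, vals = zip(*rows)
        n = len(rows)
        offset = buf.tell()
        # Columnar layout: all times first, then all lats, lons, values.
        buf.write(struct.pack(f"<{n}q", *times))
        for col in (lats, lons, vals):
            buf.write(struct.pack(f"<{n}d", *col))
        # Index entry: where the chunk starts and what region/time it covers.
        index.append({
            "offset": offset,
            "rows": n,
            "t": [min(times), max(times)],
            "lat": [min(lats), max(lats)],
            "lon": [min(lons), max(lons)],
        })
    footer = json.dumps(index).encode("utf-8")
    buf.write(footer)
    # The last 4 bytes store the footer length so a reader can find it.
    buf.write(struct.pack("<I", len(footer)))


def read_index(buf):
    """Read the footer index from the end of the file."""
    buf.seek(-4, io.SEEK_END)
    (size,) = struct.unpack("<I", buf.read(4))
    buf.seek(-4 - size, io.SEEK_END)
    return json.loads(buf.read(size).decode("utf-8"))


def overlaps(bounds, low, high):
    """True if the closed interval bounds intersects [low, high]."""
    return bounds[0] <= high and low <= bounds[1]


def query(buf, t_range, lat_range, lon_range):
    """Return matching rows and the number of chunks actually read."""
    hits, chunks_read = [], 0
    for entry in read_index(buf):
        # Skip chunks whose bounds cannot contain a matching row.
        if not (overlaps(entry["t"], *t_range)
                and overlaps(entry["lat"], *lat_range)
                and overlaps(entry["lon"], *lon_range)):
            continue
        chunks_read += 1
        n = entry["rows"]
        buf.seek(entry["offset"])
        times = struct.unpack(f"<{n}q", buf.read(8 * n))
        lats = struct.unpack(f"<{n}d", buf.read(8 * n))
        lons = struct.unpack(f"<{n}d", buf.read(8 * n))
        vals = struct.unpack(f"<{n}d", buf.read(8 * n))
        for t, la, ln, v in zip(times, lats, lons, vals):
            if (t_range[0] <= t <= t_range[1]
                    and lat_range[0] <= la <= lat_range[1]
                    and lon_range[0] <= ln <= lon_range[1]):
                hits.append((t, la, ln, v))
    return hits, chunks_read


if __name__ == "__main__":
    demo = io.BytesIO()
    write_indexed(demo, [
        [(0, 10.0, 20.0, 280.5), (0, 10.5, 20.0, 281.0)],
        [(1, -30.0, 100.0, 290.0), (1, -30.5, 100.0, 291.5)],
    ])
    rows, read = query(demo, (0, 0), (9.0, 11.0), (19.0, 21.0))
    print(rows, "chunks read:", read)
```

In this sketch a query consults the footer first and reads only the chunks whose bounds overlap the requested time and bounding box, which is the general mechanism by which an index of this kind can reduce I/O. How the actual index is structured inside the Parquet files is not described in the abstract.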