Developing a Hadoop-based Middleware for Handling Multi-dimensional NetCDF

Wednesday, 17 December 2014
Zhenlong Li1, Chaowei Phil Yang1, John L Schnase2, Daniel Duffy3 and Tsengdar J Lee4, (1)George Mason University, Fairfax, VA, United States, (2)NASA Goddard Space Flight Center, Greenbelt, MD, United States, (3)NASA Center for Climate Simulation, Greenbelt, MD, United States, (4)NASA, Burke, VA, United States
Climate observations and model simulations are generating vast amounts of climate data, and these data continue to accumulate at a rapid pace. Effectively managing and analyzing these data is essential for climate change studies. Hadoop, a distributed storage and processing framework for large data sets, has attracted increasing attention as a means of addressing the Big Data challenge. The maturity of Infrastructure as a Service (IaaS) in cloud computing further accelerates the adoption of Hadoop for solving Big Data problems. However, Hadoop is designed to process unstructured data such as text, documents, and web pages, and cannot effectively handle scientific data formats such as array-based NetCDF files and other binary formats. In this paper, we propose to build a Hadoop-based middleware for transparently handling big NetCDF data by 1) designing a distributed climate data storage mechanism based on a POSIX-enabled parallel file system, which enables parallel big data processing with MapReduce while also supporting data access by other systems; 2) modifying the Hadoop framework to transparently process NetCDF data in parallel without sequencing the data, converting it into other file formats, or loading it into HDFS; and 3) seamlessly integrating Hadoop, cloud computing, and climate data in a highly scalable and fault-tolerant framework.