Developing a Hadoop-based Middleware for Handling Multi-dimensional NetCDF

Li, Zhenlong

Abstract:

Climate observations and model simulations collect and generate vast amounts of climate data, and these data are accumulating at a rapid and ever-increasing pace. Effectively managing and analyzing these data is essential for climate change studies. Hadoop, a distributed storage and processing framework for large data sets, has attracted increasing attention as a way to address the Big Data challenge. The maturity of Infrastructure as a Service (IaaS) in cloud computing further accelerates the adoption of Hadoop for solving Big Data problems. However, Hadoop is designed to process unstructured data such as text, documents, and web pages, and cannot effectively handle scientific data formats such as array-based NetCDF files and other binary formats. In this paper, we propose to build a Hadoop-based middleware for transparently handling big NetCDF data by 1) designing a distributed climate data storage mechanism based on a POSIX-enabled parallel file system to enable parallel big data processing with MapReduce, as well as to support data access by other systems; 2) modifying the Hadoop framework to transparently process NetCDF data in parallel without sequencing or converting the data into other file formats, or loading them into HDFS; and 3) seamlessly integrating Hadoop, cloud computing, and climate data in a highly scalable and fault-tolerant framework.
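To make the idea of processing array-based data in parallel without converting it more concrete, the following is a minimal Python sketch and not the middleware described in this abstract. It assumes a variable with dimensions (time, lat, lon), splits the time axis into contiguous (start, count) ranges in the spirit of MapReduce input splits, computes a partial sum per split, and combines the partials into a global mean. The names `plan_splits`, `map_partial_sum`, and `reduce_mean` are hypothetical, and a nested in-memory list stands in for a NetCDF variable that a real system would read from the parallel file system.

```python
from typing import List, Sequence, Tuple

# A split is a (start, count) range along the time dimension (hypothetical type).
Split = Tuple[int, int]


def plan_splits(n_time: int, n_splits: int) -> List[Split]:
    """Divide the time axis into contiguous, nearly equal (start, count) ranges."""
    base, extra = divmod(n_time, n_splits)
    splits: List[Split] = []
    start = 0
    for i in range(n_splits):
        count = base + (1 if i < extra else 0)
        if count > 0:  # skip empty ranges when n_splits > n_time
            splits.append((start, count))
        start += count
    return splits


def map_partial_sum(
    data: Sequence[Sequence[Sequence[float]]], split: Split
) -> Tuple[float, int]:
    """Map step: sum and count the values of one split of a (time, lat, lon) array."""
    start, count = split
    total = 0.0
    n = 0
    for t in range(start, start + count):
        for row in data[t]:
            total += sum(row)
            n += len(row)
    return total, n


def reduce_mean(partials: Sequence[Tuple[float, int]]) -> float:
    """Reduce step: combine per-split (sum, count) pairs into a global mean."""
    total = sum(p[0] for p in partials)
    n = sum(p[1] for p in partials)
    return total / n


# Stand-in for a NetCDF variable: 10 time steps on a 2 x 3 grid,
# where every cell at time step t holds the value t.
data = [[[float(t)] * 3 for _ in range(2)] for t in range(10)]

splits = plan_splits(n_time=10, n_splits=4)
partials = [map_partial_sum(data, s) for s in splits]
mean_value = reduce_mean(partials)
```

Because each split is described only by an index range, every map task in this sketch can read its own portion of the array directly, which is the property that makes conversion to another file format unnecessary.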

2014 AGU Fall Meeting

December 15 - 19, 2014

IN31B-3727:

Developing a Hadoop-based Middleware for Handling Multi-dimensional NetCDF
