IN51B-1812
Evaluation of Big Data Containers for Popular Storage, Retrieval, and Computation Primitives in Earth Science Analysis
Friday, 18 December 2015
Poster Hall (Moscone South)
Kamalika Das, NASA Ames Research Center, Mountain View, CA, United States, Thomas Clune, NASA Goddard Space Flight Center, Greenbelt, MD, United States, Kwo-Sen Kuo, Earth System Science Interdisciplinary Center, College Park, MD, United States, Chris A Mattmann, Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, United States, Thomas Huang, NASA Jet Propulsion Laboratory, Pasadena, CA, United States, Daniel Duffy, NASA Center for Climate Simulation, Greenbelt, MD, United States, Chaowei Phil Yang, George Mason University, Fairfax, VA, United States, Ted Habermann, HDF Group, Champaign, IL, United States and AIST Data Container Study Team
Abstract:
Data containers are software infrastructures that facilitate the storage, retrieval, and analysis of data sets. Big data applications in Earth Science require a mix of processing techniques, data sources, and storage formats that are supported by different data containers. Some of the most popular data containers used in Earth Science studies are Hadoop, Spark, SciDB, AsterixDB, and RasDaMan. These containers optimize different aspects of the data processing pipeline and are therefore suitable for different types of applications. Because these containers are expected to evolve rapidly, the ability to re-test them as they evolve is essential to ensure that they remain up to date and ready to be deployed against large volumes of observational data and model output. Our goal is to develop an evaluation plan for these containers to assess their suitability for Earth Science data processing needs. We have identified a selection of test cases that are relevant to most data processing exercises in Earth Science applications, and we aim to evaluate these systems for optimal performance against each of them. The use cases identified as part of this study are (i) data fetching, (ii) data preparation for multivariate analysis, (iii) data normalization, (iv) distance (kernel) computation, and (v) optimization. In this study we develop a set of metrics for performance evaluation, define the specifics of governance, and test the plan on current versions of the data containers. The test plan and the design mechanism are extensible, allowing repeated testing with both new containers and upgraded versions of those mentioned above, so that we can gauge their utility as they evolve.
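To illustrate the kind of primitives these test cases exercise, the sketch below expresses two of them, (iii) data normalization and (iv) distance (kernel) computation, in PySpark, one of the containers under evaluation. The column names, toy dataset, and application name are hypothetical and are not drawn from the study's actual test plan; they only indicate the shape of an operation a common test suite would run against each container.

```python
# Minimal sketch (hypothetical data and names) of two test-case primitives
# in PySpark: per-variable normalization and pairwise distance computation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("container-primitive-sketch").getOrCreate()

# Hypothetical gridded observations: one row per grid cell, one column per variable.
variables = ("temperature_k", "humidity")
df = spark.createDataFrame(
    [(1, 290.1, 0.012), (2, 288.7, 0.015), (3, 295.3, 0.009)],
    ["cell_id", *variables],
)

# (iii) Data normalization: rescale each variable to zero mean, unit variance.
stats = df.select(
    *[F.mean(c).alias(f"{c}_mu") for c in variables],
    *[F.stddev(c).alias(f"{c}_sd") for c in variables],
).first()
normalized = df.select(
    "cell_id",
    *[((F.col(c) - stats[f"{c}_mu"]) / stats[f"{c}_sd"]).alias(c) for c in variables],
)

# (iv) Distance (kernel) computation: pairwise Euclidean distance between grid
# cells via a self-join; an RBF kernel would exponentiate the squared distance.
a, b = normalized.alias("a"), normalized.alias("b")
pairs = a.join(b, F.col("a.cell_id") < F.col("b.cell_id"))
distances = pairs.select(
    F.col("a.cell_id").alias("i"),
    F.col("b.cell_id").alias("j"),
    F.sqrt(
        sum((F.col(f"a.{c}") - F.col(f"b.{c}")) ** 2 for c in variables)
    ).alias("euclidean_distance"),
)
distances.show()
```

Equivalent expressions of the same primitives could presumably be written in SciDB's query languages or RasDaMan's rasql, which is what makes a shared, repeatable test suite across containers feasible.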