SciSpark: Highly Interactive and Scalable Model Evaluation and Climate Metrics

Wilson, Brian

Abstract:

Remote sensing data and climate model output are massive multi-dimensional arrays locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF). This makes multi-stage, iterative science processing difficult, since each stage requires writing data to and reading it from disk. We are developing a lightning-fast Big Data technology called SciSpark, based on Apache™ Spark. Spark implements the map-reduce paradigm for parallel computing on a cluster but emphasizes in-memory computation, “spilling” to disk only as needed; it thereby outperforms the disk-based Apache™ Hadoop by 100x in memory and by 10x on disk, and makes iterative algorithms feasible.

SciSpark will enable scalable model evaluation by executing large-scale comparisons of A-Train satellite observations to model grids on a cluster of 100 to 1000 compute nodes. This second-generation capability for NASA’s Regional Climate Model Evaluation System (RCMES) will compute simple climate metrics at interactive speeds, and will extend to sophisticated iterative algorithms such as machine-learning (ML) based clustering of temperature PDFs and graph-based algorithms that search for Mesoscale Convective Complexes.

The goals of SciSpark are to: (1) Decrease the time to compute comparison statistics and plots from minutes to seconds; (2) Allow for interactive exploration of time-series properties over seasons and years; (3) Decrease the time for satellite data ingestion into RCMES to hours; (4) Allow for Level-2 comparisons with higher-order statistics or PDFs in minutes to hours; and (5) Move RCMES toward being a near-real-time decision-making platform.

We will report on: the architecture and design of SciSpark; our efforts to integrate climate science algorithms in Python and Scala; parallel ingest and partitioning (sharding) of A-Train satellite observations from HDF files and model grids from NetCDF files; first parallel runs to compute comparison statistics and PDFs; and first metrics quantifying parallel speedups and memory and disk usage.
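As a rough illustration of how comparison statistics can be computed over partitioned (sharded) data in map-reduce style, the following is a minimal sketch in plain Python. It is not SciSpark code and does not use Spark. The function names (`partition`, `map_partial`, `combine`, `comparison_stats`), the choice of bias and RMSE as the statistics, and the sample values are all assumptions made for illustration.

```python
from functools import reduce
from math import sqrt


def partition(pairs, n_parts):
    """Split (model, observation) pairs into n_parts shards (hypothetical sharding scheme)."""
    return [pairs[i::n_parts] for i in range(n_parts)]


def map_partial(shard):
    """Map step: summarize one shard as (count, sum of differences, sum of squared differences)."""
    diffs = [m - o for m, o in shard]
    return (len(diffs), sum(diffs), sum(d * d for d in diffs))


def combine(a, b):
    """Reduce step: merge two partial summaries; associative, so shards can be merged in any order."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def comparison_stats(pairs, n_parts=4):
    """Compute mean bias and RMSE of model minus observation from per-shard partial sums."""
    partials = map(map_partial, partition(pairs, n_parts))
    n, total, total_sq = reduce(combine, partials, (0, 0.0, 0.0))
    if n == 0:
        raise ValueError("no data points to compare")
    return {"bias": total / n, "rmse": sqrt(total_sq / n)}


# Made-up sample values (kelvin), for illustration only.
model = [290.0, 291.5, 289.0, 292.0]
obs = [289.0, 291.0, 290.0, 290.0]
stats = comparison_stats(list(zip(model, obs)), n_parts=2)
print(stats)
```

Because each shard is reduced to a small tuple of partial sums, only those tuples need to be exchanged between workers. On a cluster, the same map and combine functions could in principle be handed to a parallel framework in place of the built-in `map` and `reduce`.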

2014 AGU Fall Meeting

December 15 - 19, 2014

IN51A-3772:
