IN51B-1815
SciDB versus Spark: A Preliminary Comparison Based on an Earth Science Use Case
Abstract:
We compare two Big Data technologies, SciDB and Spark, for performance, usability, and extensibility when applied to a representative Earth science use case. SciDB is a new-generation parallel distributed database management system (DBMS) based on the array data model. It handles multidimensional arrays efficiently but requires a lengthy data ingest prior to analysis. Spark, by contrast, is a fast and general engine for large-scale data processing that can process raw data files immediately and thereby avoid the ingest step. Once data have been ingested, SciDB is very efficient in database operations such as subsetting. Spark, on the other hand, provides greater flexibility by supporting a wide variety of high-level tools, including DBMSs.
For the performance aspect of this preliminary comparison, we configure Spark to operate directly on text or binary data files and thereby limit the need for additional tools. Arguably, a more appropriate comparison would explore other configurations of Spark that exploit its supported high-level tools, but that is beyond our current resources. To make the comparison as “fair” as possible, we export the arrays produced by SciDB to text files (or convert them to binary files) for intake by Spark, so that Spark incurs no additional file-processing penalties.
The Earth science use case selected for this comparison is the identification and tracking of snowstorms in the NASA Modern-Era Retrospective Analysis for Research and Applications (MERRA) reanalysis data. The identification portion of the use case flags all grid cells of the MERRA high-resolution hourly data that satisfy our criteria for a snowstorm, whereas the tracking portion connects flagged cells that are adjacent in time and space to form a snowstorm episode.
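The two portions of the use case can be illustrated with a minimal, single-machine Python sketch. It is not the SciDB or Spark implementation compared here, and it rests on several assumptions made only for illustration: the snowstorm criteria are reduced to a single hypothetical threshold on one variable, the data are a small nested list indexed as (time, row, column), "adjacent" is taken to mean any of the 26 neighbours in time and space, and longitude wrap-around is ignored. The function names `flag_cells` and `track_episodes` are likewise hypothetical.

```python
from collections import deque


def flag_cells(field, threshold):
    """Identification: return the (t, y, x) indices of cells that meet
    a hypothetical criterion (value >= threshold)."""
    return {
        (t, y, x)
        for t, grid in enumerate(field)
        for y, row in enumerate(grid)
        for x, value in enumerate(row)
        if value >= threshold
    }


def track_episodes(flagged):
    """Tracking: group flagged cells into episodes by breadth-first search,
    treating cells as connected when they differ by at most one step in
    each of time, row, and column (an assumed adjacency rule)."""
    offsets = [
        (dt, dy, dx)
        for dt in (-1, 0, 1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dt, dy, dx) != (0, 0, 0)
    ]
    unvisited = set(flagged)
    episodes = []
    while unvisited:
        seed = unvisited.pop()
        episode = {seed}
        queue = deque([seed])
        while queue:
            t, y, x = queue.popleft()
            for dt, dy, dx in offsets:
                neighbour = (t + dt, y + dy, x + dx)
                if neighbour in unvisited:
                    unvisited.remove(neighbour)
                    episode.add(neighbour)
                    queue.append(neighbour)
        episodes.append(episode)
    return episodes


# Toy data: 4 time steps on a 3 x 3 grid (values are made up).
toy_field = [
    [[0, 5, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 6, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 0, 0], [0, 0, 7]],
]
toy_flagged = flag_cells(toy_field, threshold=5)
toy_episodes = track_episodes(toy_flagged)
```

In this toy example the cells flagged at the first two time steps are neighbours and form one episode, while the cell flagged at the last time step is separated by an unflagged hour and forms a second episode.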
We will report the results of our comparisons in this presentation.