Integrating the Apache Big Data Stack with HPC for Big Data

Tuesday, 16 December 2014: 2:00 PM
Geoffrey C Fox (1), Judy Qiu (1) and Shantenu Jha (2); (1) Indiana University Bloomington, School of Informatics and Computing, Bloomington, IN, United States; (2) Rutgers University Newark, Department of Electrical and Computer Engineering, Newark, NJ, United States
There is arguably a broad consensus on the important issues in practical parallel computing as applied to large-scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development. However, the same is not true for data-intensive computing, even though commercial clouds devote far more resources to data analytics than supercomputers devote to simulations.

We examine a sample of over 50 big data applications to identify the characteristics of data-intensive applications and to deduce the runtimes and architectures they need. We suggest a big data version of the well-known Berkeley dwarfs and NAS parallel benchmarks, and use these to identify a few key classes of hardware/software architectures. Our analysis builds on combining HPC with ABDS, the Apache big data software stack that is widely used in modern cloud computing. Initial results on clouds and HPC systems are encouraging.

We propose the development of SPIDAL (Scalable Parallel Interoperable Data Analytics Library), built on the system and data abstractions suggested by the HPC-ABDS architecture. We discuss how it can be used in several application areas, including polar science.