IN31C-1772
The Convergence of High Performance Computing and Large Scale Data Analytics
Abstract:
As the combinations of remote sensing observations and model outputs have grown, scientists are increasingly burdened with both the necessity and complexity of large-scale data analysis. Scientists are increasingly applying traditional high performance computing (HPC) solutions to solve their “Big Data” problems. While this approach has the benefit of limiting data movement, the HPC system is not optimized to run analytics, which can create problems that permeate throughout the HPC environment.To solve these issues and to alleviate some of the strain on the HPC environment, the NASA Center for Climate Simulation (NCCS) has created the Advanced Data Analytics Platform (ADAPT), which combines both HPC and cloud technologies to create an agile system designed for analytics. Large, commonly used data sets are stored in this system in a write once/read many file system, such as Landsat, MODIS, MERRA, and NGA. High performance virtual machines are deployed and scaled according to the individual scientist’s requirements specifically for data analysis.
On the software side, the NCCS and GMU are working with emerging commercial technologies and applying them to structured, binary scientific data in order to expose the data in new ways. Native NetCDF data is being stored within a Hadoop Distributed File System (HDFS) enabling storage-proximal processing through MapReduce while continuing to provide accessibility of the data to traditional applications. Once the data is stored within HDFS, an additional indexing scheme is built on top of the data and placed into a relational database. This spatiotemporal index enables extremely fast mappings of queries to data locations to dramatically speed up analytics. These are some of the first steps toward a single unified platform that optimizes for both HPC and large-scale data analysis, and this presentation will elucidate the resulting and necessary exascale architectures required for future systems.