A Disciplined Architectural Approach to Scaling Data Analysis for Massive, Scientific Data

Tuesday, 16 December 2014: 4:00 PM
Daniel J Crichton (1), Amy J Braverman (1), Luca Cinquini (2), Michael Turmon (3), Huikyo Lee (4) and Emily Law (1); (1) NASA Jet Propulsion Laboratory, Pasadena, CA, United States, (2) Jet Propulsion Laboratory, Broomfield, CO, United States, (3) JPL, Pasadena, CA, United States, (4) Jet Propulsion Laboratory, Caltech, Altadena, CA, United States
Data collections across remote sensing and ground-based instruments in astronomy, Earth science, and planetary science are outpacing scientists’ ability to analyze them. Furthermore, the distribution, structure, and heterogeneity of the measurements themselves pose challenges that limit the scalability of data analysis using traditional approaches. Developing science data processing pipelines, distributing scientific datasets, and performing analysis will require innovative methods that integrate cyber-infrastructure, algorithms, and data into more systematic approaches that can compute and reduce data more efficiently, particularly distributed data. This requires integrating computer science, machine learning, statistics, and domain expertise to identify scalable architectures for data analysis.

The volume of data returned from Earth science observing satellites and of climate model output is predicted to grow into the tens of petabytes, challenging current data analysis paradigms. The same kind of growth is occurring in astronomy and planetary science data.

One of the major challenges in data science and related disciplines is defining new approaches to scaling systems and analysis in order to increase scientific productivity and yield. Specific needs include: 1) identification of optimized system architectures for analyzing massive, distributed data sets; 2) algorithms for systematic analysis of massive data sets in distributed environments; and 3) the development of software infrastructures that are capable of performing massive, distributed data analysis across a comprehensive data science framework.

NASA/JPL has begun an initiative in data science to address these challenges. Our goal is to evaluate how scientific productivity can be improved through optimized architectural topologies that identify how to deploy and manage the access, distribution, computation, and reduction of massive, distributed data, while managing the uncertainties of scientific conclusions derived from such capabilities. This talk will provide an overview of JPL’s efforts in developing a comprehensive architectural approach to data science.