Using Analytics to Support Petabyte-Scale Science on the NASA Earth Exchange (NEX)

Friday, 19 December 2014
Petr Votava1,2, Andrew Michaelis1,2, Sangram Ganguly1,3 and Ramakrishna R Nemani1, (1)NASA Ames Research Center, Moffett Field, CA, United States, (2)University Corporation at Monterey Bay, Seaside, CA, United States, (3)Bay Area Environmental Research Institute Moffett Field, Moffett Field, CA, United States
NASA Earth Exchange (NEX) is a data, supercomputing and knowledge collaboratory that houses NASA satellite, climate and ancillary data where a focused community can come together to address large-scale challenges in Earth sciences. Analytics within NEX occurs at several levels – data, workflows, science and knowledge. At the data level, we are focusing on collecting and analyzing any information that is relevant to efficient acquisition, processing and management of data at the smallest granularity, such as files or collections. This includes processing and analyzing all local and many external metadata that are relevant to data quality, size, provenance, usage and other attributes. This then helps us better understand usage patterns and improve efficiency of data handling within NEX. When large-scale workflows are executed on NEX, we capture information that is relevant to processing and that can be analyzed in order to improve efficiencies in job scheduling, resource optimization, or data partitioning that would improve processing throughput. At this point we also collect data provenance as well as basic statistics of intermediate and final products created during the workflow execution. These statistics and metrics form basic process and data QA that, when combined with analytics algorithms, helps us identify issues early in the production process. We have already seen impact in some petabyte-scale projects, such as global Landsat processing, where we were able to reduce processing times from days to hours and enhance process monitoring and QA. While the focus so far has been mostly on support of NEX operations, we are also building a web-based infrastructure that enables users to perform direct analytics on science data – such as climate predictions or satellite data. Finally, as one of the main goals of NEX is knowledge acquisition and sharing, we began gathering and organizing information that associates users and projects with data, publications, locations and other attributes that can then be analyzed as a part of the NEX knowledge graph and used to greatly improve advanced search capabilities. Overall, we see data analytics at all levels as an important part of NEX as we are continuously seeking improvements in data management, workflow processing, use of resources, usability and science acceleration.