OpenTopography: Addressing Big Data Challenges Using Cloud Computing, HPC, and Data Analytics

Monday, 15 December 2014
Christopher J Crosby1, Vishu Nandigam2, Minh Phan2, Choonhan Youn2, Chaitanya Baru2 and Ramon Arrowsmith3, (1)UNAVCO, Boulder, CO, United States, (2)San Diego Supercomputer Center, La Jolla, CA, United States, (3)Arizona State University, EarthScope National Office, School of Earth and Space Exploration, Tempe, AZ, United States
OpenTopography (OT) is a geoinformatics-based data facility initiated in 2009 for democratizing access to high-resolution topographic data, derived products, and tools. Hosted at the San Diego Supercomputer Center (SDSC), OT utilizes cyberinfrastructure, including large-scale data management, high-performance computing, and service-oriented architectures to provide efficient Web based access to large, high-resolution topographic datasets. OT collocates data with processing tools to enable users to quickly access custom data and derived products for their application. OT’s ongoing R&D efforts aim to solve emerging technical challenges associated with exponential growth in data, higher order data products, as well as user base.

Optimization of data management strategies can be informed by a comprehensive set of OT user access metrics that allows us to better understand usage patterns with respect to the data. By analyzing the spatiotemporal access patterns within the datasets, we can map areas of the data archive that are highly active (hot) versus the ones that are rarely accessed (cold). This enables us to architect a tiered storage environment consisting of high performance disk storage (SSD) for the hot areas and less expensive slower disk for the cold ones, thereby optimizing price to performance.

From a compute perspective, OT is looking at cloud based solutions such as the Microsoft Azure platform to handle sudden increases in load. An OT virtual machine image in Microsoft’s VM Depot can be invoked and deployed quickly in response to increased system demand. OT has also integrated SDSC HPC systems like the Gordon supercomputer into our infrastructure tier to enable compute intensive workloads like parallel computation of hydrologic routing on high resolution topography. This capability also allows OT to scale to HPC resources during high loads to meet user demand and provide more efficient processing.

With a growing user base and maturing scientific user community comes new requests for algorithms and processing capabilities. To address this demand, OT is developing an extensible service based architecture for integrating community-developed software. This “plugable” approach to Web service deployment will enable new processing and analysis tools to run collocated with OT hosted data.