Lightweight Data Systems in the Cloud: Costs, Benefits and Best Practices
Tuesday, 15 December 2015
Poster Hall (Moscone South)
We present a simple analysis of the costs and benefits of using the cloud in environmental science circa 2016. The aim is to help the research scientist who is a potential 'cloud adopter' explore and understand the tradeoffs in moving some aspect of their compute work to the cloud. We present examples, design patterns and best practices as an evolving body of knowledge that helps optimize the benefit to the research team. Thematically, this generally means not starting from a blank page but rather learning how to find 90% of the solution to a problem pre-built. We touch on four topics of interest. (1) Existing cloud data resources (NASA, WHOI BCO DMO, etc.) and how they can be discovered, used and improved. (2) How to explore, compare and evaluate cost and compute power across the many cloud options, particularly in relation to data scale (size/complexity). (3) Simple, fast 'Lightweight Data System' procedures that take from 20 minutes to one day to implement and that have a clear, immediate payoff in data-driven environmental research. Examples include publishing a SQL Share URL at (EarthCube's) CINERGI as a registered data resource, and creating executable papers on a cloud-hosted Jupyter instance, particularly IPython notebooks. (4) Translating the computational terminology landscape ('cloud', 'HPC cluster', 'Hadoop', 'Spark', 'machine learning') into examples from the community of practice to help the geoscientist build or expand their mental map. Throughout this discussion, which is about resource discovery, adoption and mastery, we point to online resources that support these themes.