U23A-05
Curse of the Zombie Petabytes

Tuesday, 15 December 2015: 15:00
3002 (Moscone West)
Jean-Bernard H Minster, University of California San Diego, La Jolla, CA, United States; Scripps Institution of Oceanography, La Jolla, CA, United States
Abstract:
Total world digital information, estimated at about 1 zettabyte (ZB = 10^21 bytes) in 2010, is forecast to reach 30-40 ZB by 2020, while the storage available by then will be only about 15 ZB. Scientific data, although a small fraction of that volume, pose a thorny problem for data-driven Earth science disciplines in which temporal trends are critical: older data sets, far from becoming obsolete, often gain in importance as time passes, and must therefore remain usable through careful curation. Unlike stone tablets, parchments, or high-quality paper, which have shelf lives of centuries or even millennia, digital data have a half-life measured in years to decades, not only because of medium degradation but, more critically, because of the rapid obsolescence of the underlying technology, in particular the software infrastructure. It is nigh impossible to read 7-track, 800 bpi tapes from the 1960s, and it is questionable whether today’s media could still be read in 2050. To combat obsolescence, data centers adopt a rigid discipline of “data migration,” with holdings transferred to new technology every decade or so; this is typically required by parent agencies and by organizations such as the ICSU World Data System. In this light, salvaging data from the last century so that they can be analyzed alongside current data becomes a massive migration exercise, one not helped by the all too common impulse to “save everything” (e.g., all successive versions of data, metadata, and higher-level products) as if resources and budgets were limitless. Further, software advances, longer time series, and better science sometimes call for ever-improved reprocessing of entire data sets, which must be curated for that purpose in a condition that permits uniform processing. Other atavisms (e.g., saving intermediate results, ostensibly as an evidence trail, or preserving all results for posterity, correct or not) leave “zombie” data sets that are never accessed for research. If unchecked, such practices could well overwhelm what Moore’s law can sustain. The problem is even more acute for simulation outputs, which can be essentially unbounded. I therefore argue that each community needs to develop a process and effective strategies for selecting which data to save for long-term curation and which to discard, or we will face the zombies again in a few decades.
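
To make the scale of the argument concrete, the back-of-the-envelope sketch below (not part of the abstract; all parameters are illustrative assumptions) contrasts a curated archive, which retains only the latest consolidated version of each data set, with a “save everything” archive that also keeps prior versions and intermediate products at every decadal migration/reprocessing cycle, measured against a storage budget that doubles on a Moore’s-law-like cadence. The ~40 %/yr growth rate is loosely anchored to the abstract’s 1 ZB (2010) to 30-40 ZB (2020) figures; the starting holdings, the retention overhead, and the budget are invented solely for illustration.

# Hedged, illustrative sketch only: none of these numbers come from the
# abstract except the ~40 %/yr growth implied by 1 ZB (2010) -> 30-40 ZB (2020);
# starting holdings, overhead factor, and budget are assumptions.

START_YEAR = 2020

RAW_PB_2020 = 10.0          # assumed archive holdings in 2020 (petabytes)
RAW_GROWTH = 0.40           # assumed ~40 %/yr growth in newly acquired data
BUDGET_PB_2020 = 50.0       # assumed affordable storage in 2020 (petabytes)
BUDGET_DOUBLING_YRS = 2.0   # assumed Moore's-law-like doubling of that budget
MIGRATION_PERIOD_YRS = 10   # abstract: migration/reprocessing every decade or so
ZOMBIE_OVERHEAD = 2.5       # assumed retention factor per cycle when "saving everything"


def holdings_pb(year: int, save_everything: bool) -> float:
    """Archive size (PB): curated keeps only the latest consolidated versions;
    'save everything' multiplies holdings by ZOMBIE_OVERHEAD each migration cycle."""
    t = year - START_YEAR
    curated = RAW_PB_2020 * (1.0 + RAW_GROWTH) ** t
    if not save_everything:
        return curated
    return curated * ZOMBIE_OVERHEAD ** (t // MIGRATION_PERIOD_YRS)


def budget_pb(year: int) -> float:
    """Affordable storage (PB), doubling every BUDGET_DOUBLING_YRS years."""
    return BUDGET_PB_2020 * 2.0 ** ((year - START_YEAR) / BUDGET_DOUBLING_YRS)


for year in range(START_YEAR, 2061, 10):
    curated = holdings_pb(year, save_everything=False)
    zombies = holdings_pb(year, save_everything=True)
    budget = budget_pb(year)
    note = "  <-- overwhelms the budget" if zombies > budget else ""
    print(f"{year}: curated {curated:12,.0f} PB | "
          f"save-everything {zombies:14,.0f} PB | "
          f"budget {budget:14,.0f} PB{note}")

Under these assumed parameters the curated archive stays within the doubling budget throughout, while the “save everything” archive overtakes it after about two migration cycles; the only point of the exercise is that per-cycle retention overheads compound multiplicatively on top of raw data growth.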