IN41A-1681
Big Data Archives: Replication and synchronizing on a large scale

Thursday, 17 December 2015
Poster Hall (Moscone South)
Todd A King, University of California Los Angeles, EPSS, Los Angeles, CA, United States and Raymond J Walker, University of California Los Angeles, Earth, Planetary, and Space Sciences, Los Angeles, CA, United States
Abstract:
Modern data archives pose unique challenges for replication and synchronization because of their large size. We collect more digital information today than at any time before, and the volume of data collected is continuously increasing. Some of these data come from unique observations, such as those from planetary missions, and should be preserved for use by future generations. In addition, data from NASA missions are considered federal records and must be retained. While the data may be stored on resilient hardware (e.g., RAID systems), they must also be protected from local or regional disasters. Meeting this challenge requires creating multiple copies. This task is complicated by the fact that new data are constantly being added, creating what are called "active archives". Having reliable, high-performance tools for replicating and synchronizing active archives in a timely fashion is critical to the preservation of the data. When archives were smaller, tools like bbcp, rsync and rcp worked fairly well. While these tools are effective, they are not optimized for synchronizing big data archives, and their poor performance at scale led us to develop a new tool designed specifically for big data archives. It combines the best features of git, bbcp, rsync and rcp. We call this tool "Mimic", and we discuss its design, performance comparisons, and its use at NASA's Planetary Plasma Interactions (PPI) Node of the Planetary Data System (PDS).