Ocean Sciences meets Big Data Analytics

Bonnie L Hurwitz (1), Illyoung Choi (2) and John Hartman (2); (1) University of Arizona, Agricultural and Biosystems Engineering, Tucson, AZ, United States; (2) University of Arizona, Computer Science, Tucson, AZ, United States
Abstract:
Hundreds of researchers worldwide have joined forces in the Tara Oceans Expedition to create an unprecedented planetary-scale dataset comprising state-of-the-art next-generation sequencing, microscopy, and physical/chemical metadata to explore ocean biodiversity. This summer the complete collection of data from the 2009-2013 Tara voyage was released. Yet, despite herculean efforts by the Tara Oceans Consortium to make raw data and computationally derived assemblies and gene catalogs available, most researchers are stymied by the sheer volume of the data. The most tantalizing research questions lie in understanding the unifying principles that guide the distribution of organisms across the sea and affect climate and ecosystem function. To use the data in this capacity, researchers must download, integrate, and analyze more than 7.2 trillion bases of metagenomic data (~9 TB of raw data) and associated metadata from viruses, bacteria, archaea, and small eukaryotes at their own data centers. Accessing large-scale datasets in this way impedes scientists from replicating and building on prior work. To address this, we are developing a data platform called the Ocean Cloud Commons (OCC) as part of the iMicrobe project. The OCC is built using an algorithm we developed to pre-compute massive comparative metagenomic analyses in a Hadoop big data framework. Maintaining data in a cloud commons gives researchers access to scalable computation and real-time analytics, promoting the integrated and broad use of planetary-scale datasets such as Tara.