Toward Transparent and Reproducible Science: Using Open Source "Big Data" Tools for Water Resources Assessment

Tuesday, 16 December 2014: 8:15 AM
Wouter Buytaert (Imperial College London, Civil and Environmental Engineering and Grantham Institute for Climate Change, London, SW7, United Kingdom; Escuela Politécnica Nacional, Quito, Ecuador), Zed Diyana Zulkafli (Imperial College London, London, SW7, United Kingdom; Universiti Putra Malaysia, Department of Civil Engineering, Serdang, Malaysia) and Claudia Vitolo (Imperial College London, London, United Kingdom)
Transparency and reproducibility are fundamental properties of good science. In the current era of large and diverse datasets and long and complex workflows for data analysis and inference, ensuring such transparency and reproducibility is challenging. Hydrological science is a good case in point, because the discipline typically uses a large variety of datasets, ranging from local observations to large-scale remotely sensed products. These data are often obtained from many different sources and integrated using complex yet uncertain modelling tools. In this paper, we present and discuss methods for ensuring transparency and reproducibility in scientific workflows for hydrological data analysis for the purpose of water resources assessment, using relevant examples of emerging open source “big data” tools.

First, we discuss standards for data storage, access, and processing that can improve the modularity of a hydrological analysis workflow. In particular, standards emerging from the Open Geospatial Consortium (OGC), such as the Sensor Observation Service and the Web Coverage Service, hold promise. However, some aspects, such as the availability of data models and the ability to work with spatio-temporal subsets of large datasets, remain bottlenecks and need further development.
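To illustrate what working with a spatio-temporal subset can look like in practice, the following is a minimal sketch of how a Web Coverage Service 2.0 GetCoverage request with `subset` parameters might be assembled in Python. The endpoint URL, the coverage identifier, and the axis labels (`Lat`, `Long`, `ansi`) are hypothetical placeholders and not taken from any service described here; real servers advertise their own coverage identifiers and axis names through DescribeCoverage. The helper name `wcs_getcoverage_url` is likewise an assumption made for illustration.

```python
from urllib.parse import urlencode


def wcs_getcoverage_url(endpoint, coverage_id, subsets, fmt="image/tiff"):
    """Build a WCS 2.0 key-value-pair GetCoverage URL.

    `subsets` maps an axis label to a (low, high) pair. Axis labels are
    server specific; the ones used below are assumptions.
    """
    params = [
        ("service", "WCS"),
        ("version", "2.0.1"),
        ("request", "GetCoverage"),
        ("coverageId", coverage_id),
        ("format", fmt),
    ]
    # Each trimmed axis becomes its own repeated `subset` parameter.
    for axis, (low, high) in subsets.items():
        params.append(("subset", f"{axis}({low},{high})"))
    return endpoint + "?" + urlencode(params)


# Hypothetical endpoint, coverage name, and bounding box.
url = wcs_getcoverage_url(
    "https://example.org/wcs",
    "precipitation_daily",
    {
        "Lat": (-5.0, 1.5),
        "Long": (-81.0, -75.0),
        "ansi": ('"2014-01-01"', '"2014-01-31"'),
    },
)
print(url)
```

Building the request as an explicit list of parameters keeps the spatial and temporal extent of the analysis visible in the workflow script itself, so that the data retrieval step can be repeated exactly.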

Next, we focus on available methods to build transparent data processing workflows. Again, standards such as OGC’s Web Processing Service are being developed to facilitate web-based analytics. Yet, in practice, the experimental nature of these standards and web services in general often requires a more pragmatic approach. The availability of web technologies in popular open source data analysis environments such as R and Python often makes them an attractive solution for workflow creation and sharing.
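As one possible reading of the pragmatic approach mentioned above, the sketch below shows how a single processing step in a Python workflow could be wrapped so that its inputs and parameters are fingerprinted alongside the result. This is an illustrative assumption and not the method presented in the paper; the function names `run_step` and `mean_above`, the discharge values, and the threshold are all hypothetical.

```python
import hashlib
import json
from statistics import mean


def run_step(name, func, data, **params):
    """Apply `func` to `data` and return (result, provenance record).

    The record stores the step name, its parameters, and a SHA-256 hash
    of the serialized inputs, so a rerun can be checked against it.
    """
    payload = json.dumps({"data": data, "params": params}, sort_keys=True)
    record = {
        "step": name,
        "params": params,
        "input_sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
    }
    return func(data, **params), record


def mean_above(values, threshold):
    """Mean of the values at or above `threshold`, or None if there are none."""
    kept = [v for v in values if v >= threshold]
    return mean(kept) if kept else None


# Hypothetical discharge observations; the units are not specified here.
discharge = [1.2, 0.8, 3.4, 2.1, 0.5]
result, record = run_step("mean_above", mean_above, discharge, threshold=1.0)
print(result)
print(json.dumps(record, indent=2))
```

Sharing the script together with such records lets a reader confirm that a rerun used the same inputs and settings, which is one simple way to make a workflow easier to reproduce.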

Lastly, we elaborate on the potential that open source solutions hold in the context of participatory approaches to data collection and knowledge generation. Using examples from the tropical Andes and the Himalayas, we show how these technologies can help reduce the traditional knowledge gap by allowing participation from resource-constrained stakeholders.