Agile data management for curation of genomes to watershed datasets

Monday, 14 December 2015
Poster Hall (Moscone South)
Deb Agarwal1, Charukela Varadharajan2, Boris Faybishenko2 and Roelof Versteeg3, (1)LBNL, Berkeley, CA, United States, (2)Lawrence Berkeley National Laboratory, Berkeley, CA, United States, (3)Subsurface Insights, Hanover, NH, United States
A software platform is being developed for data management and assimilation (DMA) as part of the U.S. Department of Energy’s Genomes to Watershed Sustainable Systems Science Focus Area 2.0. The DMA components and capabilities are driven by the project science priorities, and development follows agile techniques. The goal of the DMA software platform is to enable users to integrate and synthesize diverse field, laboratory, and simulation datasets, including geological, geochemical, geophysical, microbiological, hydrological, and meteorological data, across a range of spatial and temporal scales.

The DMA objectives are (a) developing an integrated interface to the datasets, (b) storing field monitoring data and laboratory analytical results for collected water and sediment samples in a database, (c) providing automated QA/QC analysis of data, and (d) working with data providers to modify high-priority field and laboratory data collection and reporting procedures as needed. The first three objectives are driven by user needs, while the last is driven by data management needs.
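As an illustration of what automated QA/QC under objective (c) might involve, the following is a minimal sketch, not the project's implementation. It flags missing values, values outside a plausible range, and sudden jumps in a time series such as water level. The names `Reading` and `qc_flags`, the thresholds, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Reading:
    """One sensor observation (hypothetical structure)."""
    timestamp: str
    value: Optional[float]  # None means the sensor reported nothing


def qc_flags(readings: List[Reading], lo: float, hi: float,
             max_step: float) -> List[str]:
    """Return one QC flag per reading.

    Flags: "missing", "out_of_range", "spike", or "ok".
    A spike is a jump larger than max_step from the last accepted value.
    """
    flags = []
    last_good = None  # most recent value that passed all checks
    for r in readings:
        if r.value is None:
            flags.append("missing")
        elif not (lo <= r.value <= hi):
            flags.append("out_of_range")
        elif last_good is not None and abs(r.value - last_good) > max_step:
            flags.append("spike")
        else:
            flags.append("ok")
            last_good = r.value
    return flags


# Illustrative water-level series in metres (made-up values).
series = [
    Reading("2015-06-01T00:00", 2.10),
    Reading("2015-06-01T01:00", 2.12),
    Reading("2015-06-01T02:00", None),
    Reading("2015-06-01T03:00", 9.80),
    Reading("2015-06-01T04:00", 2.11),
    Reading("2015-06-01T05:00", 55.0),
]
flags = qc_flags(series, lo=0.0, hi=10.0, max_step=1.0)
```

Comparing each value against the last accepted reading, rather than the immediately preceding one, keeps a single bad value from causing the following good value to be flagged as well.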

The project needs and priorities are reassessed regularly with the users. After each user session we set development priorities to match the user priorities identified. For instance, data QA/QC and collection activities have focused on the data and products needed for ongoing scientific analyses (e.g., water level and geochemistry). We have also developed, tested, and released a broker and portal that integrate diverse datasets from two different databases used for curation of project data. The development of the user interface was based on a user-centered design process involving several user interviews and constant interaction with data providers. The initial version focuses on the most requested feature: finding the data needed for analyses through an intuitive interface. Once the data are found, the user can immediately plot and download them through the portal. The resulting interface is intuitive and presents the highest-priority datasets that users need.
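The broker idea described above, a single search interface in front of two separate databases, can be sketched as follows. This is an assumed design for illustration only: the `Broker` class, the record fields, and the two in-memory stand-ins for the databases are hypothetical and do not reflect the actual DMA code or schemas.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
# A source is any callable that returns records for a variable name.
Source = Callable[[str], List[Record]]


class Broker:
    """Fans a query out to every registered source and merges the results."""

    def __init__(self) -> None:
        self._sources: Dict[str, Source] = {}

    def register(self, name: str, source: Source) -> None:
        self._sources[name] = source

    def search(self, variable: str) -> List[Record]:
        merged: List[Record] = []
        for name, source in self._sources.items():
            for rec in source(variable):
                # Tag each record with its origin so users can trace it.
                merged.append({**rec, "source": name})
        # ISO-8601 date strings sort chronologically as plain text.
        return sorted(merged, key=lambda r: r["timestamp"])


# In-memory stand-ins for the two databases (made-up values).
FIELD_DB = [
    {"variable": "nitrate", "timestamp": "2015-06-01", "value": 1.9},
    {"variable": "nitrate", "timestamp": "2015-06-03", "value": 2.1},
    {"variable": "water_level", "timestamp": "2015-06-01", "value": 2.10},
]
LAB_DB = [
    {"variable": "nitrate", "timestamp": "2015-06-02", "value": 2.0},
]


def make_source(rows: List[Record]) -> Source:
    """Wrap a list of rows as a searchable source."""
    return lambda variable: [r for r in rows if r["variable"] == variable]


broker = Broker()
broker.register("field", make_source(FIELD_DB))
broker.register("lab", make_source(LAB_DB))
hits = broker.search("nitrate")
```

Here `hits` holds the nitrate records from both stand-in databases in time order, each tagged with its origin, which is the kind of merged view a portal could plot or offer for download.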

Our agile approach has enabled us to build a system that keeps pace with the science needs while using limited resources.