The many facets of integrating data and metadata for research networks: experience from the AmeriFlux Network

Thursday, 18 December 2014: 4:52 PM
Gilberto Pastorello1, Cristina Poindexter2, Catharine van Ingen1, Dario Papale3 and Deb Agarwal4, (1)Lawrence Berkeley National Laboratory, Berkeley, CA, United States, (2)University of California Berkeley, Berkeley, CA, United States, (3)Tuscia University, Department for Innovation in Biological, Agro-food and Forest systems (DIBAF), Viterbo, Italy, (4)LBNL, Berkeley, CA, United States
Grassroots research networks, such as AmeriFlux, require data and metadata integration across multiple, independently managed field sites, scales, and science domains. The goal of these networks is the production of consistent datasets that enable investigation of broad science questions at regional and global scales. These datasets combine data from a large number of data providers, who often use different data collection protocols and data processing approaches. In this scenario, data integration and curation quickly become large-scale efforts. This presentation reports on our experience with integration efforts for the AmeriFlux network, in which we are working to integrate flux, meteorological, biological, soil, chemistry, and disturbance data and metadata. Our data management activities span acquisition and publication mechanisms, quality control, processing and product generation, synchronized versioning and archiving of data and software, and interaction mechanisms and tools for data providers and data users. To enable consistent data processing and network-level data quality, we built combinations of automated and visual data quality assessment procedures, extending checks already performed at the site level. Implementing community-developed and trusted algorithms to operate in production mode proved to be a key aspect of data product generation, with extensive testing and validation among the main concerns. Clear definitions of data processing levels make it easy to track different data products and data quality levels. For metadata and ancillary information, formatting standards are even more relevant, since the variables collected are considerably more heterogeneous. Documentation and training on the standards were crucial in this case, with instruction sessions proving an effective approach, given that documentation cannot cover all the different scenarios at different sites.
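As a minimal illustration of the kind of automated quality assessment described above, the sketch below flags values that fall outside plausible physical ranges. The variable codes and plausibility limits are hypothetical examples chosen for illustration, not the network's actual variable definitions or thresholds:

```python
# Illustrative sketch of an automated range check on half-hourly records.
# Variable codes and limits below are hypothetical, not AmeriFlux's
# actual quality-control thresholds.

# Hypothetical plausible physical ranges per variable (units in comments).
PLAUSIBLE_RANGES = {
    "TA": (-60.0, 60.0),     # air temperature, deg C
    "SW_IN": (0.0, 1500.0),  # incoming shortwave radiation, W m-2
    "FC": (-100.0, 100.0),   # CO2 flux, umol m-2 s-1
}

def range_check(variable, values):
    """Return one flag per value: 0 = passes the range check,
    1 = missing or outside the plausible range.
    Variables without a defined range are left unflagged (0)."""
    lo, hi = PLAUSIBLE_RANGES.get(variable, (float("-inf"), float("inf")))
    flags = []
    for v in values:
        out_of_range = v is None or not (lo <= v <= hi)
        flags.append(1 if out_of_range else 0)
    return flags
```

In a production pipeline, per-value flags of this kind could be aggregated into the site- and network-level quality summaries that reviewers and data providers inspect visually.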
This work is being developed in close coordination with ICOS in Europe and other regional networks, aiming at regular, harmonized releases of data and metadata products for FLUXNET, the global network of flux monitoring networks.