Contributing datasets from the Imaging FlowCytobot for Essential Ocean Variables

Katherine Qi1,2, Stace Beaulieu3, Joe Futrelle4, Emily Peacock4 and Heidi M Sosik4, (1)University of California San Diego, La Jolla, United States, (2)Woods Hole Oceanographic Institution, Woods Hole, United States, (3)Woods Hole Oceanographic Institution, Biology, Woods Hole, MA, United States, (4)Woods Hole Oceanographic Institution, Woods Hole, MA, United States
The development of plankton imaging systems has enabled novel insights into plankton ecology. As these systems become more widely used, the corresponding increase in data volume creates new challenges for data management and analysis. One such system is the Imaging FlowCytobot (IFCB), which captures images of particles in the 10 µm to 150 µm size range and provides metadata about the corresponding samples. In this study, we performed quantitative analyses on IFCB data from a 2018 Northeast U.S. Shelf Long Term Ecological Research (NES-LTER) cruise. The images were classified using a supervised machine learning algorithm trained with expert human annotations. By integrating the classifications with other metadata, we calculated taxon abundance and biovolume concentrations of specified phytoplankton groups across the shelf. With the goal of reproducing this workflow for previous and future NES-LTER transect cruises, we developed and documented online, open-source software tools that can be reconfigured for new IFCB datasets. The workflow implemented an intermediate data model to facilitate standardization to Essential Ocean Variables (EOVs) and reuse in global information systems. The results were packaged using Ecological Metadata Language (EML) and uploaded to the Environmental Data Initiative (EDI) repository. This framework enables the contribution of IFCB data products to community repositories in a way that addresses the Findable, Accessible, Interoperable, and Reusable (FAIR) principles. As data acquisition rates and volumes continue to grow, adopting reproducible practices for maintaining, analyzing, and sharing data becomes increasingly critical.
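
The step of combining classifications with sample metadata to obtain abundance and biovolume concentrations can be illustrated with a minimal sketch. It is not the software developed in this study. It assumes that each classified image is available as a (sample, taxon, biovolume) record and that the volume of seawater analyzed is known for each sample. The function name, field names, and units below are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def summarize_concentrations(rois, ml_analyzed):
    """Aggregate classified images into per-sample, per-taxon concentrations.

    rois: iterable of (sample_id, taxon, biovolume_um3) tuples, one per
        classified image (hypothetical record layout).
    ml_analyzed: dict mapping sample_id to the volume analyzed, in mL
        (assumed to come from the sample metadata).
    Returns a dict keyed by (sample_id, taxon).
    """
    counts = defaultdict(int)
    biovolumes = defaultdict(float)
    for sample_id, taxon, biovolume_um3 in rois:
        key = (sample_id, taxon)
        counts[key] += 1
        biovolumes[key] += biovolume_um3

    summary = {}
    for (sample_id, taxon), count in counts.items():
        volume_ml = ml_analyzed[sample_id]
        if volume_ml <= 0:
            raise ValueError(f"non-positive volume analyzed for {sample_id}")
        summary[(sample_id, taxon)] = {
            # Abundance concentration: image count per mL analyzed.
            "abundance_per_ml": count / volume_ml,
            # Biovolume concentration: summed biovolume per mL analyzed.
            "biovolume_um3_per_ml": biovolumes[(sample_id, taxon)] / volume_ml,
        }
    return summary


if __name__ == "__main__":
    # Made-up records for illustration only; not data from the cruise.
    example_rois = [
        ("sample_1", "taxon_A", 1000.0),
        ("sample_1", "taxon_A", 3000.0),
        ("sample_1", "taxon_B", 500.0),
    ]
    print(summarize_concentrations(example_rois, {"sample_1": 2.0}))
```

Dividing by the volume analyzed per sample, rather than by a fixed nominal volume, keeps concentrations comparable across samples for which different amounts of seawater were imaged.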