Re-training a Joint U-Net-CNN Deep Learning Image Classification Pipeline for the Segmentation of Subsea Macrofauna

Mitchell Scott1, Bhuvan Malladihalli Shashidhara2 and Aaron Marburg1, (1)Applied Physics Laboratory, University of Washington, Seattle, WA, United States, (2)University of Washington, Seattle, WA, United States
Abstract:
The Ocean Observatories Initiative (OOI) Regional Cabled Array (RCA) hosts a high-definition video camera (CamHD) located at the Ashes hydrothermal vent field within the Axial volcano caldera. The camera, positioned 1.5 m from the 2 m tall active vent Mushroom, provides an unprecedented opportunity for visual observation of a chemotrophic vent ecosystem. A variety of macrofauna, including scale worms, live within this ecosystem, and the CamHD video stream provides a time-series observation of these fauna in their natural environment. Unfortunately, given the sheer number of videos, effective quantification of faunal populations can realistically be achieved only through automated methods such as deep learning. However, the fauna are both small and well camouflaged, making object recognition and segmentation difficult. In the presented work, a deep learning image classification pipeline has been implemented to segment scale worms within static CamHD images.

The presented segmentation pipeline consists of two separate deep learning networks: U-Net, originally designed for biomedical image segmentation, and a VGG16 CNN, a widely used convolutional neural network. Images from CamHD are fed to the U-Net model, which performs semantic segmentation; the resulting candidate detections are then cascaded into the VGG16 CNN for a final yes/no classification.
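As an illustration of this two-stage design, the sketch below shows how a segmentation mask from a U-Net-style network can gate fixed-size crops that a VGG16 head then confirms or rejects. The framework (PyTorch), the simplified TinyUNet stand-in, and all function and parameter names here are assumptions for illustration only; the abstract does not describe the authors' implementation.

```python
# Hypothetical sketch of the two-stage cascade (not the authors' code):
# a U-Net produces a scale-worm probability mask, a candidate region is
# cropped around that mask, and a VGG16 head makes a per-crop yes/no call.
import torch
import torch.nn as nn
import torchvision.models as models


class TinyUNet(nn.Module):
    """Minimal encoder-decoder with one skip connection, standing in for the full U-Net."""

    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.head = nn.Conv2d(32, 1, 1)  # per-pixel scale-worm logit

    def forward(self, x):
        e = self.enc(x)
        m = self.up(self.mid(self.down(e)))
        return self.head(torch.cat([e, m], dim=1))


def build_vgg16_classifier():
    """VGG16 with its final layer replaced by a single binary (worm / no-worm) output."""
    vgg = models.vgg16(weights=None)      # pretrained weights omitted in this sketch
    vgg.classifier[6] = nn.Linear(4096, 1)
    return vgg


def cascade(image, unet, vgg, crop_size=224, threshold=0.5):
    """Run the U-Net, crop a fixed-size window around the masked region,
    and let VGG16 confirm or reject the detection."""
    with torch.no_grad():
        mask = torch.sigmoid(unet(image)) > threshold          # [1, 1, H, W]
        ys, xs = torch.nonzero(mask[0, 0], as_tuple=True)
        if len(ys) == 0:
            return []                                          # nothing segmented
        # A single crop centred on the mask; a full pipeline would iterate
        # over each connected component instead.
        cy, cx = int(ys.float().mean()), int(xs.float().mean())
        H, W = image.shape[-2:]
        y0 = min(max(cy - crop_size // 2, 0), H - crop_size)
        x0 = min(max(cx - crop_size // 2, 0), W - crop_size)
        crop = image[..., y0:y0 + crop_size, x0:x0 + crop_size]
        score = torch.sigmoid(vgg(crop))                       # yes/no confidence
        return [(y0, x0, float(score))]


# Example usage on a stand-in frame (CamHD stills are high-definition):
unet, vgg = TinyUNet(), build_vgg16_classifier()
frame = torch.rand(1, 3, 1080, 1920)
detections = cascade(frame, unet, vgg)
```

Because the two stages are trained separately, each has its own training data and schedule, which is what makes the re-training burden discussed below relevant.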

When trained and tested on a subset of Mushroom scenes, this pipeline achieves an overall Average Precision of 0.671 at an Intersection over Union (IoU) threshold of 0.5 and a U-Net pixel-wise validation accuracy of 0.99. However, system accuracy was observed to decay as the test data became more temporally distant from the training data, likely due to the dynamic nature of the Mushroom vent scene. Such decay mandates regular network re-training, a tedious task given that the two sub-components must be trained independently of one another. This work demonstrates that the U-Net portion of the pipeline is more robust to temporal variation than the CNN, so the U-Net requires re-training far less frequently than the accompanying CNN. Furthermore, we show that substantial U-Net data augmentation typically has negligible impact on network performance, whereas CNN performance varies substantially with training data size.
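For reference, the IoU criterion underlying the reported Average Precision can be computed as below. This is the generic definition for binary masks, assumed here (including the iou helper name) rather than taken from the authors' evaluation code: a predicted worm region counts as a true positive when its IoU with the ground-truth mask is at least 0.5.

```python
# Illustrative IoU computation for binary segmentation masks (assumed
# metric definition, not the authors' evaluation code).
import numpy as np


def iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection over Union of two boolean segmentation masks."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return np.logical_and(pred, gt).sum() / union
```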