Towards a Balanced-Labeled-Dataset of Planktons for a Better In-Situ Taxa Identification

Oda Scheen Kiese, Aya Saad and Prof. Annette Stahl, Norwegian University of Science and Technology, Engineering Cybernetics, Trondheim, Norway
Abstract:
Because of the ecological importance of plankton, studying their abundance and dispersion in-situ drives much recent research in oceanography. With the introduction of underwater marine robots equipped with sensors and advanced cameras, in-situ identification and classification of microscopic underwater organisms are now possible [Sieracki et al., 2010; Jaffe, 2014]. This classification task depends heavily on efficient analysis of images captured in the water column. Plankton organisms can be identified and classified efficiently in-situ by adopting deep neural network (DNN) architectures and installing them on underwater robot systems. However, the accuracy achieved depends on the quality of the labeled dataset used for DNN training.

Labeled planktonic datasets used for training commonly suffer from class imbalance. Most of the data belongs to a few categories, so the classifier tends to be biased toward the highly represented classes. Classical strategies to mitigate this issue are based on class resampling and cost-sensitive training. While these strategies address the class imbalance, they may cause overfitting or discard valuable information [Huang et al., 2016; Krawczyk, 2016]. CGAN-Plankton, proposed by Wang et al. [2017], instead uses a DNN trained offline to generate synthetic data, adding instances to classes with low representation. The resulting dataset is used to enhance DNN classification capability in an encapsulated manner that does not affect biological identification processes.
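The two classical strategies mentioned above can be illustrated with a minimal sketch. It is not the procedure used in this work: the class names, sample values, and the helper functions `class_weights` and `random_oversample` are hypothetical, and the inverse-frequency weighting formula is one common choice among several.

```python
import random
from collections import Counter


def class_weights(labels):
    """Cost-sensitive training: inverse-frequency weight per class,
    computed as n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


def random_oversample(samples, labels, seed=0):
    """Class resampling: duplicate minority-class samples at random
    until every class is as large as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for y, items in by_class.items():
        # Extra items are exact copies, which is why this can overfit.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([y] * target)
    return out_samples, out_labels


# Hypothetical imbalanced toy dataset: 8 images of one taxon, 2 of another.
labels = ["copepod"] * 8 + ["diatom"] * 2
samples = list(range(10))  # stand-ins for image identifiers

weights = class_weights(labels)
bal_samples, bal_labels = random_oversample(samples, labels)
```

In this toy case the rare class receives a weight four times that of the frequent class, and oversampling brings both classes to eight instances. Because the added instances are exact duplicates, the sketch also shows why resampling can lead to overfitting, which is the limitation that synthetic data generation aims to avoid.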

In this paper, we provide a thorough comparison of methodologies that address the class imbalance problem on labeled planktonic datasets such as WHOI [Orenstein et al., 2015], ZooScan [Gorsky et al., 2010], and Kaggle [Cowen and Guigand, 2008]. CGAN-Plankton is found to produce a well-balanced dataset that yields better classification accuracy. To further assess its real-world applicability, we use the produced dataset and apply augmentation techniques such as those in [Cui et al., 2015] to train a DNN architecture mounted on an autonomous underwater system for in-situ plankton identification. Results show a significant improvement in DNN classification performance on data captured in-situ.