Towards a Balanced-Labeled-Dataset of Planktons for a Better In-Situ Taxa Identification
Abstract:
Labeled plankton datasets used for training commonly suffer from class imbalance. Most of the data belong to a few categories, so classifiers tend to be biased toward the highly represented classes. To mitigate this issue, classical strategies based on class resampling and cost-sensitive training have been introduced. Although these strategies reduce the class imbalance, they may cause overfitting or discard valuable information [Huang et al., 2016, Krawczyk, 2016]. In the CGAN-Plankton approach proposed by Wang et al. [2017], an offline deep neural network (DNN) generates synthetic data, adding instances to under-represented classes. The resulting dataset is used to improve DNN classification in an encapsulated manner that does not affect biological identification processes.
In this paper, we provide a thorough comparison of methods that address the class imbalance problem in labeled plankton datasets such as WHOI [Orenstein et al., 2015], ZooScan [Gorsky et al., 2010], and Kaggle [Cowen and Guigand, 2008]. CGAN-Plankton is found to produce a well-balanced dataset that yields better classification accuracy. To further assess its real-world applicability, we use the produced dataset, together with augmentation techniques such as those in [Cui et al., 2015], to train a DNN architecture deployed on an autonomous underwater system for in-situ plankton identification. Results show a significant improvement in DNN classification performance on in-situ captured data.
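As an illustration of the two classical strategies mentioned above, the following minimal Python sketch computes inverse-frequency class weights (one common form of cost-sensitive training) and balances a label list by random oversampling (one common form of class resampling). It is not the procedure used in this paper or in the cited works; the function names, the weighting formula, and the toy class names are assumptions chosen for illustration only.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a per-class loss weight proportional to 1 / class frequency.

    Assumed formula: weight_c = n_samples / (n_classes * count_c), so a
    perfectly balanced dataset gives every class a weight of 1.0.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


def random_oversample(labels, seed=0):
    """Return sample indices in which every class appears equally often.

    Minority classes are padded by drawing their own indices with
    replacement until they match the size of the largest class.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = max(len(indices) for indices in by_class.values())
    balanced = []
    for indices in by_class.values():
        balanced.extend(indices)
        balanced.extend(rng.choices(indices, k=target - len(indices)))
    return balanced


if __name__ == "__main__":
    # Hypothetical toy labels, not drawn from WHOI, ZooScan, or Kaggle.
    toy_labels = ["copepod"] * 6 + ["diatom"] * 3 + ["larva"] * 1
    print(inverse_frequency_weights(toy_labels))
    balanced_indices = random_oversample(toy_labels)
    print(Counter(toy_labels[i] for i in balanced_indices))
```

The oversampling step only duplicates existing minority samples, which is the source of the overfitting risk noted above. Generating new synthetic instances, as CGAN-Plankton does, is one way to avoid repeating the same images.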