Using Machine Learning to Find Relationships in Oceanographic Datasets

Christopher Holder and Anand Gnanadesikan, Johns Hopkins University, Department of Earth and Planetary Sciences, Baltimore, MD, United States
Abstract:
Understanding the relationships between phytoplankton and the environmental factors that limit their growth can be difficult. Recently, machine learning (ML) methods have been used to untangle these interactions. However, little work has tested whether the ML methods recover the correct relationships between the variables. We created a simple phytoplankton model based on realistic environmental relationships to investigate whether ML methods can recover the correct relationships and to identify the limitations of these methods.

In the simple model, phytoplankton biomass was the response (dependent) variable, and phosphate, iron, and light were the predictor (independent) variables. Values for the predictor variables came from a biogeochemical model whose values were already based on observational distributions for each predictor.
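As an illustration of what such a simple model might look like, the sketch below computes biomass from the three predictors. The functional form is an assumption, not taken from this abstract: each predictor gets a saturating (Michaelis-Menten style) limitation term, the more limiting of the two nutrients is selected, and the result is scaled by the light term. The names `limitation` and `simple_biomass` and all half-saturation constants are hypothetical placeholders.

```python
def limitation(x, half_saturation):
    """Saturating (Michaelis-Menten style) response, between 0 and 1."""
    return x / (half_saturation + x)


def simple_biomass(phosphate, iron, light,
                   k_p=0.1, k_fe=0.2, k_light=30.0, scale=1.0):
    """Hypothetical biomass: most limiting nutrient term times light term.

    The half-saturation constants (k_p, k_fe, k_light) and the scale are
    illustrative values only; they are not taken from the abstract.
    """
    nutrient_term = min(limitation(phosphate, k_p), limitation(iron, k_fe))
    return scale * nutrient_term * limitation(light, k_light)


# Phosphate is the limiting nutrient here (0.5 versus 0.75 for iron).
example = simple_biomass(phosphate=0.1, iron=0.6, light=30.0)
print(example)
```

Because the true relationships are known by construction in a model of this kind, any curve an ML method recovers can be checked directly against them.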

The output from the simple model was split into training and testing datasets. We trained a multiple linear regression (MLR) model, a random forest (RF), and a neural network ensemble (NNE) using the training dataset. When applied to the testing dataset, the NNE showed the best performance (R² > 0.99) of the ML methods.
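The train/test workflow can be sketched as follows for the MLR baseline, using NumPy only. This is a minimal sketch under stated assumptions: the data are synthetic stand-ins, the 80/20 split ratio is not given in the abstract, and the helper names (`fit_mlr`, `predict_mlr`, `r_squared`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): three predictors, one nonlinear response.
X = rng.uniform(0.0, 1.0, size=(1000, 3))
y = X[:, 0] / (0.1 + X[:, 0]) * X[:, 2] + 0.5 * X[:, 1]

# Random 80/20 split into training and testing indices (ratio is an assumption).
idx = rng.permutation(len(y))
n_train = int(0.8 * len(y))
train, test = idx[:n_train], idx[n_train:]


def fit_mlr(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ..., bk]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_mlr(coef, X):
    """Apply fitted coefficients to new predictor rows."""
    return coef[0] + X @ coef[1:]


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


# Fit on the training rows only, then score on the held-out testing rows.
coef = fit_mlr(X[train], y[train])
r2_test = r_squared(y[test], predict_mlr(coef, X[test]))
print(f"MLR test R^2: {r2_test:.3f}")
```

The same split and scoring function could be reused for a random forest or a neural network ensemble, for example with a library such as scikit-learn, so that all methods are compared on identical held-out data.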

To examine the relationships in the model, we created an artificial set of observations in which one predictor was varied across its full range (minimum to maximum) while the other predictors were held at constant values. These new predictor values were then run through the simple model and were also provided to the MLR and ML methods. In every case, the NNE provided the closest estimates to the actual relationships.
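The one-at-a-time analysis described above can be sketched as follows. This is an illustrative sketch only: holding the other predictors at their median is an assumption (the abstract says only "a constant value"), and `sensitivity_grid`, `true_model`, and the synthetic data are hypothetical.

```python
import numpy as np


def sensitivity_grid(X, vary_col, n_points=50):
    """Vary one column across its min-max range; hold the rest at their median."""
    X = np.asarray(X, dtype=float)
    base = np.median(X, axis=0)              # constant value for each predictor
    grid = np.tile(base, (n_points, 1))      # n_points identical rows
    grid[:, vary_col] = np.linspace(X[:, vary_col].min(),
                                    X[:, vary_col].max(), n_points)
    return grid


def true_model(X):
    """Hypothetical known relationship standing in for the simple model."""
    return X[:, 0] / (0.1 + X[:, 0]) * X[:, 2]


rng = np.random.default_rng(1)
X_obs = rng.uniform(0.0, 1.0, size=(500, 3))   # synthetic stand-in predictors

# Artificial observations: predictor 0 sweeps its range, predictors 1 and 2 fixed.
grid = sensitivity_grid(X_obs, vary_col=0)
true_curve = true_model(grid)
print(grid.shape, true_curve[:3])
```

Passing the same `grid` to each fitted method and comparing its predictions with `true_curve` shows how closely that method reproduces the underlying relationship for the varied predictor.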

Our results show that the ML methods are capable of finding close approximations of the relationships in phytoplankton datasets. We also found that several limitations may arise depending on the spatiotemporal resolution of the data. In future research, we plan to apply these techniques to phytoplankton observational datasets and Earth System Model outputs.