A Machine Learning Approach to Quality Control Oceanographic Data

Guilherme P Castelao, Scripps Institution of Oceanography, La Jolla, CA, United States
Abstract:
Sampling errors are inevitable when measuring the oceans; thus, a quality control (QC) procedure capable of detecting spurious measurements is necessary to produce reliable datasets. While manual QC by human experts results in the best data quality, it quickly becomes inefficient when handling large datasets, and it is also vulnerable to inconsistencies between different individuals. Even though automatic QC may overcome those issues, traditional QC methods often result in high rates of false positives. Here, I present a machine learning approach to automatically QC oceanographic data based on the Anomaly Detection technique. Multiple tests are combined into a single multidimensional criterion that learns the behavior of the good measurements and identifies bad samples as outliers. When applied to 13 years of hydrographic profiles, the Anomaly Detection approach resulted in the best classification performance, reducing the error by at least 50% with respect to standard procedures. In support of the International Quality-Controlled Ocean Database (IQuOD) initiative, the Anomaly Detection approach was implemented as a reinforced learning system to coordinate and optimize the human expert workforce, allowing for unprecedented efficiency in scanning the whole World Ocean Database for misflagged measurements. An open-source Python package, CoTeDe, was developed to implement Anomaly Detection and other state-of-the-art procedures to quality control oceanographic data profiles (e.g., CTD, XBT, Argo, underwater gliders) as well as along-track measurements such as those from thermosalinographs (TSGs).
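The idea of combining several tests into one multidimensional criterion can be illustrated with a minimal sketch. It is not the CoTeDe implementation and does not use the CoTeDe API. It assumes, purely for illustration, that each test yields one numeric feature per measurement (for example a spike magnitude and a vertical gradient), that the features of good measurements are roughly normally distributed, and that the features are independent, so their log tail probabilities can be summed. The function names, the synthetic training data, and the 1% threshold are all hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def fit_features(good_rows):
    """Fit one normal distribution per feature, using good measurements only."""
    columns = list(zip(*good_rows))
    return [NormalDist(fmean(col), stdev(col)) for col in columns]


def log_tail_probability(dist, value):
    """Log of the two-sided probability of a value at least this extreme."""
    z = abs(value - dist.mean) / dist.stdev
    tail = 2.0 * (1.0 - NormalDist().cdf(z))
    # Floor the probability so that log() is defined for extreme values.
    return math.log(max(tail, 1e-300))


def anomaly_score(dists, row):
    """Single criterion: sum of per-feature log tail probabilities.

    Summing assumes the features are independent (an assumption of this sketch).
    Lower (more negative) scores mean less typical measurements.
    """
    return sum(log_tail_probability(d, v) for d, v in zip(dists, row))


def flag_bad(dists, rows, threshold):
    """Flag as bad every measurement whose score falls below the threshold."""
    return [anomaly_score(dists, row) < threshold for row in rows]


# Synthetic "good" training set: two hypothetical test features per measurement.
random.seed(0)
good = [[random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(1000)]
dists = fit_features(good)

# Hypothetical threshold: accept roughly 99% of the good training measurements.
train_scores = sorted(anomaly_score(dists, row) for row in good)
threshold = train_scores[int(0.01 * len(train_scores))]

# One typical measurement and one that is extreme in both features.
flags = flag_bad(dists, [[0.1, -0.2], [8.0, 9.0]], threshold)
```

In this sketch a measurement that is only mildly unusual in each individual test can still be flagged when the combined score is low, which is the practical difference from applying each test with its own fixed threshold. In practice the threshold would be tuned against expert-flagged data rather than set from a fixed percentile.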