IN43A-1717
Parallel and Scalable Clustering and Classification for Big Data in Geosciences

Thursday, 17 December 2015
Poster Hall (Moscone South)
Morris Riedel, Jülich Supercomputing Centre, Jülich, Germany
Abstract:
Machine learning, data mining, and statistical computing are common techniques for data analysis in the earth sciences. This contribution focuses on two concrete and widely used data analytics methods suitable for analysing 'big data' in geoscience use cases: clustering and classification. From the broad class of available clustering methods, we focus on the density-based spatial clustering of applications with noise (DBSCAN) algorithm, which enables the identification of outliers or interesting anomalies. A new open source, parallel and scalable DBSCAN implementation will be discussed in the light of a scientific use case that detects water mixing events in the Koljoefjords. The second technique we cover is classification, with a focus on support vector machines (SVMs), one of the best out-of-the-box classification algorithms. A parallel and scalable SVM implementation will be discussed in the light of a scientific use case in the field of remote sensing with 52 different classes of land cover types.
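
To illustrate how DBSCAN separates dense clusters from outliers, the following is a minimal serial sketch of the textbook algorithm in Python. It is not the parallel implementation discussed in this contribution; the function names, the sample points, and the parameter values (`eps=0.2`, `min_pts=3`) are assumptions chosen purely for illustration.

```python
import math

NOISE = -1  # label assigned to points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Serial textbook DBSCAN: one label per point, NOISE (-1) for outliers."""
    labels = [None] * len(points)  # None means "not yet visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            # Not a core point; may later be relabelled as a border point.
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep growing
    return labels


# Hypothetical sample data: two dense groups and one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (2.5, 9.0),
]
labels = dbscan(points, eps=0.2, min_pts=3)
print(labels)
```

In this sketch the isolated point receives the label `-1`, which is how DBSCAN flags outliers or anomalies. Each neighbourhood query scans all points, so the cost grows quadratically with the data size, which is one reason parallel and scalable implementations matter for large geoscience data sets.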