IN51A-1795
Feature extraction from scientific datasets using Apache Spark

Friday, 18 December 2015
Poster Hall (Moscone South)
Jan Paral, National Center for Atmospheric Research, Boulder, CO, United States and Michael James Wiltberger, National Center for Atmospheric Research, High Altitude Observatory, Boulder, CO, United States
Abstract:
We present an example of feature extraction from scientific datasets, such as the output of global numerical models, using Apache Spark. The algorithm uses a simple penalized linear regression technique and a training dataset to learn a feature and then extract similar features from the rest of the data. Because it is built on Apache Spark, the algorithm can scale to a large number of computing nodes.
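
As an illustration only, the sketch below shows one way a penalized linear regression can be fitted so that the work splits across nodes. It assumes a ridge (L2) penalty, which the abstract does not specify, and it uses plain Python map and reduce calls as a stand-in for Spark rather than the Spark API itself. The function names, the toy data, the penalty strength of 0.1, and the 0.5 decision threshold are all hypothetical and are not taken from the poster.

```python
from functools import reduce


def partial_stats(rows):
    """Map step: accumulate X^T X and X^T y for one partition of (features, label) rows."""
    d = len(rows[0][0])
    xtx = [[0.0] * d for _ in range(d)]
    xty = [0.0] * d
    for x, y in rows:
        for i in range(d):
            xty[i] += x[i] * y
            for j in range(d):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def merge_stats(a, b):
    """Reduce step: the statistics from two partitions simply add."""
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [p + q for p, q in zip(a[1], b[1])]
    return xtx, xty


def solve(a, b):
    """Solve a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w


def fit_ridge(partitions, lam):
    """Fit ridge regression: solve (X^T X + lam I) w = X^T y from merged statistics."""
    xtx, xty = reduce(merge_stats, map(partial_stats, partitions))
    for i in range(len(xty)):
        xtx[i][i] += lam  # L2 penalty (assumed; applied to every weight, including the bias)
    return solve(xtx, xty)


def score(w, x):
    """Linear score of one sample under the fitted weights."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical training data split into two partitions.
# Each row is ([bias, value_a, value_b], label); label 1.0 marks the feature of interest.
partitions = [
    [([1.0, 0.9, 0.1], 1.0), ([1.0, 0.8, 0.2], 1.0)],
    [([1.0, 0.1, 0.9], 0.0), ([1.0, 0.2, 0.8], 0.0)],
]
weights = fit_ridge(partitions, lam=0.1)

# Apply the learned weights to unlabeled samples; 0.5 is an assumed threshold.
candidates = [[1.0, 0.85, 0.15], [1.0, 0.15, 0.85]]
flags = [score(weights, x) > 0.5 for x in candidates]
```

The design point is that each partition only has to produce a small matrix and vector, whatever its number of rows, so the per-node work is independent and the merge step is cheap. In an actual Spark job the map and reduce calls would be replaced by operations on a distributed dataset, or by a library regression routine.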