A33A-0123
Running a climate model in a commercial cloud computing environment: A case study using the Community Earth System Model (CESM)
Wednesday, 16 December 2015
Poster Hall (Moscone South)
Xiuhong Chen, Xianglei Huang, Chaoyi Jiao, Mark Flanner, Todd Raeker and Brock Palen, University of Michigan Ann Arbor, Ann Arbor, MI, United States
Abstract:
Numerical models are the major tools used in studies of climate change and climate projection. Because of the enormous complexity of such climate models, they are usually run at supercomputing centers or at least on high-performance computing (HPC) clusters. The cloud computing environment, however, offers an alternative option for running climate models. Compared to a traditional supercomputing environment, cloud computing offers more flexibility but also poses extra technical challenges. Using the Community Earth System Model (CESM) as a case study, we test the feasibility of running a climate model in a cloud-based virtual computing environment. Using the cloud computing resources offered by the Amazon Web Services (AWS) Elastic Compute Cloud (EC2) and StarCluster, an open-source tool that can set up a virtual cluster, we investigate how to run the CESM on AWS EC2 and how efficiently the CESM parallelizes on the AWS virtual cluster. We created a virtual computing cluster using StarCluster on AWS EC2 instances and carried out CESM simulations on that virtual cluster. We then compared the wall-clock time for one year of CESM simulation on the virtual cluster with that on a local HPC cluster with InfiniBand interconnects operated by the University of Michigan. The results show that the CESM can be scaled efficiently with the number of CPUs on the AWS EC2 virtual cluster, with a parallelization efficiency comparable to that on the local HPC cluster. For the standard configuration of the CESM at a spatial resolution of 1.9 degrees latitude by 2.5 degrees longitude, increasing the number of CPUs from 16 to 64 reduces the wall-clock running time by more than a factor of two, and the scaling is nearly linear. Beyond 64 CPUs, the communication latency starts to outweigh the savings from distributed computing, and the parallelization efficiency nearly levels off.
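As a rough sketch of the StarCluster-based workflow described above: the abstract does not specify the actual cluster template, AMI, or instance type used, so every name, the AMI ID, and the instance type below are hypothetical placeholders, not the study's configuration.

```shell
# Minimal sketch of launching a virtual cluster with StarCluster on AWS EC2.
# All identifiers below (cesmcluster, mykey, ami-xxxxxxxx, instance type,
# cluster size) are illustrative placeholders only.

# ~/.starcluster/config (fragment):
#   [aws info]
#   AWS_ACCESS_KEY_ID = <your access key id>
#   AWS_SECRET_ACCESS_KEY = <your secret key>
#
#   [cluster cesmcluster]
#   KEYNAME = mykey
#   CLUSTER_SIZE = 8              # e.g. 8 nodes x 8 cores = 64 CPUs
#   NODE_IMAGE_ID = ami-xxxxxxxx  # an MPI-ready AMI suitable for CESM
#   NODE_INSTANCE_TYPE = cc2.8xlarge

# Launch the virtual cluster from the template, log in to the master node
# (where the CESM case would be built and submitted), and shut it down later:
starcluster start -c cesmcluster mycesm
starcluster sshmaster mycesm
starcluster terminate mycesm
```

The `starcluster start` command provisions the EC2 instances and wires them into an MPI-capable cluster, which is what makes a multi-node CESM run possible in this environment.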