Parallelization of the Legendre Transform for a Geodynamics Code

Monday, 15 December 2014
Harsha Venkata Lokavarapu, Hiroaki Matsui and Eric M Heien, University of California Davis, Davis, CA, United States
Calypso is a geodynamo code designed to model magnetohydrodynamics of a Boussinesq fluid in a rotating spherical shell, such as the outer core of Earth. The code has been shown to scale well on computer clusters capable of computing at the order of millions of core hours. Depending on the resolution and time requirements, simulations may require weeks to years of clock time for specific target problems. A significant portion of the code execution time is spent transforming computed quantities between physical values and spherical harmonic coefficients, equivalent to a series of linear algebra operations. Intermixing C and Fortran code has opened the door to the parallel computing platform, Cuda and its associated libraries. We successfully implemented the parallelization of the scaling of the Legendre polynomials by both Schmidt Normalization coefficients, and a set of weighting coefficients; however, the expected speedup was not realized. Specifically, the original version of Calypso 1.1 computes the Legendre transform approximately four seconds faster than the Cuda-enabled modified version. By profiling the code, we determined that the time taken to transfer the data from host memory to GPU memory does not compare to the number of computations happening within the GPU. Nevertheless, by utilizing techniques such as memory coalescing, cached memory, pinned memory, dynamic parallelism, asynchronous calls, and overlapped memory transfers with computations, the likelihood of a speedup increases. Moreover, ideally the generation of the Legendre polynomial coefficients, Schmidt Normalization Coefficients, and the set of weights should not only be parallelized but be computed on-the-fly within the GPU. The end result is that we reduce the number of memory transfers from host to GPU, increase the number of parallelized computations on the GPU, and decrease the number of serial computations on the CPU. Also, the time taken to transform physical values to spherical coefficients, (the Legendre transform), depends on the number of shells of a planetary body, and the truncation level (modes). We investigate the effectiveness of an optimized Cuda-enabled Legendre transformation, by varying state variables, in an effort to improve the performance of these types of codes for models of core dynamics.