We report our efforts on refactoring and optimizing the Community Atmosphere Model (CAM) on the new Sunway supercomputer, which uses a many-core processor that consists of management processing elements (MPEs) and clusters of computing processing elements (CPEs). To map the large code base of CAM to millions of cores on the Sunway system, we use OpenACC-based refactoring as our major tool, and apply source-to-source translator tools to generate the most suitable parallelism for the CPE cluster and to fit the intermediate variables into the limited on-chip fast buffer. For single kernels, comparing the original ported version that uses only MPEs with the refactored version that uses both the MPE and the CPE clusters, we achieve up to 22x speedup for the compute-intensive kernels. For the 25 km resolution CAM global model, we scale to 24,000 MPEs and 1,536,000 CPEs, and achieve a simulation speed of 2.81 model years per day.
The speedup of the major kernels in CAM-SE that we port onto the CPE clusters. The speedup compares the performance of each kernel running on 1 MPE and 64 CPEs against the performance of the same kernel running on 1 MPE only.
The simulation speed of the CAM model, measured in model years per day (MYPD), on the new Sunway supercomputer, with the number of core groups (CGs, each consisting of one MPE and its cluster of 64 CPEs) increasing from 1,024 to 24,000.
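The core counts and throughput figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 64 CPEs per MPE is taken from the kernel speedup comparison, and MYPD is assumed to mean simulated model years per wall-clock day.

```python
# Illustrative arithmetic only; names are hypothetical, not from the CAM code base.

CPES_PER_MPE = 64  # one MPE drives a cluster of 64 CPEs (as in the kernel comparison)


def total_cpes(num_mpes: int) -> int:
    """Number of CPEs available when every MPE has its own 64-CPE cluster."""
    return num_mpes * CPES_PER_MPE


def wall_hours_per_model_year(mypd: float) -> float:
    """Wall-clock hours needed to simulate one model year at a given MYPD.

    Assumes MYPD = simulated model years per 24-hour wall-clock day.
    """
    return 24.0 / mypd


if __name__ == "__main__":
    print(total_cpes(24_000))                 # CPE count at the largest reported scale
    print(wall_hours_per_model_year(2.81))    # hours per model year at the reported speed
```

With 24,000 MPEs this reproduces the 1,536,000 CPEs stated in the text, and a rate of 2.81 MYPD corresponds to roughly eight and a half wall-clock hours per simulated year.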