Optimizing BOUT++ MPI+OpenMP hybrid performance by refactoring compute kernels



For 70 years, humans have been pursuing nuclear fusion as a clean, safe and near limitless source of energy production. In the 2020s, we are now entering a delivery phase, with the final generation of experimental machines becoming operational, and planning for demonstration power plants well underway.

Nuclear fusion reactors – called “tokamaks” – are significant engineering endeavours that are only possible through major financial investment, and are often national programmes (like STEP in the UK) or international collaborations (like the global collaboration Iter). Tokamaks require the support of a sophisticated ecosystem of software to design, operate and maintain. As tokamaks grow in sophistication (and size!) the computing resources required by this software also grow.

BOUT++ is a software framework for simulating the plasma inside a tokamak. It is designed to be both performant and easy-to-use, and as such, it has developed an international user base in the UK, US, Europe and Asia. BOUT++ is also flexible and is able to run across a range of computers, from laptops to national computing facilities like ARCHER2. But to run on the world’s largest supercomputers, it is necessary for BOUT++ to further exploit parallelism – the ability to solve many interlinked components of a problem concurrently.

One feature that makes BOUT++ attractive to users is its “Domain Specific Language” (DSL) – users can write code that is very similar to the equations being solved, so that for example, df/dt becomes ddt(f). While this makes for a very natural user experience, it limits the available parallelism, as each term in a user’s physics model is a separate loop over the simulation domain. The reduction in simulation time gained from performing (relatively small) tasks in parallel is lost to overheads in repeated opening and closing parallel regions.

In this project, we have amended the DSL to allow the parallel regions to be fused together, reducing the overheads from opening parallel regions and realising greater speed-ups. We did this by moving all the physics model terms into a single large loop over the simulation domain. This necessitated a change to the DSL, where we pass the position in the simulation domain i to the function (e.g. ddt(f,i)). Thus for only a small change to the interface, users achieve a large performance gain.

Although the motivation for this work was to increase the parallelism, this change also improves data access, the pattern in which data is loaded onto compute processors for manipulation. This improves code performance regardless of the degree of parallelism used – even for simulations run in serial. This refactoring also puts the source code in a form that enables a port to graphical processing units (GPUs) or hardware-agnostic programming paradigms (code that can run on CPUs or GPUs). This prepares BOUT++ to run on the upcoming generation of Exascale computers, which are likely to be heterogeneous both across the range of available machines and within any given device.

By allowing greater use of parallelism, this work allows BOUT++ users to produce higher-fidelity simulations of tokamaks, increasing our understanding of critical plasma physics processes. This also allows BOUT++ to perform currently tractable simulations in a shorter time, enabling larger ensembles of simulations to be run. This is crucial for enabling faster parameter scans and performing uncertainty quantification (UQ), a technique that is becoming increasingly important for assessing the robustness of physics results, and ensuring BOUT++ can be included within engineering and design workflows.

Information about the code

The BOUT++ repository can be cloned from GitHub, and the example cases built by following the README.

Other physics modules are also available on GitHub:

Bout++ Project website

The code has been compiled using the following modules:

  • PrgEnv-gnu
  • cray-fftw
  • petsc
  • cray-netcdf-hdf5parallel

Also, to use the Python functionality the following modules are need:

  • cray-python
  • python-netCDF4
  • python-plotting

OpenMP (both in Bout++ and the pvode solver) can be enabled at the configure stage.

Technical Report

Download as PDF