Parallelism is central to modern scientific computing: it offers the potential both to increase problem size and to reduce time to solution. At the level of the application, parallelism is typically expressed as a mixture of (at least) two levels: MPI for distributed memory, and a shared memory model at the node level, be it OpenMP for host parallelism or some GPU model for device parallelism. Hence the term "hybrid programming".

This presentation will focus on a series of practical problems arising in the development of Ludwig, a lattice Boltzmann code aimed at complex fluids. First, I will give an overview of the abstraction adopted at the shared memory level to cope with the advent of GPUs. Second, a general measure of performance will be discussed in the context of ARCHER2 to show that the hybrid approach is effective at the socket level. Third, the interaction between the two levels of parallelism at the point of the nearest-neighbour halo exchange will be examined, and a number of strategies considered, including the use of tasks. Fourth, the task picture will be extended to the use of graph execution in the GPU context. Fifth, results will be presented on performance, including I/O, in the high-scaling regime on ARCHER2.

The talk will illustrate approaches that have been found useful, but will also describe misadventures which led to relatively poor performance. The talk is aimed primarily at application programmers.

This online session is open to all. It will use the Blackboard Collaborate platform.

eCSE project eCSE01-26
