This course covers techniques for improving the performance of parallel applications by optimising the code that runs within each node.
Modern HPC systems such as ARCHER2 are being constructed using increasingly powerful nodes, with ever larger numbers of cores and enhanced vector capabilities. To extract maximum performance from applications, it is therefore necessary to understand, and be able to overcome, on-node performance bottlenecks. This course will cover the main features of modern HPC nodes, including multiple cores, vector floating-point units, deep cache hierarchies, and NUMA memory systems. We will cover techniques for programming these features efficiently, using batch processing options and compiler options as well as hand-tuning of code. The course will also include an introduction to the use of Cray performance analysis tools.
Prerequisites:
Participants must have attended ARCHER2 for Software Developers or be familiar with software development on ARCHER, ARCHER2, or any other HPC facility, using C, C++ or Fortran.
This course is targeted at users interested in optimising the performance of their own applications, e.g. through compiler options or code changes.
Users interested in efficient use of centrally installed packages should consider attending Understanding Package Performance instead.
Requirements:
Participants must bring a laptop running macOS, Linux, or Windows (not a tablet, Chromebook, etc.) on which they have administrative privileges.
They are also required to abide by the ARCHER2 Code of Conduct.
Timetable:
- Day 1: 23rd November, 09.30 – 17.30
  - 09.30 – 09.45 Introduction
  - 09.45 – 10.30 Node Architecture
  - 10.30 – 11.00 Practical – memory performance
  - 11.00 – 11.30 Break
  - 11.30 – 12.30 Profiling
  - 12.30 – 13.00 Practical – profiling
  - 13.00 – 14.00 Break
  - 14.00 – 15.00 Optimising with the compiler
  - 15.00 – 15.30 Break
  - 15.30 – 17.00 Practical – profiling and optimisation
  - 17.00 – 17.10 Summary
  - 17.10 – 17.30 Practical – profiling and optimisation
- Day 2: 24th November, 09.30 – 16.00
  - 09.30 – 11.00 OpenMP optimisation
  - 11.00 – 11.30 Break
  - 11.30 – 12.30 Practical – OpenMP optimisation
  - 12.30 – 13.30 Break
  - 13.30 – 15.00 Vectorisation, Memory Hierarchy Optimisation
  - 15.00 – 15.30 Break
  - 15.30 – 16.00 Practical – memory and cache blocking