Impact of CPU frequency on application performance on ARCHER2


On the 12 December 2022 the default CPU frequency on ARCHER2 compute nodes was set to be 2.0 GHz. Previously, the CPU frequency was unset which meant that the base frequency of the CPU, 2.25 GHz, was almost always used by jobs running on the compute nodes. In this post, I summarise some of the benchmarking work that led to this change and point to how users can access information on energy use of jobs and how this information can be used to assess the energy efficiency of their use of ARCHER2.

Introduction

The frequency of the CPU is one of the key factors that control the rate at which it can perform operations. Typically, the higher the frequency the CPU is running at, the more times it can execute (potentially multiple or parallel) operations per second and this can lead to higher performance for applications - particularly those that are critically dependent on floating point performance. However, the higher the CPU frequency, the more power it draws and so it consumes more energy per second. This means that part of the energy consumption of an HPC job is determined by the ratio between runtime (or performance) of the application and power draw (which can be controlled by setting CPU frequency). As the relationship of CPU frequency to application performance and CPU frequency to power draw may not follow the same trends (e.g. one could be the linear and the other quadratic) it may be possible to reduce the power draw by a certain amount while reducing the performance by a smaller amount and this could lead to an overall energy consumption saving for a particular job (even though the job itself may run for longer than at a higher frequency).

Why are energy consumption and power draw important on ARCHER2? There are a number of benefits and potential benefits to reducing the energy consumption and power draw of the service. In terms of energy consumption, the primary benefit is that it reduces the overall cost of the ARCHER2 service - using less energy while maintaining performance leads to reducing the energy bills for running ARCHER2 while delivering a similar level of research output. A secondary benefit is that a reduction in energy consumption for ARCHER2 potentially frees up some 100% renewable energy capacity - ARCHER2 is powered by 100% certified renewable energy - that can then be sold to other organisations increasing the coverage of green energy in the UK. In terms of power draw, a reduction in power draw of ARCHER2 makes the service more friendly to the UK National Grid, allowing the additional power capacity to be redistributed for other uses - a consideration that is particularly important this Winter where there are concerns that there may be periods where the UK National Grid capacity will be under severe pressure. Reducing the power draw to the processors also has the potential to reduce the cooling overheads for the service as a whole which both reduces energy use by the service further and makes the service more resilient to high air temperatures such as the extreme heat events we saw during Summer 2022.

Approach

To estimate the potential impact on performance and energy use of changes to the ARCHER2 CPU frequency, the CSE team ran a number of application benchmarks with varying CPU frequencies and measured the performance (as reported by the applications) and the total compute node energy use as recorded by Slurm in the ConsumedEnergyRaw job property (the energy use data in Slurm comes from the pm_counters plugin). If you wish to look at the energy use for your jobs on ARCHER2, you should take a look at the Energy use section of the ARCHER2 User Guide. In fact, we recommend that all users examine the variation of energy use with CPU frequency for their jobs so they can choose the right setting for their applications.

The processors on ARCHER2 support three frequency settings: 1.5 GHz, 2.0 GHz and 2.25 GHz. We used the --cpu-freq Slurm option to control the processor frequencies as described in the ARCHER2 User Guide.

Once we had the raw measurements for each of application benchmarks, we computed the mean performance and energy consumption at 2.25 GHz and then calculated the relative performance and energy consumption for each benchmark run so we could plot relative performance against relative energy consumption for each of the CPU frequency settings.

Application benchmarks

We ran four different application benchmarks:

All benchmarks were run as pure MPI using 128 MPI processes per node. We ran using both 1 node (128 MPI processes) and 4 nodes (512 MPI processes).

Results

The plots below show the results of the single node benchmark runs for each of the applications - each plot shows relative performance against relative total energy use. The performance and energy are relative to the mean values at 2.25 GHz.

Relative performance against relative total energy use for the VASP TiO2 benchmark Relative performance against relative total energy use for the VASP TiO2 benchmark

Relative performance against relative total energy use for the OpenSBLI TGV512ss benchmark Relative performance against relative total energy use for the OpenSBLI TGV512ss benchmark

Relative performance against relative total energy use for the GROMACS 1400k benchmark Relative performance against relative total energy use for the GROMACS 1400k benchmark

Relative performance against relative total energy use for the CASTEP Al3x3 benchmark Relative performance against relative total energy use for the CASTEP Al3x3 benchmark

From the plots we can see that, for all of the benchmarks, the energy consumed reduces as we move from the default, highest (2.25 GHz) CPU frequency to 2.0 GHz. A further reduction to 1.5 GHz leads to an increase in energy consumption from 2.0 GHz. It is clear that for these benchmarks, running at 2.0 GHz minimises the energy consumption. Although reducing the overall energy consumption is generally a good thing for the ARCHER2 service, if it comes at the cost of reducing the performance by a similar amount or more, it is a false economy as we will then achieve significantly less scientific output for the same period of time on the service. What we want is a large reduction in energy consumption for a small (ideally, zero) reduction in performance.

The table below summarises the benchmark results when moving from 2.25 GHz to 2.0 GHz - given by the mean relative energy use/relative performance at 2.0 GHz compared to the values at 2.25 GHz.:

Benchmark Relative Energy Use Relative Performance
VASP TiO2 -20% -1%
OpenSBLI TGV512ss -20% -5%
GROMACS 1400k -5% -15%
CASTEP Al3x3 -13% -1%

Of the four benchmarks, three of them show significant energy reductions (13-20%) for modest decreases in performance (1-5%) showing that a reduction in CPU frequency to 2.0 GHz can have a positive impact on the energy efficiency on ARCHER2. The GROMACS benchmark shows a much smaller reduction in energy use (5%) and a larger impact of performance (15%) indicating that a change to 2.0 GHz is not beneficial in terms of energy efficiency in this case. The difference for the GROMACS benchmark is likely because the performance of this benchmark (and the majority of use of GROMACS) is strongly CPU-bound and so is sensitive to running at a slower clock speed. Conversely, the other benchmarks are not as strongly compute-bound and so show good energy efficiency gains when running at a lower clock speed for little loss in performance.

These trends are maintained when we run the benchmarks in larger jobs at 4 nodes (512 MPI processes).

Conclusions and recommendations

Based on the benchmarking performed by the ARCHER2 CSE team, we are implementing two changes on ARCHER2 on Monday 12 December 2022:

  1. The flag SLURM_CPU_FREQ_REQ will be set to the value 2000000 (2.0 GHz) for all ARCHER2 users
  2. The GROMACS, LAMMPS and NAMD software modules will set SLURM_CPU_FREQ_REQ=2250000 (2.25 GHz) as these software do not show improved energy efficiency with the lower CPU frequency setting.

We also strongly recommend that users test the performance and energy consumption of their applications to check what the correct setting for the CPU frequency is best for their work. The Energy use section of the ARCHER2 User Guide describes how you can extract the energy consumption of jobs and how you can set the CPU frequency.

We will continue to investigate how CPU frequency affects the energy use and performance of applications on ARCHER2. We will also be assessing the impact of this change over the next few months and plan to look at additional ways to provide feedback to users on their energy consumption on ARCHER2.