Ten Tips for using HPC
By Clair Barrass (EPCC) and Andy Turner (EPCC) on November 28, 2023
Working with HPC can be exciting, innovative and rewarding. But it can also be complex, frustrating and challenging.
Andy Turner is EPCC’s CSE Architect. He works with RSEs, the UK HPC-SIG, DiRAC, The Carpentries and various user communities to share and improve HPC expertise.
We asked Andy for his Top 10 Tips for using HPC, to help make your HPC experience smooth and successful.
1. Don’t struggle alone
Researchers are good at solving problems… but this can lead to a tendency not to ask for help soon enough.
Where can you find support?
Service Desk
Most HPC services have a service desk staffed by experts who are there to help you.
If something is not working as you expect, contact them.
Remember to include:
- A description of the problem and/or the error messages you are seeing
- The job ID and submission script if it is a problem with the job
Research Software Engineers
There are many RSE groups around the UK that can provide advice and support.
They often have a broad range of experience and can help with general advice as well as specific support.
2. Learn an in-terminal text editor
Even if you do not use one for the majority of your job setup, knowing an in-terminal editor will stand you in good stead.
Options:
- vim: can be tricky to learn but very powerful and available everywhere (a survival kit is sketched below)
- emacs: more intuitive to use than vim and almost always available
- nano: simple to use but less powerful, not always available
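If vim is new to you, a handful of commands is enough to get by for day-to-day editing. A minimal survival kit (the file name is just an example):

```bash
# Open (or create) a file - enough vim to get by on any HPC system
vim submit.slurm
#   i       enter insert mode and type as normal
#   Esc     return to command mode
#   :w      save (write) the file
#   :q      quit; :wq saves and quits; :q! quits without saving
#   /text   search for "text"; n jumps to the next match
#   dd      delete the current line; u undoes the last change
```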
Some editors also support editing remote files over SSH, e.g. Visual Studio Code.
See also “The Editor War”
3. Submit test jobs before scaling up
It is easy to waste a lot of resources when jobs do not behave as they should, not to mention the frustration of having to wait in the queue for days, only for the job to fail after 10s due to a typo!
Most HPC systems have a short/test queue that you can use for testing, with fast turnaround.
This is particularly important for more complex job scripts
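As an illustration, a short test job on a Slurm-based system might look like the sketch below. The application, partition and QoS names are placeholders; check your service’s documentation for the real ones:

```bash
#!/bin/bash
#SBATCH --job-name=test-run
#SBATCH --nodes=1
#SBATCH --time=00:10:00        # short wallclock limit for fast turnaround
#SBATCH --partition=standard   # placeholder: your system's test partition
#SBATCH --qos=short            # placeholder: your system's short/test QoS

# Run a cut-down version of the real calculation to check the script works
srun ./my_app --input test_input.dat
```

Once the test job behaves as expected, scale up the node count and wallclock time for production runs.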
Other scaling-up issues you may hit:
- I/O model becomes a problem: one CP2K job writing data works fine; 5,000 CP2K jobs writing data becomes a problem
- Organising and managing results and data becomes much more challenging:
  - What if 1 run out of 5,000 fails? How do you identify it and understand what went wrong? (see the sketch after this list)
  - How do you aggregate/analyse the data?
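One hedged sketch of spotting failures among thousands of runs, assuming each run writes output to its own directory and prints a known completion marker (both the `run_*` layout and the marker are assumptions; adapt them to your code’s output):

```bash
# Flag runs whose output lacks the expected completion marker
# ("run_*" directories and the "JOB DONE" marker are assumptions)
for dir in run_*; do
    if ! grep -q "JOB DONE" "${dir}/output.log" 2>/dev/null; then
        echo "Possible failure: ${dir}"
    fi
done
```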
4. Run some basic benchmarking
It is always tempting to just get on with production runs… but it is often useful to run some tests before you commit.
Benchmarking:
- Optimises your use of your allocated resources
- Can also be useful to evaluate the most energy efficient approach if your system supports reporting on energy use
- Worth looking around to see if any advice already exists for the type of work you are doing
- Increases in importance if the architecture is different from things you have used previously
- Run some simple tests with a restricted number of timesteps/iterations/SCF cycles
- For example, double the node count, then halve it, and compare (see the sketch after this list)
- Be aware of calculation/model input parameters that can change performance, and test some values of these too:
  - e.g. VASP: NCORE, KPAR; CASTEP: num_proc_in_cpu
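A minimal sketch of such a scaling test on a Slurm system (the script name, node counts and accounting fields are illustrative):

```bash
# Submit the same short benchmark at a range of node counts
for nodes in 1 2 4 8; do
    sbatch --nodes="${nodes}" --job-name="bench-${nodes}n" bench.slurm
done

# Afterwards, compare elapsed times (and energy use, if your system
# reports it) to pick the node count that uses your allocation best
sacct --format=JobName,NNodes,Elapsed
```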
5. Have a data management plan
- Understand your data:
  - Which is critical and needs to be backed up in multiple places?
  - Which can be kept on the HPC system and which needs to be transferred off?
  - Which can be regenerated easily?
- Do you need to archive files before transferring them?
- Does compression help or hinder?
  - Compression can be CPU-intensive, so may not make sense for your data
- How long will it take to transfer data?
  - Transfers can become CPU-bound due to encryption; a significant speedup is often available by using a faster encryption algorithm than the default (e.g. `scp -c aes128-ctr …`) - see the sketch below
There is a data management guide for ARCHER2 users.
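As a sketch of the archive-then-transfer pattern with a faster cipher (the host name and paths are placeholders):

```bash
# Bundle many small files into one archive first: a few large transfers
# are usually much faster than thousands of small ones
tar -czf results.tar.gz results/

# Time the transfer, requesting a faster cipher than the default
time scp -c aes128-ctr results.tar.gz user@my-host:/path/to/store/
```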
6. Read the docs…
Most HPC facilities come with good documentation, and it is worth browsing through it: not to memorise anything, but to understand what is there to look at in the future.
Service desks will often ask you to try the advice in the documentation before they investigate further.
Documentation that the community can contribute to is becoming more common, e.g. the ARCHER2 documentation.
If you have looked through the docs and failed to find the topic you are looking for, let the service know (e.g. by raising a GitHub issue) so they can add it, or make it easier to find.
7. Take advantage of training courses
There is lots of HPC, software and data analysis training material available:
- ARCHER2
- The Carpentries
- HPC Carpentry
- CodeRefinery
- Local training at your institution
ARCHER2 Training
- Freely available to all academics
- Wide range of courses at different levels
- Range of delivery methods:
  - In person
  - Online taught
  - Online self-service
- All previous course materials freely available online (with recordings) for self-study
8. Learn some more Linux command line
The modern Linux command line has lots of features that can be invaluable in streamlining research and data analysis:
- Watch the webinar on Modern bash features
- Other useful command line tools:
  - Combining bash and Python can lead to very powerful capabilities, as shown below
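As an illustration of the combination, shell tools can collect numbers from many job logs and a small embedded Python snippet can summarise them. The log layout and the “Elapsed time” label are assumptions about your output format:

```bash
# Pull the final field of every "Elapsed time" line, then summarise
grep -h "Elapsed time" run_*/output.log | awk '{print $NF}' | \
python3 -c '
import sys
times = [float(line) for line in sys.stdin]
print(f"runs: {len(times)}, mean: {sum(times)/len(times):.2f}")
'
```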
9. Build on top of existing libraries/tools
Researchers are clever and creative and like building new things… but this is not usually the best approach when writing programs and scripts.
- It is almost always better to build on top of existing libraries and tools than to develop your own
- If you are unsure what is already available, it is worth speaking to your local RSE group
10. Learn Pandas or R
Manipulating data is key to almost all research.
- Pandas (Python) and/or R allow you to quickly perform sophisticated analysis, plotting and transformations
- (and Pandas workloads can often be parallelised using Dask)
- Scripted analyses are easier to share and improve than spreadsheets
- Pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language
- R is a free software environment for statistical computing and graphics
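As a small, hypothetical example of the kind of quick analysis Pandas makes easy (run here from the shell; the CSV file and its columns are invented for illustration):

```bash
python3 - <<'EOF'
# Aggregate benchmark results with Pandas
# ("results.csv" with "nodes" and "runtime" columns is a made-up example)
import pandas as pd

df = pd.read_csv("results.csv")
summary = df.groupby("nodes")["runtime"].agg(["mean", "min", "max"])
print(summary)
summary.to_csv("summary.csv")  # a script like this is easy to share and rerun
EOF
```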