Disaster recovery testing to improve resilience


Keeping the ARCHER2 National HPC Service running around the clock requires specialised staff covering everything from the Service Desk and science support through to hardware maintenance and data-centre hosting, alongside third-party suppliers for power, networking, accommodation, and so on.

Coordinating these elements is a complex task even in normal times, but when faced with adverse external events such as Storm Arwen or the Covid-19 pandemic, things could all too easily go wrong.

Because of this, EPCC has business continuity and disaster recovery (BCDR) plans to minimise the impact of external issues and help keep ARCHER2 running smoothly. Given that major incidents are (hopefully) rare, it is good to ensure the ARCHER2 team is confident with the plans and, to this end, we have a dedicated BCDR testing team.

A highlight of the team’s calendar is a large-scale, major-incident test. Past tests have involved simulating a fire at the ARCHER2 headquarters and an outbreak of food poisoning among the science-support staff. The details of each particular test are a closely guarded secret. Other than knowing, to within a few weeks, when a test is due, staff have no prior knowledge of what each incident scenario entails.

Of course, the test team don’t have authority to actually set fire to university premises or to poison staff, so a key part of test preparation involves contriving an implementation of the scenario that is as realistic as practically possible. This can require some creative thinking and help from outside. For the food-poisoning scenario, the team created a timeline with staff falling ill at various points. The test team then made sure to intercept each staff member at the appropriate time and usher them away from their work environment. The team also had to liaise with Admin and HR to avoid widespread panic as people began to sign off as sick.

Not all tests are run as full-scale scenarios. Sometimes the test team organises ‘table-top’ exercises and walks key staff through the process of dealing with a major incident. At the beginning of the Covid-19 pandemic, when most ARCHER2 staff moved to home working, the test team ran a workshop for Service Desk and CSE staff, working through what would happen in the event of a major broadband-supplier outage (something similar to what happened to the BlackBerry network back in 2011).

The way staff react to BCDR tests is very interesting. Despite the obvious differences between a real incident and a test, we have found that staff quickly become engaged with and convinced by a scenario, behaving as if an incident is really happening. In early tests (for example, the burning building) staff commented that they had genuine feelings of panic and stress. Thankfully, having completed a test, they also felt more ready for the unexpected.

Shortly after the burning-building test, the UK faced the Beast from the East, when the country ground to a halt in the face of heavy and persistent snowfall. Many ARCHER staff found themselves unable to travel into work and the ACF (where ARCHER is hosted) became inaccessible to normal vehicles. The ARCHER team quickly moved to distributed, online working. Experiences and lessons learned from the burning-building scenario meant people were better prepared, with home-working setups and awareness of the kinds of incident responses that were needed to keep ARCHER running.

Of course, it is critical that a test scenario does not grow into a real incident. The test team put in place contingencies to ensure this doesn’t happen, providing cover for key service elements and alerting key stakeholders when tests are underway. The most recent test, which ran in October, focused on data-centre resilience at the ACF. This test ran in a table-top format, given Covid-19 restrictions, though it involved key stakeholders from the ARCHER2 accommodation service, Edinburgh University estates, and security staff, playing out a computer-fire scenario in pseudo-real-time.

During an actual test, some of the BCDR Test Team take on the vital, if arduous, role of observers, with responsibility for documenting the progress of the test and collecting live input from those involved. This can be done with notes, voice recordings, pictures, computer logs, etc., and may involve pre-planned intervention points. For example, during the food-poisoning scenario, the test team made calls to the Service Desk at several points to check that everything was running. The data that the observers collect during the day is vital to help everyone reflect on the outcomes and to draw out important lessons and improvement actions.

It is very important to close down a test carefully, ensuring all staff know that the test is over and that normal ARCHER2 service levels have resumed. It is also important to debrief staff and give them a chance to talk through their feelings. Scenarios can feel incredibly real when you get engrossed in them, and they tend to take staff out of their comfort zone. It is just as important that the test team look after the health of the ARCHER2 staff as that of the ARCHER2 service.

While business continuity may not seem the most exciting part of running a National HPC Service, organising and running a major-incident test is lots of fun and hopefully means ARCHER2 will continue to run whatever the winter weather may throw at us.