BCDR Test 2023


What happens once every two years, no one quite knows when, though there is a lot of speculation, and no one knows what form it will take? The EPCC ARCHER2 business continuity and disaster recovery (BCDR) test.

This tests how well the ARCHER2 team recognises a major incident scenario and responds to it, and shows what lessons can be learned from the process. It is also a lot of fun for at least some of those involved – cue evil cackles from the team who devise and run the whole thing!

In previous years, tests have included fires in buildings (both the office building and the datacentre), outbreaks of food poisoning after a staff party, and the failure of a major broadband provider whilst we were working from home during COVID. We have learned valuable lessons from every test. They have allowed us to prepare for real major incidents when they hit and, more importantly, to put measures in place that help prevent such incidents from happening.

So what happened this time? The test team decided on a cybersecurity scenario and chose a leak of personal user data onto the internet. This seemed timely given the recent cybersecurity and ransomware incidents at other institutions. The first hint to the wider ARCHER2 team that this was BCDR test day was an email to the service desk (with "BCDR Test" in the title) from an imaginary user saying that they had found their personal data on the internet.


A major incident meeting was called and the management team started working through the major incident process. The test team dropped in further injection points as the day went on: contact from a journalist wanting a statement for a national newspaper, a ransomware demand from a disgruntled former employee, and the like. We also involved the University Security and Information Services teams in the exercise to see how our planned actions would fit with the processes and policies of the wider University in such a scenario. UKRI, as funders of the ARCHER2 service, were also involved to look at how such a scenario might impact them. Throughout the exercise the test team made sure that the ARCHER2 service kept running and was unaffected in real life. Once the test was complete we carried out a lessons-learned exercise to look at how well we handled the test and what improvements could be made.

So what did we learn? On the less serious side, we learned that to keep the management team happy, a lunch break should have been scheduled during the exercise or refreshments provided! There was a technical element to the scenario, but much of the work it generated was around communications and interaction with the wider University. The exercise gave the wider University greater visibility of our processes and the services we run. We identified improvement actions, both technical and procedural, and further training that is required. We hope that we never run into such a scenario in real life, but once the improvement actions are complete it will be less likely to happen to the ARCHER2 service, and we will be a little better equipped to deal with it and minimise user disruption should it happen.