2024

Status Type Start End Quarter Scope User Impact Reason
Completed Full 2024-05-08 09:00 2024-05-08 21:00 Full ARCHER2 System Users will not be able to connect to the login nodes, jobs will not run and users will be unable to access data during this maintenance Replacement of operating system certificates
Completed Slurm 2024-05-01 09:00 2024-05-01 10:35 Slurm maintenance Running jobs will continue to run, but Slurm commands will be unavailable for a few minutes when the controller restarts. Required maintenance
Completed Partial 2024-03-07 09:00 2024-03-07 12:30 RDFaaS /epsrc and /general file systems Users will not be able to access data on /epsrc and /general during this maintenance Replacement of Power Supply Unit (PSU) on the RDFaaS (E1000)
Completed Partial 2024-02-07 09:00 2024-02-12 14:30 RDFaaS /epsrc and /general file systems Users will not be able to access data on /epsrc and /general during this maintenance Updating the software on the RDFaaS (E1000)
Completed Partial 2024-01-09 12:00 GMT 2024-01-18 12:00 GMT Q1 2024 ARCHER2 Users will be able to connect to ARCHER2 and access their data. Jobs will run but there will be several periods when users will be unable to submit jobs and new user jobs will not start. If you experience issues, please wait a few minutes and then try to submit the job again. Integrating the GPU nodes into ARCHER2
Completed Full 2024-01-08 09:00 2024-01-09 12:00 Q1 2024 ARCHER2 Users will be unable to connect to ARCHER2 and will not have access to data on ARCHER2. Existing queued jobs will be able to run from 20:00 GMT on 8 Jan 2024. Integrating the GPU nodes into ARCHER2

2023

Status Type Start End Quarter Scope User Impact Reason
Completed Partial 18 September 2023 09:00 22 September 2023 11:55 Q3 2023 ARCHER2 No login access. No access to any data on the system. Jobs will continue to run, and queued jobs will be started as usual. Serial QoS will not be available. The SAFE will be available during the outage, but with reduced functionality: features that need the connection to ARCHER2, such as password resets and new account creation, will not work. Upgrade of network
Completed Partial 23 August 2023 10:00 23 August 2023 10:50 Q3 2023 ARCHER2 ARCHER2 users unable to submit new jobs For a few minutes, users will be unable to submit jobs whilst a Slurm configuration change is rolled out. This change will provide increased resilience and issue monitoring.
Completed Full 19 May 2023 14:00 12 June 2023 12:00 Q2 2023 ARCHER2 ARCHER2 unavailable to users Major software upgrade of ARCHER2. Full details in the ARCHER2 documentation

2022

Status Type Start End Quarter Scope User Impact Reason
Emergency Full 2022-10-17 09:00 2022-10-17 13:15 2022_q4 ARCHER2 ARCHER2 unavailable to users Slingshot interconnect reboot to allow the return of failed links which are causing job failures
Not Required Full 2022-03-30 2022-03-30 2022_q1 Scheduled maintenance Not Required Not Required
Completed: RFC0093 Partial: Login and Serial Nodes 2022-01-26 10:00 2022-01-26 12:20 2022_q1 ARCHER2 Login and Serial Nodes Users will be unable to connect to ARCHER2 and no jobs will run on the serial nodes To attach the ARCHER2 /home filesystem to a new network at the Advanced Computing Facility data centre

2021

Status Type Start End Quarter Scope User Impact Reason
Completed Full 2021-10-26 09:00 2021-10-26 17:00 2021_q4 ARCHER2 4Cabinet Users will be unable to run jobs and the /work filesystem will not be available Reboot of the High Speed Network (HSN). The River (support) rack was moved to a new protected power supply. The 4cabinet filesystem move to a protected power supply will be completed next week when additional power supplies are available.
Completed Partial: RDFaaS 2021-10-21 09:00 2021-10-21 17:00 2021_q4 RDFaaS: /epsrc and /general Users will be unable to access their files on the RDFaaS i.e. the /epsrc and /general filesystems. Software upgrade of E1000 System which hosts the RDFaaS
Completed Partial: RDFaaS 2021-10-18 09:00 2021-10-18 17:00 2021_q4 RDFaaS: /epsrc and /general Users will be unable to access their files on the RDFaaS i.e. the /epsrc and /general filesystems. Upgrade and reconfiguration of high speed switches
Completed Partial: Compute Nodes 2021-10-01 09:52 2021-10-01 15:44 2021_q4 ARCHER2 4Cabinet: Compute Nodes Users will be able to connect to User Access Nodes and will be able to submit jobs to the compute nodes. The queued jobs will start once the compute nodes are returned to service. A power issue at a substation local to the Advanced Computing Facility (ACF).
Completed Partial: Compute Nodes 2021-09-30 08:30 2021-09-30 11:30 2021_q3 ARCHER2 4Cabinet: Compute Nodes Users will be able to connect to User Access Nodes and will be able to submit jobs to the compute nodes. The queued jobs will start once the compute nodes are returned to service. A switch within the 4 Cabinet Service requires a reboot
Completed Full (took place within unplanned outage) 2021-09-15 09:00 2021-09-15 16:00 2021_q3 ARCHER2 4Cabinet: User Access and Compute Nodes Users will not be able to connect. Jobs can be queued and will start once the service returns Apply a fix for Singularity Issue.
Completed Unplanned Full 2021-09-14 11:00 2021-09-15 16:00 2021_q3 ARCHER2 4Cabinet: User Access and Compute Nodes Users will not be able to connect. Power Issues within the Edinburgh area
Completed At-risk 2021-09-07 2021-09-09 2021_q3 ARCHER2 4Cabinet: User Access and Compute Nodes Momentary interruptions to connections to UANs Allow our HPC Systems team to move the ARCHER2 4 cabinet system to a new Network at the Advanced Computing Facility (ACF)
Completed Full 2021-08-25 14:00 2021-08-26 11:15 2021_q3 ARCHER2 4Cabinet: User Access and Compute Nodes No access to the UANs and the queues will start to drain on the compute nodes from Monday 23rd August at 14:00 Allow HPE Systems team to apply an essential security patch to the ARCHER2 4 cabinet system
Completed At-risk 2021-08-25 2021-08-25 2021_q3 ARCHER2 4Cabinet: User Access and Compute Nodes Momentary loss of external network traffic to the User Access Nodes (UAN) on ARCHER2 Allow our HPC Systems team to move the ARCHER2 4 cabinet system to a new Network at the Advanced Computing Facility (ACF)
Completed At-risk 2021-08-18 10:00 2021-08-18 15:00 2021_q3 ARCHER2 4Cabinet: User Access and Compute Nodes Connection to the User Access Nodes (UAN) on ARCHER2 may be lost. File transfers may be affected. Jobs running on the compute nodes will not be impacted Allow our HPC Systems team to move the ARCHER2 4 cabinet system to a new Network at the Advanced Computing Facility (ACF)
Completed Unplanned 2021-07-28 12:00 2021-07-28 15:00 2021_q3 ARCHER2 4Cabinet: Compute and file system Running jobs failed, no jobs able to start, filesystem unavailable Power issue at ACF
Completed Unplanned 2021-07-20 2021-07-22 2021_q3 ARCHER2 4Cabinet: User Access and Compute Nodes Prevented new jobs from starting on the system to reduce the impact on users. Some running jobs may also have crashed as a result of this issue but any currently running should be unaffected Issue with the interconnect on the ARCHER2 service that causes some new jobs to fail on MPI initialisation
Completed Full 2021-06-29 06:00 2021-06-29 23:00 2021_q2 ARCHER2 4Cabinet: Full system System unavailable to users Essential power work at the ACF
Completed Full 2021-04-28 09:00 2021-04-28 11:00 2021_q2 ARCHER2 4Cabinet: Full system Users were able to access data and the User Access Nodes (UANs) throughout the maintenance session. Installing this PE required a reboot of the compute nodes. A new version of the HPE Cray Programming Environment was installed to address memory leaks that were affecting a significant number of users and to help users prepare for the main ARCHER2 system.
Completed Full 2021-03-18 09:00 2021-03-18 12:30 2021_q1 ARCHER2 4Cabinet System unavailable to users High Speed Network (HSN) rebooted to allow the return of failed links which were causing job failures
Completed Full 2021-02-07 21:15 2021-02-18 14:00 2021_q1 ARCHER2 4Cabinet System unavailable to users Updated the system to software v1.3.3, which included patches for the critical ‘sudo’ vulnerability
Completed Unplanned 2021-02-07 21:15 2021-02-09 09:30 2021_q1 ARCHER2 4Cabinet System unavailable to users Power outage affecting SE Scotland
Completed Full 2021-02-04 08:00 2021-02-04 14:00 2021_q1 ARCHER2 4Cabinet System unavailable to users High Speed Network (HSN) rebooted to allow the return of failed links which were causing job failures