2023

| Status | Type | Start | End | Quarter | Scope | User Impact | Reason |
|---|---|---|---|---|---|---|---|
| Planned | Full | 2023-04-14 12:00 | | Q2 2023 | ARCHER2 | ARCHER2 unavailable to users | Major software upgrade of ARCHER2. Service return expected week beginning 8 May 2023. Full details in the ARCHER2 documentation. |

2022

| Status | Type | Start | End | Quarter | Scope | User Impact | Reason |
|---|---|---|---|---|---|---|---|
| Emergency | Full | 2022-10-17 09:00 | 2022-10-17 13:15 | 2022_q4 | ARCHER2 | ARCHER2 unavailable to users | Slingshot interconnect reboot to allow the return of failed links which are causing job failures |
| Not Required | Full | 2022-03-30 | 2022-03-30 | 2022_q1 | Scheduled maintenance | Not Required | Not Required |
| Completed: RFC0093 | Partial: Login and Serial Nodes | 2022-01-26 10:00 | 2022-01-26 12:20 | 2022_q1 | ARCHER2 Login and Serial Nodes | Users will be unable to connect to ARCHER2 and no jobs will run on the serial nodes | To attach the ARCHER2 /home filesystem to a new network at the Advanced Computing Facility data centre |

2021

| Status | Type | Start | End | Quarter | Scope | User Impact | Reason |
|---|---|---|---|---|---|---|---|
| Completed | Full | 2021-10-26 09:00 | 2021-10-26 17:00 | 2021_q4 | ARCHER2 4Cabinet | Users will be unable to run jobs and the /work filesystem will not be available | Reboot of the High Speed Network (HSN). The River (support) rack was moved to a new protected power supply. The 4Cabinet filesystem move to a protected power supply will be completed next week when additional power supplies are available. |
| Completed | Partial: RDFaaS | 2021-10-21 09:00 | 2021-10-21 17:00 | 2021_q4 | RDFaaS: /epsrc and /general | Users will be unable to access their files on the RDFaaS, i.e. the /epsrc and /general filesystems | Software upgrade of the E1000 system which hosts the RDFaaS |
| Completed | Partial: RDFaaS | 2021-10-18 09:00 | 2021-10-18 17:00 | 2021_q4 | RDFaaS: /epsrc and /general | Users will be unable to access their files on the RDFaaS, i.e. the /epsrc and /general filesystems | Upgrade and reconfiguration of high speed switches |
| Completed | Partial: Compute Nodes | 2021-10-01 09:52 | 2021-10-01 15:44 | 2021_q4 | ARCHER2 4Cabinet: Compute Nodes | Users will be able to connect to User Access Nodes and will be able to submit jobs to the compute nodes. The queued jobs will start once the compute nodes are returned to service. | A power issue at a substation local to the Advanced Computing Facility (ACF) |
| Completed | Partial: Compute Nodes | 2021-09-30 08:30 | 2021-09-30 11:30 | 2021_q3 | ARCHER2 4Cabinet: Compute Nodes | Users will be able to connect to User Access Nodes and will be able to submit jobs to the compute nodes. The queued jobs will start once the compute nodes are returned to service. | A switch within the 4Cabinet service requires a reboot |
| Completed | Full (took place within unplanned outage) | 2021-09-15 09:00 | 2021-09-15 16:00 | 2021_q3 | ARCHER2 4Cabinet: User Access and Compute Nodes | Users will not be able to connect. Jobs can be queued and will start once the service returns. | Apply a fix for a Singularity issue |
| Completed | Unplanned Full | 2021-09-14 11:00 | 2021-09-15 16:00 | 2021_q3 | ARCHER2 4Cabinet: User Access and Compute Nodes | Users will not be able to connect | Power issues within the Edinburgh area |
| Completed | At-risk | 2021-09-07 | 2021-09-09 | 2021_q3 | ARCHER2 4Cabinet: User Access and Compute Nodes | Momentary interruptions to connections to UANs | Allow our HPC Systems team to move the ARCHER2 4Cabinet system to a new network at the Advanced Computing Facility (ACF) |
| Completed | Full | 2021-08-25 14:00 | 2021-08-26 11:15 | 2021_q3 | ARCHER2 4Cabinet: User Access and Compute Nodes | No access to the UANs and the queues will start to drain on the compute nodes from Monday 23rd August at 1400 | Allow HPE Systems team to apply an essential security patch to the ARCHER2 4Cabinet system |
| Completed | At-risk | 2021-08-25 | 2021-08-25 | 2021_q3 | ARCHER2 4Cabinet: User Access and Compute Nodes | Momentary loss of external network traffic to the User Access Nodes (UAN) on ARCHER2 | Allow our HPC Systems team to move the ARCHER2 4Cabinet system to a new network at the Advanced Computing Facility (ACF) |
| Completed | At-risk | 2021-08-18 10:00 | 2021-08-18 15:00 | 2021_q3 | ARCHER2 4Cabinet: User Access and Compute Nodes | Connection to the User Access Nodes (UAN) on ARCHER2 may be lost. File transfers may be affected. Jobs running on the compute nodes will not be impacted. | Allow our HPC Systems team to move the ARCHER2 4Cabinet system to a new network at the Advanced Computing Facility (ACF) |
| Completed | Unplanned | 2021-07-28 12:00 | 2021-07-28 15:00 | 2021_q3 | ARCHER2 4Cabinet: Compute and file system | Running jobs failed, no jobs able to start, filesystem unavailable | Power issue at ACF |
| Completed | Unplanned | 2021-07-20 | 2021-07-22 | 2021_q3 | ARCHER2 4Cabinet: User Access and Compute Nodes | Prevented new jobs from starting on the system to reduce the impact on users. Some running jobs may also have crashed as a result of this issue, but any currently running should be unaffected. | Issue with the interconnect on the ARCHER2 service that causes some new jobs to fail on MPI initialisation |
| Completed | Full | 2021-06-29 06:00 | 2021-06-29 23:00 | 2021_q2 | ARCHER2 4Cabinet: Full system | System unavailable to users | Essential power work at the ACF |
| Completed | Full | 2021-04-28 09:00 | 2021-04-28 11:00 | 2021_q2 | ARCHER2 4Cabinet: Full system | Users were able to access data and the User Access Nodes (UANs) throughout the maintenance session. Installing this PE required a reboot of the compute nodes. | A new version of the HPE Cray Programming Environment was installed to address memory leaks that were affecting a significant number of users and to help users prepare for the main ARCHER2 system. |
| Completed | Full | 2021-03-18 09:00 | 2021-03-18 12:30 | 2021_q1 | ARCHER2 4Cabinet | System unavailable to users | High Speed Network (HSN) rebooted to allow the return of failed links which were causing job failures |
| Completed | Full | 2021-02-07 21:15 | 2021-02-18 14:00 | 2021_q1 | ARCHER2 4Cabinet | System unavailable to users | Updated the system to software v1.3.3; this included patches for the critical ‘sudo’ vulnerability |
| Completed | Unplanned | 2021-02-07 21:15 | 2021-02-09 09:30 | 2021_q1 | ARCHER2 4Cabinet | System unavailable to users | Power outage affecting SE Scotland |
| Completed | Full | 2021-02-04 08:00 | 2021-02-04 14:00 | 2021_q1 | ARCHER2 4Cabinet | System unavailable to users | High Speed Network (HSN) rebooted to allow the return of failed links which were causing job failures |