2022

Status Type Start End Scope User Impact Reason
Resolved Service Alert 2022-08-11 12:30 2022-08-12 11:30 Compute nodes Increased queue times and reduced node availability Due to high temperatures in the Edinburgh area, some nodes were removed from service to ease the load on the cooling system. Users could still connect to ARCHER2, access data and submit jobs to the batch system. All nodes have now been returned to service.
Resolved Service Alert 2022-07-19 14:00 2022-07-19 18:30 Full service Login access not available, no jobs allowed to start, running jobs will have failed An internal DNS failure on the system stopped components from communicating
Resolved Service Alert 2022-07-18 14:00 2022-07-20 10:00 Compute nodes Increased queue times and reduced node availability Reduced number of compute nodes available as high temperatures in the Edinburgh area created cooling issues. Around 4700 compute nodes remained available and we continued to monitor. Full service was restored at 10:00 on 20th July.
Resolved Issue 2022-07-14 22:00 2022-07-15 13:50 Full ARCHER2 system No access to system A significant power outage affecting large areas of the Lothians caused ARCHER2 to shut down around 10pm on 14th July. Our team was on site and restored service as soon as possible.
Resolved At-Risk 2022-07-06 09:00 2022-07-06 16:00 /home filesystem and ARCHER2 nodes No user impact expected Testing a new configuration for the /home filesystem for connection to PUMA
Resolved At-Risk 2022-06-29 10:00 2022-06-29 12:00 Four cabinets were removed from service and then returned to service Reduced number of compute nodes available to users Successfully tested the phased approach for future planned work
Resolved Service Alert 2022-06-15 20:30 2022-06-16 16:30 Compute nodes in cabinets 21-23 734 compute nodes were unavailable to users Power issue with cabinets 21-23
Resolved At-Risk 2022-06-08 09:00 2022-06-13 16:00 Installation of new Programming Environment (22.04) No user impact is expected. The new PE will not be the default module so users will need to load it if they wish to use the latest version. New updated PE is available with several improvements and bug fixes.
Resolved Issue 2022-05-25 11:30 2022-05-25 12:15 Slurm accounting DB Some users saw issues with submitting jobs or using sacct commands to query accounting records. HPE systems have resolved this issue.
Resolved Issue 2022-05-24 14:40 2022-05-24 20:40 Login nodes, compute nodes, running jobs Users may have experienced issues with interactive access on login nodes. Running jobs may have failed and no new jobs were allowed to start. The underlying issue was investigated.
Resolved Issue 2022-05-19 09:00 2022-05-20 16:00 A hardware issue with one of the ARCHER2 worker nodes caused a number of the compute nodes to go into 'completing' mode; these compute nodes required a reboot. Some compute nodes were unavailable while reboots were performed. Some user jobs did not complete fully and others may have failed. These jobs should not be charged, but please contact support@archer2.ac.uk if you want to confirm whether a refund is required. Hardware issue with a worker node.
Resolved Issue 2022-05-09 09:00 2022-05-23 12:00 Slurm batch scheduler upgrade In total the work took around a week, but user impact was limited to around 6 hours: Wednesday 11th May 10:00-17:15 (users were notified when the work was completed) and Monday 23rd May 10:00-12:00. During these windows users could not submit new jobs and new jobs would not start; running jobs were not impacted. Updating the Slurm software.
Resolved Issue 2022-03-30 10:00 2022-03-30 11:00 The Slurm batch scheduler was updated with a new fair share policy. Running jobs were not impacted but users were unable to submit new jobs; impacted users could resubmit jobs once the work had completed. Updating the Slurm configuration.
Resolved Issue 2022-03-30 11:00 2022-03-30 14:20 Slurm scheduler Users may see errors when submitting jobs: `Batch job submission failed: Requested node configuration is not available` and `srun: error: Unable to create step for job 1351277: More processors requested than permitted`, causing jobs either to fail to submit or to fail at runtime. Part of this morning's update has been rolled back and we believe the issue is now resolved.
Resolved Issue 2022-03-24 15:00:00 +0000 2022-03-24 15:45:00 +0000 67 nodes within cabinets 20-23 Reduced number of compute nodes available, which may have caused longer queue times An issue occurred while maintenance work was taking place on cabinets 20-23
Resolved Issue 2022-03-02 10:00:00 +0000 2022-03-02 10:15:00 +0000 The DNS server was moved to a new server. Outgoing connections from ARCHER2 may have been affected for up to five minutes. Updating the DNS server network.
Resolved Issue 2022-02-22 09:30:00 +0000 2022-02-23 10:00:00 +0000 One ARCHER2 cabinet removed from service. A cabinet of compute nodes was unavailable. A replacement PDU was required for the cabinet; once replaced, the cabinet was returned to service.
Resolved Downtime 2022-02-21 15:00:00 +0000 2022-02-22 09:30:00 +0000 Work (/work) Lustre file systems. Compute nodes. Access to ARCHER2 was disabled so users could not log on. Running jobs may have failed and no new user jobs were allowed to start. Work file systems performance issues; a large number of compute nodes were unavailable.
Resolved Issue 2022-02-04 12:05:00 +0000 2022-02-04 12:34:00 +0000 Login nodes The home file system was unavailable, which consequently prevented or significantly slowed logins Network connectivity between the frontend and HomeFS was lost
Resolved Issue 2022-02-03 14:30:00 +0000 2022-02-03 15:00:00 +0000 Login nodes Users may be unable to connect to the login nodes Planned reboot of switch caused loss of connection to login nodes
Resolved Downtime 2022-02-02 11:30:00 +0000 2022-02-02 12:05:00 +0000 Login issue Temporary issue with users unable to connect to login nodes A short outage on the LDAP authentication server while maintenance work took place
Resolved Issue 2022-01-27 14:30:00 +0000 2022-01-27 15:50:00 +0000 Login and data analysis nodes Loss of connection to ARCHER2 for any logged in users, an inability to login for users that were not connected, and jobs running on the data analysis nodes ("serial") partition failed. Login and data analysis nodes needed to be reconfigured
Resolved Issue 2022-01-24 17:30:00 +0000 2022-01-25 14:30:00 +0000 RDFaaS Users could not access their data in /epsrc or /general on the RDFaaS from ARCHER2 The systems team investigated intermittent loss of connection between ARCHER2 and RDFaaS
Resolved Issue 2022-01-20 11:40:00 +0000 2022-01-20 15:20:00 +0000 Login nodes Login node 4 was the only login node available; users could use login4.archer2.ac.uk directly to access the system The systems team investigated
Resolved Notice 2022-01-10 12:00:00 +0000 2022-01-10 12:00:00 +0000 4-cabinet system Users can no longer use the ARCHER2 4-cabinet system or access /work data on the 4-cabinet system The ARCHER2 4-cabinet system was removed from service as planned
Resolved Issue 2022-01-09 10:00:00 +0000 2022-01-10 10:30:00 +0000 Login and Data Analysis nodes, SAFE Outgoing network access from ARCHER2 systems to external sites was not working. SAFE response was slow or degraded. DNS issue at datacentre
Resolved Issue 2021-12-31 10:00:00 +0000 2022-01-10 10:00:00 +0000 Compute nodes Running jobs on down nodes failed, reduced number of compute nodes available A number of compute nodes were unavailable due to a hardware issue

2021

Status Type Start End Scope User Impact Reason
Resolved Issue 2021-12-20 13:00:00 +0000 2021-12-20 14:12:00 +0000 4-cabinet service Slurm scheduler Response from Slurm scheduler was degraded, running jobs unaffected Slurm scheduler was experiencing issues
Resolved Issue 2021-12-16 04:05:00 +0000 2021-12-16 09:41:00 +0000 Slurm scheduler unavailable, outgoing connections failed Slurm commands did not work, outgoing connections did not work, running jobs continued without issue Spine switch became unresponsive
Resolved Issue 2021-11-14 08:00:00 +0000 2021-11-15 09:40:00 +0000 4-cabinet system monitoring Load chart on website missing some historical data A login node issue caused data to not be collected
Resolved Issue 2021-11-05 08:00:00 +0000 2021-11-05 11:30:00 +0000 4-cabinet system compute nodes A large number of compute nodes were unavailable for jobs A power incident in the Edinburgh area caused a number of cabinets to lose power
Resolved Issue 2021-11-03 14:45:00 +0000 2021-11-03 14:45:00 +0000 4-cabinet system login access The login-4c.archer2.ac.uk address was unreachable so logins via this address failed; users could use the address 193.62.216.1 instead A short network outage at the University of Edinburgh caused issues with resolving the ARCHER2 login host names
Resolved Issue 2021-11-01 09:30:00 +0000 2021-11-01 11:30:00 +0000 4-cabinet system compute nodes There were 285 nodes down so queue times may have been longer A user job hit a known bug and brought down 256 compute nodes