2024
Status | Type | Start | End | Scope | User Impact | Reason |
---|---|---|---|---|---|---|
Resolved | Service Alert | 2024-10-01 09:45 | 2024-10-01 12:15 | ARCHER2 work (fs1) file system | Slow response when accessing data on the fs1 work file system. `module` commands show slow response. New work was stopped, but is now being started once more (11:45). | fs1 issues now resolved - please contact the service desk if you see any further problems. |
Resolved | Service Alert | 2024-09-30 06:30 | 2024-09-30 14:00 | ARCHER2 work (fs4) file system | Some data or directories on the file system may be inaccessible. Trying to access inaccessible data may cause the terminal to hang. | An OSS (Object Storage Server) failed and failover did not complete successfully |
Resolved | Service Alert | 2024-09-25 16:00 | 2024-09-26 15:00 | Slurm scheduler | Intermittent issues running Slurm commands | |
Resolved | Service Alert | 2024-09-24 15:30 | 2024-09-24 17:00 | ARCHER2 work (fs1) file system | Slow response to access data on fs1 work file system. `module` commands show slow response. | Contention for file system resources |
Resolved | Service Alert | 2024-09-24 08:00 | 2024-09-26 14:00 | ARCHER2 queues | Users may observe slightly longer queue times for other work while some nodes are reserved for the Capability QoS. | ARCHER2 Capability Days: the third ARCHER2 Capability Days session will run from 24-26 September 2024. |
Resolved | Issue | 2024-08-02 09:00 | 2024-08-09 17:00 | RDFaaS (file systems /epsrc and /general) DMF tape backup system | The tape backup service will be unavailable for the week. No new data will be backed up during this period, so there is a small risk that new data could be lost. Once the service resumes, a catch-up backup will take place so that all data is backed up. | Physical moving of DMF tape drive at the Advanced Computing Facility (ACF) |
Resolved | Service Alert | 2024-07-18 20:30 | 2024-07-19 11:54 | ARCHER2 work (fs3) file system | Issues accessing data on work (fs3) file system | Normal service has been restored |
Resolved | Issue | 2024-06-20 12:30 | 2024-06-20 12:40 | The Slurm batch scheduler will be updated with a new certificate | Running jobs will not be impacted but users will not be able to submit new jobs for a brief period (around 5-10 minutes). If users are impacted, they should wait and then resubmit the job once the work has completed. | Updating the Slurm certificate. |
Resolved | Service Alert | 2024-06-05 14:15 | 2024-06-05 20:40 | ARCHER2 Capability Days compute nodes | Capability Days jobs not starting | Previous Capability Days job I/O caused nodes to fail |
Resolved | Service Alert | 2024-06-05 14:15 | 2024-06-05 20:40 | ARCHER2 work (fs3) file system | Issues accessing data on work (fs3) file system | Failure of Lustre server node and automatic failover of Lustre server node did not succeed |
Resolved | Service Alert | 2024-06-04 08:00 | 2024-06-06 14:00 | ARCHER2 queues | Users may observe slightly longer queue times for other work while some nodes are reserved for the Capability QoS. | ARCHER2 Capability Days: the second ARCHER2 Capability Days session will run from 4-6 June 2024. |
Resolved | Service Alert | 2024-05-27 03:09 | 2024-05-27 13:00 | ARCHER2 compute nodes | CPU compute nodes are unavailable; any jobs running at the time of the power incident will have failed. GPU nodes remain available. | A power incident on the UK national grid in the Edinburgh area resulted in loss of power to ARCHER2 compute nodes |
Resolved | Service Alert | 2024-05-22 11:30 | 2024-05-22 13:30 | Access to License server | The License server is inaccessible - our team is working to restore access | |
Resolved | At-Risk | 2024-05-20 10:00 | 2024-05-25 | ARCHER2 Nodes | A rolling-reboot to update the compute nodes on ARCHER2 which includes the newer CPE (Cray Programming Environment) 23.09. This will not impact running work but once jobs finish, compute nodes will be rebooted and then be returned to service with the new updated software. Serial work is unaffected. | |
Resolved | Service Alert | 2024-05-09 10:00 | 2024-05-09 12:00 | ARCHER2 rundeck ticketing server | May be a delay in processing new user requests via SAFE | Physical moving of the server hosting the rundeck ticketing system |
Resolved | Service Alert | 2024-05-08 14:00 | 2024-05-08 14:08 | Connectivity to ARCHER2 may have a short outage but no impact is expected | We do not expect any user impact but if there is an issue it will be a short connectivity outage | Changing power supply for the JANET CIENA unit |
Resolved | Issue | 2024-04-26 08:25 | 2024-04-26 10:00 | Serial nodes | Serial node dvn01 is currently unavailable. Serial jobs are queued and running but performance may be slower than usual until the issue is resolved. | |
Resolved | Service Alert | 2024-04-25 09:30 | 2024-04-25 10:40 | Serial Nodes, DVN01 and DVN02 | Users will not be able to use the serial nodes. This means members of n02 will not be able to run jobs as their workflow depends on the serial nodes. We appreciate this is both critical and urgent for this project and HPE are investigating. | The heavy load on the metadata server may have impacted the Slurm controller and caused the Slurm daemon to fail on these nodes. Investigation is ongoing. |
Resolved | Service Alert | 2024-04-25 14:00 | 2024-04-25 16:00 | Connectivity to ARCHER2 may have a short outage but no impact is expected | We do not expect any user impact but if there is an issue it will be a short connectivity outage | Changing power supply for the JANET CIENA unit |
Resolved | Service Alert | 2024-04-24 08:00 | 2024-04-24 13:30 | Emails from Service Desk | We believe that emails being sent from the ARCHER2 Service Desk are being delayed downstream, causing them not to be received promptly. We are working to resolve. | |
Resolved | Service Alert | 2024-04-22 08:00 | 2024-04-22 12:00 | Connectivity to ARCHER2 may have a short outage but no impact is expected | We do not expect any user impact but if there is an issue it will be a short connectivity outage | Changing power supply for the JANET CIENA unit |
Resolved | Service Alert | 2024-04-15 14:00 | 2024-04-15 16:00 | ARCHER2 rundeck ticketing server | May be a delay in processing new user requests via SAFE | Physical moving of the server hosting the rundeck ticketing system |
Resolved | Service Alert | 2024-04-15 15:30 | 2024-04-15 16:40 | ARCHER2 login node | Users cannot currently connect to ARCHER2 | Physical moving of the server hosting the ARCHER2 ldap server |
Resolved | Service Alert | 2024-04-15 10:00 | 2024-04-15 10:30 | Outage to DNS server which will impact ARCHER2 and ARCHER2 SAFE | Users can still connect to the service but may be unable to access external websites (eg GitLab) | Migration of a server in preparation for the wider power work affecting the site the following week |
Resolved | Service Alert | 2024-04-11 10:00 | 2024-04-11 10:40 | ARCHER2 rundeck ticketing server | May be a delay in processing new user requests via SAFE | Migration of the rundeck ticketing system |
Resolved | Service Alert | 2024-04-09 10:00 | 2024-04-09 11:00 | ARCHER2 Slurm scheduler | The ARCHER2 Slurm controller will be restarted this morning. Running jobs will continue to run, but Slurm commands will be unavailable for a few minutes. | Adjustment of a scheduling parameter |
Resolved | Service Alert | 2024-04-06 21:37 | 2024-04-06 23:45 | ARCHER2 work4 (fs4) file system | Partial loss of access to work4 (fs4) for a short while | HPE Support are investigating root cause. |
Resolved | Service Alert | 2024-04-23 12:00 | 2024-04-24 15:30 | ARCHER2 work (fs3) file system | Slow response when accessing data on the file system. Update 24th April: we are continuing to investigate and our on-site HPE support team have escalated the issue. Darshan IO monitoring has been enabled for all jobs to help identify the issue. | Extreme load on metadata server. |
Resolved | Service Alert | 2024-03-27 11:45 GMT | 2024-03-28 11:45 GMT | All parallel jobs launched using srun | All parallel jobs launched using `srun` will have their IO profile captured by the Darshan IO profiling tool. In rare cases this may cause jobs to fail or impact performance. Users can disable Darshan by adding the line `module remove darshan` before they use `srun` in their job submission scripts (see the example script after this table). | Capturing data on the IO use on ARCHER2 to improve the service. |
Resolved | Service Alert | 2024-03-24 23:00 | 2024-03-25 11:00 | Most ARCHER2 compute nodes | Users will not be able to run jobs on most of the ARCHER2 compute nodes. Jobs running on compute nodes at the time of the incident will also have failed. | Some compute nodes temporarily lost power and are in the process of being brought back into service. |
Resolved | Service Alert | 2024-03-21 11:15 | 2024-03-21 18:30 | RDFaaS filesystem | Users may experience issues accessing files on the RDFaaS which includes /epsrc and /general file systems. | The e1000 which hosts RDFaaS is experiencing issues. |
Resolved | Service Alert | 2024-03-12 09:30 | 2024-03-13 15:30 | ARCHER2 GPU nodes | The ARCHER2 GPU nodes are reserved on Tuesday 12-03-2024 from 09:30 to 17:00 and on Wednesday 13-03-2024 from 09:30 to 15:30. | The GPU nodes are being used for a training course. Normal access will be restored at 15:30 on Wednesday when the course ends. |
Resolved | Service Alert | 2024-03-08 09:00 | 2024-03-12 18:30 | Compute nodes | We are currently using a rolling-reboot to update the compute nodes on ARCHER2. This will not impact running work but once jobs finish, compute nodes will be rebooted and then be returned to service with the new updated software. Serial work is unaffected. | Updates to ARCHER2 compute nodes |
Resolved | Service Alert | 2024-02-23 12:30 | 2024-02-23 16:40 | Some ARCHER2 compute nodes after a power outage | New jobs had been stopped and some nodes were down. New jobs are now running and almost all nodes are back in service. | Some compute nodes temporarily lost power |
Resolved | Service Alert | 2024-02-15 14:50 GMT | 2024-02-15 15:50 | ARCHER2 work3 (fs3) file system | Very slow response when accessing data on the file system. | Extreme load on metadata server. |
Resolved | Service Alert | 2024-01-30 09:50 GMT | 2024-01-31 09:50 GMT | All parallel jobs launched using srun | All parallel jobs launched using `srun` will have their IO profile captured by the Darshan IO profiling tool. In rare cases this may cause jobs to fail or impact performance. Users can disable Darshan by adding the line `module remove darshan` before they use `srun` in their job submission scripts. | Capturing data on the IO use on ARCHER2 to improve the service. |
Resolved | Service Alert | 2024-01-29 08:00 GMT | 2024-01-29 14:30 | ARCHER2 work3 (fs3) file system | Very slow response or timeout errors when accessing data on the file system. Running jobs using fs3 will have been killed. | Our monitoring has detected issues accessing data on the file system and we are investigating |
Resolved | Service Alert | 2024-01-16 09:00 GMT | 2024-01-19 | Compute nodes | We are currently using a rolling-reboot to update the compute nodes on ARCHER2. This will not impact running work but once jobs finish, compute nodes will be rebooted and then be returned to service with the new updated software. Serial work is unaffected. | Updates to ARCHER2 compute nodes |
Resolved | Service Alert | 2024-01-11 10:00 | 2024-01-11 11:00 | Slurm | Users may see short interruptions to Slurm functionality (e.g. `sbatch`, `squeue` commands). If you experience issues please wait a couple of minutes and try again. | Slurm software is being updated on the system |
Resolved | Service Alert | 2024-01-15 11:00 | 2024-01-15 15:00 | Outage to Slurm scheduler | Users will be able to connect to ARCHER2 and access their data, but Slurm will be unavailable during this work. Running jobs will continue but users will not be able to submit new jobs. Users will be notified when Slurm is available from the login nodes. | Slurm software is being updated to integrate the GPU nodes |
Resolved | Service Alert | 2024-01-07 11:30 GMT | 2024-01-08 09:30 GMT | ARCHER2 work3 (fs3) file system | Some users may see slower than normal file system response | HPE engineers have detected slow response time from the work3 file system and are investigating |
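
The two Darshan alerts above (2024-03-27 and 2024-01-30) note that users can opt out of IO profiling by removing the module before `srun` is called. A minimal sketch of how that looks in a job submission script, assuming hypothetical values for the job name, budget code, node count, and executable:

```bash
#!/bin/bash
#SBATCH --job-name=example        # hypothetical job name
#SBATCH --nodes=2
#SBATCH --time=01:00:00
#SBATCH --account=t01             # hypothetical budget code
#SBATCH --partition=standard
#SBATCH --qos=standard

# Opt out of the Darshan IO profiling described in the alerts above
module remove darshan

srun ./my_app                     # hypothetical executable
```
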
2023
Status | Type | Start | End | Scope | User Impact | Reason |
---|---|---|---|---|---|---|
Resolved | Service Alert | 2023-12-12 12:00 | 2023-12-21 12:00 | ARCHER2 file system | Some users report experiencing slower than normal file system response | We are investigating and monitoring the issue. |
Resolved | Service Alert | 2023-10-19 07:45 | 2023-10-19 10:30 | 3 Cabinets of Compute Nodes | There was a power incident which caused a power outage to three ARCHER2 cabinets. Power has been restored to the cabinets but some user jobs will have failed. Jobs should not have been charged. | Severe weather in the local area |
Resolved | Service Alert | 2023-12-06 09:00 | 2023-12-06 10:00 | ARCHER2 login nodes | Users will now connect using an SSH key and passphrase and a time-based one-time password | Enhanced security |
Resolved | Service Alert | 2023-11-22 10:00 | 2023-11-22 10:10 | /work Lustre file systems | The change should take minutes and should not impact users | Change in Lustre configuration |
Resolved | Service Alert | 2023-11-06 09:20 | 2023-11-06 11:40 | Compute nodes | All running jobs have failed and no new jobs can start on the compute nodes. | An external power event has caused all ARCHER2 compute nodes to be unavailable |
Resolved | Service Alert | 2023-10-30 10:00 | 2023-11-01 10:00 | Slurm | Users may see short interruptions to Slurm functionality (e.g. `sbatch`, `squeue` commands) | Slurm software is being updated on the system |
Resolved | Service Alert | 2023-09-28 17:00 | 2023-09-29 16:00 | ARCHER2 rolling reboot | We are currently using a rolling-reboot to update some of the nodes on ARCHER2. Whilst this is ongoing, existing running work will continue but some new work will not be started. Serial work is unaffected. | Updates to some ARCHER2 nodes |
Resolved | Service Alert | 2023-09-01 09:00 | 2023-10-05 | ARCHER2 work (Lustre) file systems | Users may see errors such as "input/output" errors when accessing data on Lustre file systems | A patch was installed to address the known Lustre bug which may have caused these issues |
Resolved | Service Alert | 2023-08-10 09:00 | 2023-08-10 12:35 | Work file system 3 (fs3) | Commands on fs3 work file system will hang or become unresponsive | No new jobs will start until HPE have resolved the issue. |
Resolved | Service Alert | 2023-06-15 15:00 | 2023-07-15 22:00 | ARCHER2 compute nodes | Jobs may fail intermittently with memory issues (e.g. OOM, hanging with no output, segfault) | A kernel memory leak affected compute nodes with reduced memory being available on nodes over time. HPE have issued a patched kernel for the issue which has been tested by the CSE team and has demonstrated clear improvements in the memory usage. The patch has been applied to all compute nodes. |
Resolved | Service Alert | 2023-06-13 14:45 | 2023-06-14 14:40 | VASP5 module out of memory errors | The CSE team is aware that the VASP5 module is returning "Out of memory" errors with some kinds of simulations. | Module has been rebuilt and issue is resolved. |
Resolved | Service Alert | 2023-06-12 12:55 | 2023-06-12 17:30 | Work file system 3 (fs3) | Commands on fs3 work file system will hang or become unresponsive | Issue was identified as a user job and also an issue with an OST (Object Storage Target). OST was switched out and user asked to remove jobs for further investigation. An issue with the fs3 Lustre file system. Other file systems are working as normal. |
Resolved | Service Alert | 2023-04-18 01:30 | 2023-04-18 12:52 | Login nodes, compute nodes, Lustre file systems | No access to ARCHER2, jobs running at the time of the issue will have failed, no new jobs allowed to start | We are investigating an issue with the Slingshot interconnect on ARCHER2 which has caused compute nodes and Lustre file systems to lose connectivity |
Resolved | Service Alert | 2023-04-06 11:45 | 2023-04-06 16:45 | Slurm scheduler | Users may see issues submitting jobs to Slurm, with the behaviour of jobs and with issuing Slurm commands | We are investigating an issue with the Slurm scheduler |
Resolved | Service Alert | 2023-03-28 09:45 | 2023-03-28 19:25 | Login nodes. Compute nodes | Update 19:25 BST 28 Mar 2023: compute nodes have been returned to service and the reservation has been removed so jobs will now run (30 nodes missing and will hopefully be returned to service tomorrow morning). Update 18:45 BST 28 Mar 2023: login access is available and compute nodes are in the process of being brought back into service; jobs can be submitted and will start once the compute nodes are available. Original impact: new login sessions have been blocked, existing login sessions may become unresponsive, all new jobs on the compute nodes have been prevented from starting, and current running work may fail or run slowly. | We are investigating issues with instability on the ARCHER2 backend cluster |
Resolved | Service alert | 2023-02-14 10:15 GMT | 2023-02-14 13:30 GMT | /work file system (fs2), possible intermittent issues on login/compute nodes | Projects with directories on the fs2 /work file system will not be allowed to run jobs (as the resources may just be wasted) and may see that some data is inaccessible. Users may see occasional issues on login/compute nodes as they try to access the file system. You can check which file system your work directory is on by navigating to the location and using the command `readlink -f .` (see the example after this table). | Heavy I/O caused by a user's jobs. CSE team will work with the user. |
Resolved | Partial | 2023-02-07 18:40 | 2023-02-08 | Four ARCHER2 cabinets currently unavailable. The remainder of ARCHER2 is continuing to run as normal. | All jobs running on the affected cabinets will have failed. These should not have been charged. | Four ARCHER2 cabinets experienced a power interruption. Cabinets returned to service. |
Resolved | Service alert | 2023-01-23 11:30 GMT | 2023-01-23 15:15 GMT | /work file system (fs3), possible intermittent issues on login/compute nodes | Projects with directories on the fs3 /work file system will not be allowed to run jobs (as the resources may just be wasted) and may see that some data is inaccessible. Users may see occasional issues on login/compute nodes as they try to access the failed OSS. You can check which file system your work directory is on by navigating to the location and using the command `readlink -f .`. | Issue is now resolved following a Lustre OSS failure in the fs3 /work file system |
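
Several alerts above advise checking which work file system a directory is on by navigating to it and running `readlink -f .`. A quick illustration with hypothetical project and user names; the resolved path contains the file system identifier:

```bash
# Move to your work directory (hypothetical project "t01" and username)
cd /work/t01/t01/username

# Resolve the real path of the current directory
readlink -f .
# Output containing the file system name (e.g. "fs2") shows which
# /work file system the directory lives on
```
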
2022
Status | Type | Start | End | Scope | User Impact | Reason |
---|---|---|---|---|---|---|
Resolved | Partial | 2022-12-14 09:40 | 2022-12-14 15:30 | Around 2000 nodes are unavailable. | User jobs may take longer to run. Users can still connect to ARCHER2 and submit jobs. | Full service is now available following a power and possible network issue. |
Resolved | Service change | 2022-12-12 09:45 GMT | 2023-12-12 09:45 GMT | Compute nodes | Users may see changes in application performance | The default CPU frequency for parallel jobs started using `srun` has been changed to 2.0 GHz to improve the energy efficiency of ARCHER2. We recommend that users test the energy efficiency of their applications and set the CPU frequency appropriately (see the example after this table). |
Resolved | Service alert | 2022-12-06 09:00 GMT | 2022-12-06 12:00 GMT | /work file system (fs3) | Work to integrate the additional ARCHER2 Lustre file system may result in disruption to one of the existing work file systems: fs3. This will impact any projects who have work storage on fs3. If the work caused any impact to running jobs please contact the service desk with a note of the affected job IDs. | |
Resolved | Service alert | 2022-11-14 09:00 GMT | 2022-12-02 09:00 GMT | ARCHER2 Compute Nodes | Scottish Power Energy Networks (SPEN) have announced a three week at-risk period for the power input to the ACF building. We are hopeful that there will be no user impact but want to share the alert with users. If there is a power blip, we anticipate there will not be any impact to the login nodes but expect there may be an interruption to the compute nodes. Further details will be provided if there are any issues. | |
Resolved | Service alert | 2022-11-13 10:00 GMT | 2022-11-14 09:00 GMT | SAFE website | Users will get a security warning when trying to access SAFE website; some web browsers (e.g. Chrome) will not connect to SAFE website; ARCHER2 load plot on status page will not work | The website certificate has expired |
Resolved | Service alert | 2022-10-24 12:00 | 2022-10-24 18:00 | Administration nodes | Users may experience brief interruptions to node response and delays in Slurm commands | We are updating the system to resolve a security vulnerability |
Resolved | Issue | 2022-10-05 09:00 | 2022-10-17 13:15 | Slingshot interconnect | Some jobs may fail with MPI errors (e.g. "No route to host", "pmi_init failure"), the larger the number of nodes used in the job, the more likely users are to hit the issue. A reboot of Slingshot is planned for Mon 17 Oct 2022 - see maintenance table below for details. | A number of active optical cable links between groups in the interconnect topology are not working |
Resolved | Unplanned | 2022-10-11 12:30 | 2022-10-11 13:00 | DNS issues across the University of Edinburgh (UoE) | Users were unable to connect to ARCHER2 or SAFE. | UoE DNS and network issue. |
Resolved | Partial | 2022-09-21 09:40 | 2022-09-21 11:15 | An issue affected the high-speed network for some nodes. | Communications for some nodes may have been affected. | HPE have reset the affected switches and the issue is now fixed. |
Resolved | At-Risk | 2022-09-12 09:00 | 2022-10-07 17:00 | Groups of four cabinets will be removed from service. | Reduced number of compute nodes available to users | HPE are carrying out essential work on the ARCHER2 system |
Resolved | Partial | 2022-08-31 09:00 | 2022-08-31 11:00 | No access to login nodes, data and SAFE. Running jobs will not be affected and new jobs will start. | Up to 2 hours loss of connection to ARCHER2 login nodes, data and SAFE access | Essential updates to the network configuration. Users will be notified when full service is resumed. |
Resolved | At-risk | 2022-08-24 09:00 | 2022-08-24 13:00 | This maintenance session has now been downgraded to an 'at-risk' session. We do not expect this work to have any impact on user service. | Users will be able to connect to ARCHER2, access data, submit and run jobs. | Essential updates to the network configuration. |
Resolved | Service Alert | 2022-08-11 12:30 | 2022-08-12 11:30 | Compute nodes | Increased queue times and reduced node availability | All nodes now returned to service. Due to high temperatures in the Edinburgh area and to ease the load on the cooling system some nodes were removed from service. Users could connect to ARCHER2, access data and submit jobs to the batch system. |
Resolved | Service Alert | 2022-07-19 14:00 | 2022-07-19 18:30 | Full service | Login access not available, no jobs allowed to start, running jobs will have failed | An internal DNS failure on the system has stopped components communicating |
Resolved | Service Alert | 2022-07-18 14:00 | 2022-07-20 10:00 | Compute nodes | Increased queue times and reduced node availability | Reduced number of compute nodes available as high temperatures in the Edinburgh area are creating cooling issues. Around 4700 compute nodes are currently available and we are continuing to monitor. Full service was restored at 10:00 on 20th July. |
Resolved | Issue | 2022-07-14 22:00 | 2022-07-15 13:50 | Full ARCHER2 system | No access to system | A significant power outage affecting large areas of the Lothians caused ARCHER2 to shut down around 10pm on 14th July. Our team were on site and worked to restore service as soon as possible. |
Resolved | At-Risk | 2022-07-06 09:00 | 2022-07-06 16:00 | /home filesystem and ARCHER2 nodes | No user impact expected | Test new configuration for /home filesystem for connection to PUMA |
Resolved | At-Risk | 2022-06-29 10:00 | 2022-06-29 12:00 | Four cabinets will be removed from service and then returned to service | Reduced number of compute nodes available to users | Successfully tested the phased approach for future planned work |
Resolved | Service Alert | 2022-06-15 20:30 | 2022-06-16 16:30 | Compute nodes in cabinets 21-23 | 734 compute nodes were unavailable to users | Power issue with cabinets 21-23 |
Resolved | At-Risk | 2022-06-09 09:00 | 2022-09-22 16:00 | Compute nodes | No user impact is expected as there is redundancy built into the system | Essential electrical work which will include the cables which feed the Power Distribution Units (PDUs) on ARCHER2 |
Resolved | At-Risk | 2022-06-08 09:00 | 2022-06-13 16:00 | Installation of new Programming Environment (22.04) | No user impact is expected. The new PE will not be the default module so users will need to load it if they wish to use the latest version. | New updated PE is available with several improvements and bug fixes. |
Resolved | Issue | 2022-05-25 11:30 | 2022-05-25 12:15 | Issue with Slurm accounting DB. | Some users are seeing issues with submitting jobs or using sacct commands to query accounting records. | HPE systems have resolved this issue. |
Resolved | Issue | 2022-05-24 14:40 | 2022-05-24 20:40 | Login nodes, compute nodes, running jobs | Users may experience issues with interactive access on login nodes. Running jobs may have failed. No new jobs will be allowed to start. | Investigations on the underlying issue are ongoing. |
Resolved | Issue | 2022-05-19 09:00 | 2022-05-20 16:00 | A hardware issue with one of the ARCHER2 worker nodes has caused a number of the compute nodes to go into 'completing' mode. These compute nodes require a reboot. | Some compute nodes are unavailable as reboots are performed. Some user jobs will not complete fully and others may fail. These jobs should not be charged but please contact support@archer2.ac.uk if you want to confirm whether a refund is required. | Hardware issue with a worker node. |
Resolved | Issue | 2022-05-09 09:00 | 2022-05-23 12:00 | The Slurm batch scheduler will be upgraded. In total the work will take around a week but user impact is expected to be limited to 6 hours when users will not be able to submit new jobs and new jobs will not start | Wednesday 11th May 10:00-17:15 (users notified when work was completed) and Monday 23rd May 10:00-12:00. Running jobs will not be impacted but users will not be able to submit new jobs and new jobs will not start. | Updating the Slurm software. |
Resolved | Issue | 2022-04-19 10:00 | 2022-07-19 10:00 | Work file systems | Users may see issues with slow disk I/O and slow response on login nodes. | There is a heavy load on the /work filesystems. HPE are investigating the cause and we will update as we get more information. |
Resolved | Issue | 2022-03-30 10:00 | 2022-03-30 11:00 | The Slurm batch scheduler will be updated with a new fair share policy. | Running jobs will not be impacted but users will not be able to submit new jobs. If users are impacted, they should wait and then resubmit the job once the work has completed. | Updating the Slurm configuration. |
Resolved | Issue | 2022-03-30 11:00 | 2022-03-30 14:20 | Slurm scheduler | Users may see issues with submitting jobs: `Batch job submission failed: Requested node configuration is not available` and `srun: error: Unable to create step for job 1351277: More processors requested than permitted` which will see jobs not submit or fail at runtime. | Part of this morning's update has been rolled back and we believe this issue is thus now resolved. |
Resolved | Issue | 2022-03-24 15:00:00 +0000 | 2022-03-24 15:45:00 +0000 | 67 nodes within cabinets 20-23 | Reduced number of compute nodes being available which may cause longer queue times | An issue has occurred while maintenance work was taking place on cabinets 20-23 |
Resolved | Issue | 2022-03-02 10:00:00 +0000 | 2022-03-02 10:15:00 +0000 | The DNS Server will be moved to a new server. | The outgoing connections from ARCHER2 may be affected for up to five minutes. | Updating of the DNS Server network. |
Resolved | Issue | 2022-02-22 09:30:00 +0000 | 2022-02-23 10:00:00 +0000 | One ARCHER2 cabinet removed from service. | A cabinet of compute nodes is unavailable. | A replacement PDU is required for an ARCHER2 cabinet. Once replaced, the cabinet will be returned to service. |
Resolved | Downtime | 2022-02-21 15:00:00 +0000 | 2022-02-22 09:30:00 +0000 | Work (/work) Lustre file systems. Compute nodes. | Access to ARCHER2 has been disabled so users will not be able to log on. Running jobs may have failed. No new user jobs will be allowed to start. | Work file systems performance issues. A large number of compute nodes are unavailable. |
Resolved | Issue | 2022-02-04 12:05:00 +0000 | 2022-02-04 12:34:00 +0000 | Login nodes | Home file system unavailable, which consequently prevented or significantly slowed logins | Network connectivity between the frontend and HomeFS was lost |
Resolved | Issue | 2022-02-03 14:30:00 +0000 | 2022-02-03 15:00:00 +0000 | Login nodes | Users may be unable to connect to the login nodes | Planned reboot of switch caused loss of connection to login nodes |
Resolved | Downtime | 2022-02-02 11:30:00 +0000 | 2022-02-02 12:05:00 +0000 | Login issue | Temporary issue with users unable to connect to login nodes | Short outage on the LDAP authentication server while maintenance work took place |
Resolved | Issue | 2022-01-27 14:30:00 +0000 | 2022-01-27 15:50:00 +0000 | Login and data analysis nodes | Loss of connection to ARCHER2 for any logged in users, an inability to login for users that were not connected, and jobs running on the data analysis nodes ("serial") partition failed. | Login and data analysis nodes needed to be reconfigured |
Resolved | Issue | 2022-01-24 17:30:00 +0000 | 2022-01-25 14:30:00 +0000 | RDFaaS | Users cannot access their data in /epsrc or /general on the RDFaaS from ARCHER2 | Systems team are investigating intermittent loss of connection between ARCHER2 and RDFaaS |
Resolved | Issue | 2022-01-20 11:40:00 +0000 | 2022-01-20 15:20:00 +0000 | Login nodes | Login node 4 was the only login node available - users could use login4.archer2.ac.uk directly to access the system | Systems team are investigating |
Resolved | Notice | 2022-01-10 12:00:00 +0000 | 2022-01-10 12:00:00 +0000 | 4-cabinet system | Users can no longer use the ARCHER2 4-cabinet system or access /work data on the 4-cabinet system | The ARCHER2 4-cabinet system was removed from service as planned |
Resolved | Issue | 2022-01-09 10:00:00 +0000 | 2022-01-10 10:30:00 +0000 | Login and Data Analysis nodes, SAFE | Outgoing network access from ARCHER2 systems to external sites was not working. SAFE response was slow or degraded. | DNS issue at datacentre |
Resolved | Issue | 2021-12-31 10:00:00 +0000 | 2022-01-10 10:00:00 +0000 | Compute nodes | Running jobs on down nodes failed, reduced number of compute nodes available | A number of compute nodes were unavailable due to a hardware issue |
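
The 2022-12-12 service change above lowers the default CPU frequency for `srun`-launched jobs to 2.0 GHz and recommends setting the frequency explicitly after testing energy efficiency. A minimal sketch using standard Slurm frequency controls; the 2.25 GHz value and executable name are examples, not recommendations from the alert:

```bash
# Request a specific CPU frequency (in kHz) for a single job step
srun --cpu-freq=2250000 ./my_app

# Or set it once for all srun calls in a job script via Slurm's
# input environment variable
export SLURM_CPU_FREQ_REQ=2250000
srun ./my_app
```
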
2021
Status | Type | Start | End | Scope | User Impact | Reason |
---|---|---|---|---|---|---|
Resolved | Issue | 2021-12-20 13:00:00 +0000 | 2021-12-20 14:12:00 +0000 | 4-cabinet service Slurm scheduler | Response from Slurm scheduler was degraded, running jobs unaffected | Slurm scheduler was experiencing issues |
Resolved | Issue | 2021-12-16 04:05:00 +0000 | 2021-12-16 09:41:00 +0000 | Slurm scheduler unavailable, outgoing connections failed | Slurm commands did not work, outgoing connections did not work, running jobs continued without issue | Spine switch became unresponsive |
Resolved | Issue | 2021-11-14 08:00:00 +0000 | 2021-11-15 09:40:00 +0000 | 4-cabinet system monitoring | Load chart on website missing some historical data | A login node issue caused data to not be collected |
Resolved | Issue | 2021-11-05 08:00:00 +0000 | 2021-11-05 11:30:00 +0000 | 4-cabinet system compute nodes | A large number of compute nodes were unavailable for jobs | A power incident in the Edinburgh area caused a number of cabinets to lose power |
Resolved | Issue | 2021-11-03 14:45:00 +0000 | 2021-11-03 14:45:00 +0000 | 4-cabinet system login access | The login-4c.archer2.ac.uk address is unreachable so logins via this address will fail; users can use the address 193.62.216.1 instead (see the example after this table) | A short network outage at the University of Edinburgh caused issues with resolving the ARCHER2 login host names |
Resolved | Issue | 2021-11-01 09:30:00 +0000 | 2021-11-01 11:30:00 +0000 | 4-cabinet system compute nodes | There were 285 nodes down so queue times may be longer | A user job hit a known bug and brought down 256 compute nodes |
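
For name-resolution incidents like the 2021-11-03 entry above, logging in by IP address bypasses the failing hostname lookup. A one-line example with a hypothetical key path and username:

```bash
# Connect directly by IP when login-4c.archer2.ac.uk does not resolve
ssh -i ~/.ssh/archer2_key username@193.62.216.1
```
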