2024

Status Type Start End Scope User Impact Reason
Resolved Service Alert 2024-10-01 09:45 2024-10-01 12:15 ARCHER2 work (fs1) file system Slow response to access data on fs1 work file system. `module` commands show slow response.
New work was stopped, but is now being started once more (11:45).
fs1 issues now resolved - please contact the service desk if you see any further problems.
Resolved Service Alert 2024-09-30 06:30 2024-09-30 14:00 ARCHER2 work (fs4) file system Some data or directories on the file system may be inaccessible. Trying to access inaccessible data may cause the terminal to hang. An OSS has failed and failover did not happen successfully
Resolved Service Alert 2024-09-25 16:00 2024-09-26 15:00 Slurm scheduler Intermittent issues running Slurm commands
Resolved Service Alert 2024-09-24 15:30 2024-09-24 17:00 ARCHER2 work (fs1) file system Slow response to access data on fs1 work file system. `module` commands show slow response. Contention for file system resources
Resolved Service Alert 2024-09-24 08:00 2024-09-26 14:00 ARCHER2 queues Users may observe slightly longer queue times for other work while some nodes are reserved for the Capability QoS. ARCHER2 Capability Days
The third ARCHER2 Capability Days session will run from 24-26 September 2024.
Resolved Issue 2024-08-02 09:00 2024-08-09 17:00 RDFaaS (file systems /epsrc and /general) DMF tape backup system The tape backup service will be unavailable for the week. This means that no new data will be backed up during the week, so there is a small risk that new data could be lost during this week. Once the service is resumed, a catch-up backup will take place, after which all data will be backed up. Physical moving of DMF tape drive at the Advanced Computing Facility (ACF)
Resolved Service Alert 2024-07-18 20:30 2024-07-19 11:54 ARCHER2 work (fs3) file system Issues accessing data on work (fs3) file system Normal service has been restored
Resolved Issue 2024-06-20 12:30 2024-06-20 12:40 The Slurm batch scheduler will be updated with a new certificate Running jobs will not be impacted but users will not be able to submit new jobs for a brief period (around 5-10 minutes). If users are impacted, they should wait and then resubmit the job once the work has completed. Updating the Slurm certificate.
Resolved Service Alert 2024-06-05 14:15 2024-06-05 20:40 ARCHER2 Capability Days compute nodes Capability Days jobs not starting Previous Capability Days job I/O caused nodes to fail
Resolved Service Alert 2024-06-05 14:15 2024-06-05 20:40 ARCHER2 work (fs3) file system Issues accessing data on work (fs3) file system Failure of Lustre server node and automatic failover of Lustre server node did not succeed
Resolved Service Alert 2024-06-04 08:00 2024-06-06 14:00 ARCHER2 queues Users may observe slightly longer queue times for other work while some nodes are reserved for the Capability QoS. ARCHER2 Capability Days
The second ARCHER2 Capability Days session will run from 4-6 June 2024.
Resolved Service Alert 2024-05-27 03:09 2024-05-27 13:00 ARCHER2 compute nodes CPU compute nodes are unavailable, any jobs running at the time of the power incident will have failed. GPU nodes are available Power incident on UK national grid in the Edinburgh area resulted in loss of power to ARCHER2 compute nodes
Resolved Service Alert 2024-05-22 11:30 2024-05-22 13:30 Access to License server The License server is inaccessible - our team are working to restore
Resolved At-Risk 2024-05-20 10:00 2024-05-25 ARCHER2 Nodes A rolling-reboot to update the compute nodes on ARCHER2 which includes the newer CPE (Cray Programming Environment) 23.09. This will not impact running work but once jobs finish, compute nodes will be rebooted and then be returned to service with the new updated software. Serial work is unaffected.
Resolved Service Alert 2024-05-09 10:00 2024-05-09 12:00 ARCHER2 rundeck ticketing server May be a delay in processing new user requests via SAFE Physical moving of the server hosting the rundeck ticketing system
Resolved Service Alert 2024-05-08 14:00 2024-05-08 14:08 Connectivity to ARCHER2 may have a short outage but no impact is expected We do not expect any user impact but if there is an issue it will be a short connectivity outage Changing power supply for the JANET CIENA unit
Resolved Issue 2024-04-26 08:25 2024-04-26 10:00 Serial nodes Serial node dvn01 is currently unavailable. Serial jobs are queued and running but performance may be slower than usual until the issue is resolved.
Resolved Service Alert 2024-04-25 09:30 2024-04-25 10:40 Serial Nodes, DVN01 and DVN02 Users will not be able to use the serial nodes. This means members of n02 will not be able to run jobs as their workflow depends on the serial nodes. We appreciate this is both critical and urgent for this project and HPE are investigating. The heavy load on the metadata server may have impacted the Slurm controller and caused the Slurm daemon to fail on these nodes. Investigation is ongoing.
Resolved Service Alert 2024-04-25 14:00 2024-04-25 16:00 Connectivity to ARCHER2 may have a short outage but no impact is expected We do not expect any user impact but if there is an issue it will be a short connectivity outage Changing power supply for the JANET CIENA unit
Resolved Service Alert 2024-04-24 08:00 2024-04-24 13:30 Emails from Service Desk We believe that emails being sent from the ARCHER2 Service Desk are being delayed downstream, causing them not to be received promptly. We are working to resolve.
Resolved Service Alert 2024-04-22 08:00 2024-04-22 12:00 Connectivity to ARCHER2 may have a short outage but no impact is expected We do not expect any user impact but if there is an issue it will be a short connectivity outage Changing power supply for the JANET CIENA unit
Resolved Service Alert 2024-04-15 14:00 2024-04-15 16:00 ARCHER2 rundeck ticketing server May be a delay in processing new user requests via SAFE Physical moving of the server hosting the rundeck ticketing system
Resolved Service Alert 2024-04-15 15:30 2024-04-15 16:40 ARCHER2 login node Users cannot currently connect to ARCHER2 Physical moving of the server hosting the ARCHER2 ldap server
Resolved Service Alert 2024-04-15 10:00 2024-04-15 10:30 Outage to DNS server which will impact ARCHER2 and ARCHER2 SAFE Users can still connect to service but may be unable to access external websites (eg GitLab) Migration of server in preparation of the wider power work affecting site the following week
Resolved Service Alert 2024-04-11 10:00 2024-04-11 10:40 ARCHER2 rundeck ticketing server May be a delay in processing new user requests via SAFE Migration of the rundeck ticketing system
Resolved Service Alert 2024-04-09 10:00 2024-04-09 11:00 ARCHER2 slurm scheduler ARCHER2 Slurm Controller will be restarted this morning.
Running jobs will continue to run, but Slurm commands will be unavailable for a few minutes.
Adjustment of a scheduling parameter
Resolved Service Alert 2024-04-06 21:37 2024-04-06 23:45 ARCHER2 work4 (fs4) file system Partial loss of access to work4 (fs4) for a short while HPE Support are investigating root cause.
Resolved Service Alert 2024-04-23 12:00 2024-04-24 15:30 ARCHER2 work (fs3) file system Slow response when accessing data on the file system.
Update 24th April: We are continuing to investigate and our on-site HPE support team have escalated the issue. Darshan IO monitoring has been enabled for all jobs to help identify the issue.
Extreme load on metadata server.
Resolved Service Alert 2024-03-27 11:45 GMT 2024-03-28 11:45 GMT All parallel jobs launched using srun All parallel jobs launched using `srun` will have their IO profile captured by the Darshan IO profiling tool. In rare cases this may cause jobs to fail or impact performance. Users can disable Darshan by adding the line `module remove darshan` before they use `srun` in their job submission scripts. Capturing data on the IO use on ARCHER2 to improve the service.
Resolved Service Alert 2024-03-24 23:00 2024-03-25 11:00 Most ARCHER2 compute nodes Users will not be able to run jobs on most of the ARCHER2 compute nodes. Jobs running on compute nodes at the time of the incident will also have failed. Some compute nodes temporarily lost power and are in the process of being brought back into service.
Resolved Service Alert 2024-03-21 11:15 2024-03-21 18:30 RDFaaS filesystem Users may experience issues accessing files on the RDFaaS which includes /epsrc and /general file systems. The e1000 which hosts RDFaaS is experiencing issues.
Resolved Service Alert 2024-03-12 09:30 2024-03-13 15:30 ARCHER2 GPU nodes The ARCHER2 GPU nodes are reserved
Tuesday 12-03-2024 from 09:30 to 17:00
Wednesday 13-03-2024 from 09:30 to 15:30
The GPU nodes are being used for a training course. Normal access will be restored at 15:30 on Wednesday when the course ends.
Resolved Service Alert 2024-03-08 09:00 2024-03-12 18:30 Compute nodes We are currently using a rolling-reboot to update the compute nodes on ARCHER2. This will not impact running work but once jobs finish, compute nodes will be rebooted and then be returned to service with the new updated software. Serial work is unaffected. Updates to ARCHER2 compute nodes
Resolved Service Alert 2024-02-23 12:30 2024-02-23 16:40 Some ARCHER2 compute nodes after a power outage New jobs had been stopped and some nodes down. New jobs now running and almost all nodes back in service. Some compute nodes temporarily lost power
Resolved Service Alert 2024-02-15 14:50 GMT 2024-02-15 15:50 ARCHER2 work3 (fs3) file system Very slow response when accessing data on the file system. Extreme load on metadata server.
Resolved Service Alert 2024-01-30 09:50 GMT 2024-01-31 09:50 GMT All parallel jobs launched using srun All parallel jobs launched using `srun` will have their IO profile captured by the Darshan IO profiling tool. In rare cases this may cause jobs to fail or impact performance. Users can disable Darshan by adding the line `module remove darshan` before they use `srun` in their job submission scripts. Capturing data on the IO use on ARCHER2 to improve the service.
Resolved Service Alert 2024-01-29 08:00 GMT 2024-01-29 14:30 ARCHER2 work3 (fs3) file system Very slow response or timeout errors when accessing data on the file system. Running jobs using fs3 will have been killed. Our monitoring has detected issues accessing data on the file system and we are investigating
Resolved Service Alert 2024-01-16 09:00 GMT 2024-01-19 Compute nodes We are currently using a rolling-reboot to update the compute nodes on ARCHER2. This will not impact running work but once jobs finish, compute nodes will be rebooted and then be returned to service with the new updated software. Serial work is unaffected. Updates to ARCHER2 compute nodes
Resolved Service Alert 2024-01-11 10:00 2024-01-11 11:00 Slurm Users may see short interruptions to Slurm functionality (e.g. `sbatch`, `squeue` commands). If you experience issues please wait a couple of minutes and try again. Slurm software is being updated on the system
Resolved Service Alert 2024-01-15 11:00 2024-01-15 15:00 Outage to slurm scheduler Users will be able to connect to ARCHER2, access their data but slurm will be unavailable during this work. Running jobs will continue but users will not be able to submit new jobs. Users will be notified when slurm is available from the login nodes. Slurm software is being updated to integrate the GPU nodes
Resolved Service Alert 2024-01-07 11:30 GMT 2024-01-08 09:30 GMT ARCHER2 work3 (fs3) file system Some users may see slower than normal file system response HPE engineers have detected slow response time from the work3 file system and are investigating
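
The Darshan alerts in the table above say that users can opt out by adding the line `module remove darshan` before `srun` in their job submission scripts. A minimal sketch of that ordering is given below. The helper name `write_job_script`, the `#SBATCH` values and the executable `./my_app` are illustrative assumptions, not taken from the alerts, and a real script will also need site-specific options such as a budget code.

```shell
#!/bin/bash
# Write a minimal Slurm job script that unloads Darshan before calling srun.
# The #SBATCH values and ./my_app are placeholders (assumptions); a real
# script also needs site-specific options that are omitted here.
write_job_script() {
    cat > "$1" <<'EOF'
#!/bin/bash
#SBATCH --job-name=no_darshan
#SBATCH --nodes=1
#SBATCH --time=00:10:00

# Unload Darshan so that srun does not capture an IO profile for this job.
module remove darshan

srun ./my_app
EOF
}
```

For example, `write_job_script job.slurm` writes the script to `job.slurm`, which could then be submitted with `sbatch job.slurm`. The only point being illustrated is that the `module remove darshan` line comes before the first `srun` call.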

2023

Status Type Start End Scope User Impact Reason
Resolved Service Alert 2023-10-19 07:45 2023-10-19 10:30 3 Cabinets of Compute Nodes There was a power incident which caused a power outage to three ARCHER2 cabinets. Power has been restored to the cabinets but some user jobs will have failed. Jobs should not have been charged. Severe weather in the local area
Resolved Service Alert 2023-12-12 12:00 2023-12-21 12:00 ARCHER2 file system Some users report experiencing slower than normal file system response We are investigating and monitoring the issue.
Resolved Service Alert 2023-12-06 09:00 2023-12-06 10:00 ARCHER2 login nodes Users will now connect using a ssh key and passphrase and a time-based one time password Enhance security
Resolved Service Alert 2023-11-22 10:00 2023-11-22 10:10 /work lustre file systems The change should take minutes and should not impact users Change in lustre configuration
Resolved Service Alert 2023-11-06 09:20 2023-11-06 11:40 Compute nodes All running jobs have failed, no new jobs can start on the compute nodes. An external power event has caused all ARCHER2 compute nodes to be unavailable
Resolved Service Alert 2023-10-30 10:00 2023-11-01 10:00 Slurm Users may see short interruptions to Slurm functionality (e.g. `sbatch`, `squeue` commands) Slurm software is being updated on the system
Resolved Service Alert 2023-09-28 17:00 2023-09-29 16:00 ARCHER2 rolling reboot We are currently using a rolling-reboot to update some of the nodes on ARCHER2.
Whilst this is ongoing, existing running work will continue but some new work will not be started.
Serial work is unaffected.
Updates to some ARCHER2 nodes
Resolved Service Alert 2023-09-01 09:00 2023-10-05 ARCHER2 work (Lustre) file systems Users may see errors such as "input/output" errors when accessing data on Lustre file systems A patch was installed to address the known Lustre bug which may have caused these issues
Resolved Service Alert 2023-08-10 09:00 2023-08-10 12:35 Work file system 3 (fs3) Commands on fs3 work file system will hang or become unresponsive No new jobs will start until HPE have resolved the issue.
Resolved Service Alert 2023-06-15 15:00 2023-07-15 22:00 ARCHER2 compute nodes Jobs may fail intermittently with memory issues (e.g. OOM, hanging with no output, segfault) A kernel memory leak affected compute nodes with reduced memory being available on nodes over time. HPE have issued a patched kernel for the issue which has been tested by the CSE team and has demonstrated clear improvements in the memory usage. The patch has been applied to all compute nodes.
Resolved Service Alert 2023-06-13 14:45 2023-06-14 14:40 VASP5 module out of memory errors The CSE team is aware that the VASP5 module is returning "Out of memory" errors with some kinds of simulations. Module has been rebuilt and issue is resolved.
Resolved Service Alert 2023-06-12 12:55 2023-06-12 17:30 Work file system 3 (fs3) Commands on fs3 work file system will hang or become unresponsive Issue was identified as a user job and also an issue with an OST (Object Storage Target). OST was switched out and user asked to remove jobs for further investigation. An issue with the fs3 Lustre file system. Other file systems are working as normal.
Resolved Service Alert 2023-04-18 01:30 2023-04-18 12:52 Login nodes, compute nodes, Lustre file systems No access to ARCHER2, jobs running at the time of the issue will have failed, no new jobs allowed to start We are investigating an issue with the Slingshot interconnect on ARCHER2 which has caused compute nodes and Lustre file systems to lose connectivity
Resolved Service Alert 2023-04-06 11:45 2023-04-06 16:45 Slurm scheduler Users may see issues submitting jobs to Slurm, with the behaviour of jobs and with issuing Slurm commands We are investigating an issue with the Slurm scheduler
Resolved Service Alert 2023-03-28 09:45 2023-03-28 19:25 Login nodes. Compute nodes Update 1925 BST 28 Mar 2023 Compute nodes are now returned to service and reservation has been removed so jobs will now run (30 nodes missing and will hopefully be returned to service tomorrow morning).
Update 1845 BST 28 Mar 2023 - login access is available, compute nodes in process of being brought back into service. Jobs can be submitted and will start once the compute nodes are available.
Original impact - new login sessions have been blocked. Existing login sessions may become unresponsive. All new jobs on the compute nodes have been prevented from starting. Current running work may fail or run slow.
We are investigating issues with instability on the ARCHER2 backend cluster
Resolved Service alert 2023-02-14 10:15 GMT 2023-02-14 13:30 GMT /work file system (fs2), possible intermittent issues on login/compute nodes Projects with directories on the fs2 /work file system will not be allowed to run jobs (as the resources may just be wasted) and may see that some data is inaccessible. Users may see occasional issues on login/compute nodes as they try to access the filesystem. You can check which file system your work directory is on by navigating to the location and using the command `readlink -f .` Heavy i/o caused by a user's jobs. CSE team will work with the user.
Resolved Partial 2023-02-07 18:40 2023-02-08 Four ARCHER2 cabinets currently unavailable. The remainder of ARCHER2 is continuing to run as normal. All jobs running on the affected cabinets will have failed. These should not have been charged. Four ARCHER2 cabinets experienced a power interruption. Cabinets returned to service.
Resolved Service alert 2023-01-23 11:30 GMT 2023-01-23 15:15 GMT /work file system (fs3), possible intermittent issues on login/compute nodes Projects with directories on the fs3 /work file system will not be allowed to run jobs (as the resources may just be wasted) and may see that some data is inaccessible. Users may see occasional issues on login/compute nodes as they try to access the failed OSS. You can check which file system your work directory is on by navigating to the location and using the command `readlink -f .` Issue is now resolved following a Lustre OSS failure in the fs3 /work file system
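
The fs2 and fs3 alerts above suggest checking which file system a work directory is on by navigating to it and running `readlink -f .`. A small sketch of that check follows. The function name `resolve_dir` is a hypothetical helper, and the exact form of the resolved path depends on how the system links its work directories, so no sample output is shown.

```shell
#!/bin/bash
# Print the fully resolved path of a directory (default: the current one).
# Where work directories are reached through symbolic links, the resolved
# path shows the underlying location, from which the file system can be read.
resolve_dir() {
    readlink -f "${1:-.}"
}
```

Running `resolve_dir` from inside a work directory is equivalent to the `readlink -f .` command in the alerts; passing a path, as in `resolve_dir /path/to/dir`, avoids having to change directory first.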

2022

Status Type Start End Scope User Impact Reason
Resolved Partial 2022-12-14 09:40 2022-12-14 15:30 Around 2000 nodes are unavailable. User jobs may take longer to run. Users can still connect to ARCHER2 and submit jobs. Full service is now available following a power and possible network issue.

2023

Status Type Start End Scope User Impact Reason
Resolved Service change 2022-12-12 09:45 GMT 2023-12-12 09:45 GMT Compute nodes Users may see changes in application performance The default CPU frequency for parallel jobs started using `srun` has been changed to 2.0 GHz to improve the energy efficiency of ARCHER2. We recommend that users test the energy efficiency of their applications and set the CPU frequency appropriately.
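
The service change above recommends that users set the CPU frequency appropriately for their applications. One way to do this, sketched below, is Slurm's `--cpu-freq` option to `srun`, which takes a value in kHz. The helper names `ghz_to_khz` and `srun_command_for` and the executable `./my_app` are hypothetical, and the notice does not say which frequencies are available, so the 2.25 GHz value used in the usage note is illustrative only.

```shell
#!/bin/bash
# Convert a CPU frequency in GHz to the kHz value that Slurm's
# --cpu-freq option expects.
ghz_to_khz() {
    awk -v ghz="$1" 'BEGIN { printf "%.0f\n", ghz * 1000000 }'
}

# Build (but do not run) an srun command line for a given frequency in GHz.
# ./my_app is a hypothetical executable name.
srun_command_for() {
    echo "srun --cpu-freq=$(ghz_to_khz "$1") ./my_app"
}
```

For example, `srun_command_for 2.25` prints `srun --cpu-freq=2250000 ./my_app`, which could be placed in a job script in place of a plain `srun ./my_app` line. Comparing run time and energy use at the 2.0 GHz default and at another frequency is one way to follow the recommendation to test an application.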

2022

Status Type Start End Scope User Impact Reason
Resolved Service alert 2022-12-06 09:00 GMT 2022-12-06 12:00 GMT /work file system (fs3) Work to integrate the additional ARCHER2 Lustre file system may result in disruption to one of the existing work file systems: fs3. This will impact any projects who have work storage on fs3.
If the work caused any impact to running jobs please contact the service desk with a note of the affected job IDs.
Resolved Service alert 2022-11-14 09:00 GMT 2022-12-02 09:00 GMT ARCHER2 Compute Nodes Scottish Power Energy Networks (SPEN) have announced a three week at-risk period for the power input to the ACF building. We are hopeful that there will be no user impact but want to share the alert with users. If there is a power blip, we anticipate there will not be any impact to the login nodes but expect there may be an interruption to the compute nodes. Further details will be provided if there are any issues.
Resolved Service alert 2022-11-13 10:00 GMT 2022-11-14 09:00 GMT SAFE website Users will get a security warning when trying to access SAFE website; some web browsers (e.g. Chrome) will not connect to SAFE website; ARCHER2 load plot on status page will not work The website certificate has expired
Resolved Service alert 2022-10-24 12:00 2022-10-24 18:00 Administration nodes Users may experience brief interruptions to node response and delays in Slurm commands We are updating the system to resolve a security vulnerability
Resolved Issue 2022-10-05 09:00 2022-10-17 13:15 Slingshot interconnect Some jobs may fail with MPI errors (e.g. "No route to host", "pmi_init failure"), the larger the number of nodes used in the job, the more likely users are to hit the issue. A reboot of Slingshot is planned for Mon 17 Oct 2022 - see maintenance table below for details. A number of active optical cable links between groups in the interconnect topology are not working
Resolved Unplanned 2022-10-11 12:30 2022-10-11 13:00 DNS issues across the UoE Users were unable to connect to ARCHER2 or SAFE. UoE DNS and network issue.
Resolved Partial 2022-09-21 09:40 2022-09-21 11:15 An issue affected the high-speed network for some nodes. Some node comms may be affected. HPE have reset the affected switches and the issue is now fixed.
Resolved At-Risk 2022-09-12 09:00 2022-10-07 17:00 Groups of four cabinets will be removed from service. Reduced number of compute nodes available to users HPE are carrying out essential work on the ARCHER2 system
Resolved Partial 2022-08-31 09:00 2022-08-31 11:00 No access to login nodes, data and SAFE. Running jobs will not be affected and new jobs will start. Up to 2 hours loss of connection to ARCHER2 login nodes, data and SAFE access Essential updates to the network configuration. Users will be notified when full service is resumed.
Resolved At-risk 2022-08-24 09:00 2022-08-24 13:00 This maintenance session has now been downgraded to an 'at-risk' session. We do not expect this work to have any impact on user service. Users will be able to connect to ARCHER2, access data, submit and run jobs. Essential updates to the network configuration.
Resolved Service Alert 2022-08-11 12:30 2022-08-12 11:30 Compute nodes Increased queue times and reduced node availability All nodes now returned to service. Due to high temperatures in the Edinburgh area and to ease the load on the cooling system some nodes were removed from service. Users could connect to ARCHER2, access data and submit jobs to the batch system.
Resolved Service Alert 2022-07-19 14:00 2022-07-19 18:30 Full service Login access not available, no jobs allowed to start, running jobs will have failed An internal DNS failure on the system has stopped components communicating
Resolved Service Alert 2022-07-18 14:00 2022-07-20 10:00 Compute nodes Increased queue times and reduced node availability Reduced number of compute nodes available as high temperatures in the Edinburgh area are creating cooling issues. Around 4700 compute nodes are currently available and we are continuing to monitor. Full service restored at 10:00 20th.
Resolved Issue 2022-07-14 22:00 2022-07-15 13:50 Full ARCHER2 system No access to system A significant power outage affecting large areas of the Lothians caused ARCHER2 to shut down around 10pm on 14th July.
Our team are on site and working to restore service as soon as possible.
Resolved At-Risk 2022-07-06 09:00 2022-07-06 16:00 /home filesystem and ARCHER2 nodes No user impact expected Test new configuration for /home filesystem for connection to PUMA
Resolved At-Risk 2022-06-29 10:00 2022-06-29 12:00 Four cabinets will be removed from service and then returned to service Reduced number of compute nodes available to users Successfully tested the phased approach for future planned work
Resolved Service Alert 2022-06-15 20:30 2022-06-16 16:30 Compute nodes in cabinets 21-23 734 compute nodes were unavailable to users Power issue with cabinets 21-23
Resolved At-Risk 2022-06-09 09:00 2022-09-22 16:00 Compute nodes No user impact is expected as there is redundancy built into the system Essential electrical work which will include the cables which feed the Power Distribution Units (PDUs) on ARCHER2
Resolved At-Risk 2022-06-08 09:00 2022-06-13 16:00 Installation of new Programming Environment (22.04) No user impact is expected. The new PE will not be the default module so users will need to load it if they wish to use the latest version. New updated PE is available with several improvements and bug fixes.
Resolved Issue 2022-05-25 11:30 2022-05-25 12:15 Issue with Slurm accounting DB. Some users are seeing issues with submitting jobs or using sacct commands to query accounting records. HPE systems have resolved this issue.
Resolved Issue 2022-05-24 14:40 2022-05-24 20:40 Login nodes, compute nodes, running jobs Users may experience issues with interactive access on login nodes. Running jobs may have failed. No new jobs will be allowed to start. Investigations on the underlying issue are ongoing.
Resolved Issue 2022-05-19 09:00 2022-05-20 16:00 A hardware issue with one of the ARCHER2 worker nodes has caused a number of the compute nodes to go into 'completing' mode. These compute nodes require a reboot. Some compute nodes are unavailable as reboots are performed. Some user jobs will not complete fully and others may fail. These jobs should not be charged but please contact support@archer2.ac.uk if you want to confirm whether a refund is required. Hardware issue with a worker node.
Resolved Issue 2022-05-09 09:00 2022-05-23 12:00 The Slurm batch scheduler will be upgraded. In total the work will take around a week but user impact is expected to be limited to 6 hours when users will not be able to submit new jobs and new jobs will not start Wednesday 11th May 1000 – 1715 (users notified when work was completed) and Monday 23rd May 1000 – 1200
Running jobs will not be impacted but users will not be able to submit new jobs and new jobs will not start.
Updating the Slurm software.
Resolved Issue 2022-04-19 10:00 2022-07-19 10:00 Work file systems Users may see issues with slow disk I/O and slow response on login nodes. There is a heavy load on the /work filesystems. HPE are investigating the cause and we will update as we get more information.
Resolved Issue 2022-03-30 10:00 2022-03-30 11:00 The Slurm batch scheduler will be updated with a new fair share policy. Running jobs will not be impacted but users will not be able to submit new jobs. If users are impacted, they should wait and then resubmit the job once the work has completed. Updating the Slurm configuration.
Resolved Issue 2022-03-30 11:00 2022-03-30 14:20 Slurm scheduler Users may see issues with submitting jobs: `Batch job submission failed: Requested node configuration is not available` and `srun: error: Unable to create step for job 1351277: More processors requested than permitted` which will see jobs not submit or fail at runtime. Part of this morning's update has been rolled back and we believe this issue is thus now resolved.
Resolved Issue 2022-03-24 15:00:00 +0000 2022-03-24 15:45:00 +0000 67 nodes within cabinets 20-23 Reduced number of compute nodes being available which may cause longer queue times An issue has occurred while maintenance work was taking place on cabinets 20-23
Resolved Issue 2022-03-02 10:00:00 +0000 2022-03-02 10:15:00 +0000 The DNS Server will be moved to a new server. The outgoing connections from ARCHER2 may be affected for up to five minutes. Updating of the DNS Server network.
Resolved Issue 2022-02-22 09:30:00 +0000 2022-02-23 10:00:00 +0000 One ARCHER2 cabinet removed from service. A replacement PDU is required for an ARCHER2 cabinet. Once replaced, the cabinet will be returned to service. A cabinet of compute nodes are unavailable.
Resolved Downtime 2022-02-21 15:00:00 +0000 2022-02-22 09:30:00 +0000 Work (/work) Lustre file systems. Compute nodes. Access to ARCHER2 has been disabled so users will not be able to log on. Running jobs may have failed. No new user jobs will be allowed to start. Work file systems performance issues. A large number of compute nodes are unavailable.
Resolved Issue 2022-02-04 12:05:00 +0000 2022-02-03 12:34:00 +0000 Login nodes Home file system unavailable, which consequently prevented or significantly slowed logins Network connectivity between the frontend and HomeFS was lost
Resolved Issue 2022-02-03 14:30:00 +0000 2022-02-03 15:00:00 +0000 Login nodes Users may be unable to connect to the login nodes Planned reboot of switch caused loss of connection to login nodes
Resolved Downtime 2022-02-02 11:30:00 +0000 2022-02-02 12:05:00 +0000 Login issue Temporary issue with users unable to connect to login nodes short outage on the ldap authentication server while maintenance work took place
Resolved Issue 2022-01-27 14:30:00 +0000 2022-01-27 15:50:00 +0000 Login and data analysis nodes Loss of connection to ARCHER2 for any logged in users, an inability to login for users that were not connected, and jobs running on the data analysis nodes ("serial") partition failed. Login and data analysis nodes needed to be reconfigured
Resolved Issue 2022-01-24 17:30:00 +0000 2022-01-25 14:30:00 +0000 RDFaaS Users cannot access their data in /epsrc or /general on the RDFaaS from ARCHER2 Systems team are investigating intermittent loss of connection between ARCHER2 and RDFaaS
Resolved Issue 2022-01-20 11:40:00 +0000 2022-01-20 15:20:00 +0000 login nodes Login node 4 was the only login node available - users could use login4.archer2.ac.uk directly to access the system Systems team are investigating
Resolved Notice 2022-01-10 12:00:00 +0000 2022-01-10 12:00:00 +0000 4-cabinet system Users can no longer use the ARCHER2 4-cabinet system or access /work data on the 4-cabinet system The ARCHER2 4-cabinet system was removed from service as planned
Resolved Issue 2022-01-09 10:00:00 +0000 2022-01-10 10:30:00 +0000 Login and Data Analysis nodes, SAFE Outgoing network access from ARCHER2 systems to external sites was not working. SAFE response was slow or degraded. DNS issue at datacentre
Resolved Issue 2021-12-31 10:00:00 +0000 2022-01-10 10:00:00 +0000 Compute nodes Running jobs on down nodes failed, reduced number of compute nodes available A number of compute nodes are unavailable due to hardware issue

2021

Status Type Start End Scope User Impact Reason
Resolved Issue 2021-12-20 13:00:00 +0000 2021-12-20 14:12:00 +0000 4-cabinet service Slurm scheduler Response from Slurm scheduler was degraded, running jobs unaffected Slurm scheduler was experiencing issues
Resolved Issue 2021-12-16 04:05:00 +0000 2021-12-16 09:41:00 +0000 Slurm scheduler unavailable, outgoing connections failed Slurm commands did not work, outgoing connections did not work, running jobs continued without issue Spine switch became unresponsive
Resolved Issue 2021-11-14 08:00:00 +0000 2021-11-15 09:40:00 +0000 4-cabinet system monitoring Load chart on website missing some historical data A login node issue caused data to not be collected
Resolved Issue 2021-11-05 08:00:00 +0000 2021-11-05 11:30:00 +0000 4-cabinet system compute nodes A large number of compute nodes were unavailable for jobs A power incident in the Edinburgh area caused a number of cabinets to lose power
Resolved Issue 2021-11-03 14:45:00 +0000 2021-11-03 14:45:00 +0000 4-cabinet system login access The login-4c.archer2.ac.uk address is unreachable so logins via this address will fail, users can use the address 193.62.216.1 instead A short network outage at the University of Edinburgh caused issues with resolving the ARCHER2 login host names
Resolved Issue 2021-11-01 09:30:00 +0000 2021-11-01 11:30:00 +0000 4-cabinet system compute nodes There are 285 nodes down so queue times may be longer A user job hit a known bug and brought down 256 compute nodes
Resolved Issue 2021-11-01 09:30:00 +0000 2021-11-01 11:30:00 +0000 4-cabinet system compute nodes There were 285 nodes down so queue times may be longer A user job hit a known bug and brought down 256 nodes