## 2022
Status | Type | Start | End | Scope | User Impact | Reason |
---|---|---|---|---|---|---|
Resolved | Service Alert | 2022-08-11 12:30 | 2022-08-12 11:30 | Compute nodes | Increased queue times and reduced node availability | All nodes have now been returned to service. Due to high temperatures in the Edinburgh area, and to ease the load on the cooling system, some nodes were removed from service. Users could still connect to ARCHER2, access data and submit jobs to the batch system. |
Resolved | Service Alert | 2022-07-19 14:00 | 2022-07-19 18:30 | Full service | Login access not available, no jobs allowed to start, running jobs will have failed | An internal DNS failure on the system stopped components from communicating |
Resolved | Service Alert | 2022-07-18 14:00 | 2022-07-20 10:00 | Compute nodes | Increased queue times and reduced node availability | Reduced number of compute nodes available as high temperatures in the Edinburgh area created cooling issues. Around 4700 compute nodes remained available and the situation was monitored throughout. Full service was restored at 10:00 on 20th July. |
Resolved | Issue | 2022-07-14 22:00 | 2022-07-15 13:50 | Full ARCHER2 system | No access to system | A significant power outage affecting large areas of the Lothians caused ARCHER2 to shut down around 10pm on 14th July. The team worked on site to restore service as soon as possible. |
Resolved | At-Risk | 2022-07-06 09:00 | 2022-07-06 16:00 | /home filesystem and ARCHER2 nodes | No user impact expected | Testing of a new configuration for the /home filesystem for connection to PUMA |
Resolved | At-Risk | 2022-06-29 10:00 | 2022-06-29 12:00 | Four cabinets removed from service and then returned to service | Reduced number of compute nodes available to users | Successfully tested the phased approach for future planned work |
Resolved | Service Alert | 2022-06-15 20:30 | 2022-06-16 16:30 | Compute nodes in cabinets 21-23 | 734 compute nodes were unavailable to users | Power issue with cabinets 21-23 |
Resolved | At-Risk | 2022-06-08 09:00 | 2022-06-13 16:00 | Installation of new Programming Environment (22.04) | No user impact is expected. The new PE will not be the default module so users will need to load it if they wish to use the latest version. | New updated PE is available with several improvements and bug fixes. |
Resolved | Issue | 2022-05-25 11:30 | 2022-05-25 12:15 | Slurm accounting database | Some users saw issues with submitting jobs or with using `sacct` commands to query accounting records. | The HPE systems team resolved this issue. |
Resolved | Issue | 2022-05-24 14:40 | 2022-05-24 20:40 | Login nodes, compute nodes, running jobs | Users may have experienced issues with interactive access on login nodes. Running jobs may have failed. No new jobs were allowed to start. | Investigations into the underlying issue are ongoing. |
Resolved | Issue | 2022-05-19 09:00 | 2022-05-20 16:00 | A hardware issue with one of the ARCHER2 worker nodes caused a number of the compute nodes to go into 'completing' state. These compute nodes required a reboot. | Some compute nodes were unavailable while reboots were performed. Some user jobs did not complete fully and others may have failed. These jobs should not be charged, but please contact support@archer2.ac.uk if you want to confirm whether a refund is required (see the `sacct` sketch below this table). | Hardware issue with a worker node. |
Resolved | Issue | 2022-05-09 09:00 | 2022-05-23 12:00 | The Slurm batch scheduler was upgraded. In total the work took around a week, but user impact was limited to around 6 hours during which users could not submit new jobs and new jobs did not start. | Wednesday 11th May 1000 – 1715 (users were notified when the work was completed) and Monday 23rd May 1000 – 1200. Running jobs were not impacted but users could not submit new jobs and new jobs did not start. | Updating the Slurm software. |
Resolved | Issue | 2022-03-30 10:00 | 2022-03-30 11:00 | The Slurm batch scheduler will be updated with a new fair share policy. | Running jobs will not be impacted but users will not be able to submit new jobs. If users are impacted, they should wait and then resubmit the job once the work has completed. | Updating the Slurm configuration. |
Resolved | Issue | 2022-03-30 11:00 | 2022-03-30 14:20 | Slurm scheduler | Users may have seen issues when submitting jobs: `Batch job submission failed: Requested node configuration is not available` and `srun: error: Unable to create step for job 1351277: More processors requested than permitted`, causing jobs to fail at submission or at runtime. | Part of this morning's update was rolled back and we believe this issue is now resolved. |
Resolved | Issue | 2022-03-24 15:00:00 +0000 | 2022-03-24 15:45:00 +0000 | 67 nodes within cabinets 20-23 | Reduced number of compute nodes available, which may have caused longer queue times | An issue occurred while maintenance work was taking place on cabinets 20-23 |
Resolved | Issue | 2022-03-02 10:00:00 +0000 | 2022-03-02 10:15:00 +0000 | The DNS service was moved to a new server. | Outgoing connections from ARCHER2 may have been affected for up to five minutes. | Update of the DNS server network. |
Resolved | Issue | 2022-02-22 09:30:00 +0000 | 2022-02-23 10:00:00 +0000 | One ARCHER2 cabinet removed from service. | A cabinet of compute nodes was unavailable to users. | A replacement PDU was required for the ARCHER2 cabinet; once replaced, the cabinet was returned to service. |
Resolved | Downtime | 2022-02-21 15:00:00 +0000 | 2022-02-22 09:30:00 +0000 | Work (/work) Lustre file systems. Compute nodes. | Access to ARCHER2 was disabled so users were not able to log on. Running jobs may have failed. No new user jobs were allowed to start. | Performance issues with the work file systems. A large number of compute nodes were unavailable. |
Resolved | Issue | 2022-02-04 12:05:00 +0000 | 2022-02-04 12:34:00 +0000 | Login nodes | The home file system was unavailable, which prevented or significantly slowed logins | Network connectivity between the frontend and the HomeFS was lost |
Resolved | Issue | 2022-02-03 14:30:00 +0000 | 2022-02-03 15:00:00 +0000 | Login nodes | Users may be unable to connect to the login nodes | Planned reboot of switch caused loss of connection to login nodes |
Resolved | Downtime | 2022-02-02 11:30:00 +0000 | 2022-02-02 12:05:00 +0000 | Login access | Temporary issue with users unable to connect to login nodes | Short outage on the LDAP authentication server while maintenance work took place |
Resolved | Issue | 2022-01-27 14:30:00 +0000 | 2022-01-27 15:50:00 +0000 | Login and data analysis nodes | Loss of connection to ARCHER2 for any logged-in users, an inability to log in for users that were not connected, and failure of jobs running on the data analysis ("serial") partition. | Login and data analysis nodes needed to be reconfigured |
Resolved | Issue | 2022-01-24 17:30:00 +0000 | 2022-01-25 14:30:00 +0000 | RDFaaS | Users could not access their data in /epsrc or /general on the RDFaaS from ARCHER2 | The systems team investigated an intermittent loss of connection between ARCHER2 and the RDFaaS |
Resolved | Issue | 2022-01-20 11:40:00 +0000 | 2022-01-20 15:20:00 +0000 | Login nodes | Login node 4 was the only login node available; users could use login4.archer2.ac.uk directly to access the system | The systems team investigated |
Resolved | Notice | 2022-01-10 12:00:00 +0000 | 2022-01-10 12:00:00 +0000 | 4-cabinet system | Users can no longer use the ARCHER2 4-cabinet system or access /work data on the 4-cabinet system | The ARCHER2 4-cabinet system was removed from service as planned |
Resolved | Issue | 2022-01-09 10:00:00 +0000 | 2022-01-10 10:30:00 +0000 | Login and Data Analysis nodes, SAFE | Outgoing network access from ARCHER2 systems to external sites was not working. SAFE response was slow or degraded. | DNS issue at datacentre |
Resolved | Issue | 2021-12-31 10:00:00 +0000 | 2022-01-10 10:00:00 +0000 | Compute nodes | Running jobs on down nodes failed, reduced number of compute nodes available | A number of compute nodes were unavailable due to a hardware issue |
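
Several of the entries above advise checking whether a job was hit by an incident before requesting a refund (for example the 2022-05-19 worker node issue). Below is a minimal sketch using Slurm's `sacct` accounting command, which the 2022-05-25 entry also mentions; the job ID and the date window are illustrative only:

```bash
# Check the state and exit code of a specific job (job ID is illustrative)
sacct --jobs=1351277 --format=JobID,JobName,State,ExitCode,Elapsed

# List your jobs that failed or lost nodes during a given incident window
sacct --starttime=2022-05-19T09:00:00 --endtime=2022-05-20T16:00:00 \
      --state=FAILED,CANCELLED,NODE_FAIL \
      --format=JobID,State,ExitCode,NNodes,Elapsed
```

Jobs reported as `NODE_FAIL` inside an incident window are the usual candidates for a refund query to support@archer2.ac.uk.
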
## 2021
Status | Type | Start | End | Scope | User Impact | Reason |
---|---|---|---|---|---|---|
Resolved | Issue | 2021-12-20 13:00:00 +0000 | 2021-12-20 14:12:00 +0000 | 4-cabinet service Slurm scheduler | Response from Slurm scheduler was degraded, running jobs unaffected | Slurm scheduler was experiencing issues |
Resolved | Issue | 2021-12-16 04:05:00 +0000 | 2021-12-16 09:41:00 +0000 | Slurm scheduler unavailable, outgoing connections failed | Slurm commands did not work, outgoing connections did not work, running jobs continued without issue | Spine switch became unresponsive |
Resolved | Issue | 2021-11-14 08:00:00 +0000 | 2021-11-15 09:40:00 +0000 | 4-cabinet system monitoring | Load chart on website missing some historical data | A login node issue caused data to not be collected |
Resolved | Issue | 2021-11-05 08:00:00 +0000 | 2021-11-05 11:30:00 +0000 | 4-cabinet system compute nodes | A large number of compute nodes were unavailable for jobs | A power incident in the Edinburgh area caused a number of cabinets to lose power |
Resolved | Issue | 2021-11-03 14:45:00 +0000 | 2021-11-03 14:45:00 +0000 | 4-cabinet system login access | The login-4c.archer2.ac.uk address was unreachable so logins via this address failed; users could use the address 193.62.216.1 instead (see the example below this table) | A short network outage at the University of Edinburgh caused issues with resolving the ARCHER2 login host names |
Resolved | Issue | 2021-11-01 09:30:00 +0000 | 2021-11-01 11:30:00 +0000 | 4-cabinet system compute nodes | There were 285 nodes down so queue times may have been longer | A user job hit a known bug and brought down 256 compute nodes |
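
For the 2021-11-03 entry above, logging in by IP address was the suggested workaround while the login host name could not be resolved. A sketch, with a placeholder username; the IP address is the one quoted in the entry and may not be current:

```bash
# Fall back to the raw IP address when login-4c.archer2.ac.uk does not resolve
ssh username@193.62.216.1
```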