2024
Status | Type | Start | End | Scope | User Impact | Reason |
---|---|---|---|---|---|---|
Resolved | Service Alert | 2024-10-01 09:45 | 2024-10-01 12:15 | ARCHER2 work (fs1) file system | Slow response when accessing data on the fs1 work file system. `module` commands show slow response. New work was stopped, but is now being started once more (11:45). | fs1 issues now resolved - please contact the service desk if you see any further problems. |
Resolved | Service Alert | 2024-09-30 06:30 | 2024-09-30 14:00 | ARCHER2 work (fs4) file system | Some data or directories on the file system may be inaccessible. Trying to access inaccessible data may cause the terminal to hang. | An OSS (Object Storage Server) failed and failover did not complete successfully |
Resolved | Service Alert | 2024-09-25 16:00 | 2024-09-26 15:00 | Slurm scheduler | Intermittent issues running Slurm commands | |
Resolved | Service Alert | 2024-09-24 15:30 | 2024-09-24 17:00 | ARCHER2 work (fs1) file system | Slow response to access data on fs1 work file system. `module` commands show slow response. | Contention for file system resources |
Resolved | Service Alert | 2024-09-24 08:00 | 2024-09-26 14:00 | ARCHER2 queues | Users may observe slightly longer queue times for other work while some nodes are reserved for the Capability QoS. | ARCHER2 Capability Days: the third ARCHER2 Capability Days session will run from 24-26 September 2024. |
Resolved | Issue | 2024-08-02 09:00 | 2024-08-09 17:00 | RDFaaS (file systems /epsrc and /general) DMF tape backup system | The tape backup service will be unavailable for the week. No new data will be backed up during this period, so there is a small risk that new data could be lost. Once the service resumes, a catch-up backup will take place so that all data is backed up. | Physical moving of DMF tape drive at the Advanced Computing Facility (ACF) |
Resolved | Service Alert | 2024-07-18 20:30 | 2024-07-19 11:54 | ARCHER2 work (fs3) file system | Issues accessing data on work (fs3) file system | Normal service has been restored |
Resolved | Issue | 2024-06-20 12:30 | 2024-06-20 12:40 | The Slurm batch scheduler will be updated with a new certificate | Running jobs will not be impacted but users will not be able to submit new jobs for a brief period (around 5-10 minutes). If users are impacted, they should wait and then resubmit the job once the work has completed. | Updating the Slurm certificate. |
Resolved | Service Alert | 2024-06-05 14:15 | 2024-06-05 20:40 | ARCHER2 Capability Days compute nodes | Capability Days jobs not starting | Previous Capability Days job I/O caused nodes to fail |
Resolved | Service Alert | 2024-06-05 14:15 | 2024-06-05 20:40 | ARCHER2 work (fs3) file system | Issues accessing data on work (fs3) file system | Failure of Lustre server node and automatic failover of Lustre server node did not succeed |
Resolved | Service Alert | 2024-06-04 08:00 | 2024-06-06 14:00 | ARCHER2 queues | Users may observe slightly longer queue times for other work while some nodes are reserved for the Capability QoS. | ARCHER2 Capability Days: the second ARCHER2 Capability Days session will run from 4-6 June 2024. |
Resolved | Service Alert | 2024-05-27 03:09 | 2024-05-27 13:00 | ARCHER2 compute nodes | CPU compute nodes are unavailable; any jobs running at the time of the power incident will have failed. GPU nodes remain available. | A power incident on the UK national grid in the Edinburgh area resulted in loss of power to ARCHER2 compute nodes |
Resolved | Service Alert | 2024-05-22 11:30 | 2024-05-22 13:30 | Access to License server | The License server is inaccessible - our team is working to restore access | |
Resolved | At-Risk | 2024-05-20 10:00 | 2024-05-25 | ARCHER2 Nodes | A rolling-reboot to update the compute nodes on ARCHER2 which includes the newer CPE (Cray Programming Environment) 23.09. This will not impact running work but once jobs finish, compute nodes will be rebooted and then be returned to service with the new updated software. Serial work is unaffected. | |
Resolved | Service Alert | 2024-05-09 10:00 | 2024-05-09 12:00 | ARCHER2 rundeck ticketing server | May be a delay in processing new user requests via SAFE | Physical moving of the server hosting the rundeck ticketing system |
Resolved | Service Alert | 2024-05-08 14:00 | 2024-05-08 14:08 | Connectivity to ARCHER2 may have a short outage but no impact is expected | We do not expect any user impact but if there is an issue it will be a short connectivity outage | Changing power supply for the JANET CIENA unit |
Resolved | Issue | 2024-04-26 08:25 | 2024-04-26 10:00 | Serial nodes | Serial node dvn01 is currently unavailable. Serial jobs are queued and running but performance may be slower than usual until the issue is resolved. | |
Resolved | Service Alert | 2024-04-25 09:30 | 2024-04-25 10:40 | Serial Nodes, DVN01 and DVN02 | Users will not be able to use the serial nodes. This means members of n02 will not be able to run jobs as their workflow depends on the serial nodes. We appreciate this is both critical and urgent for this project and HPE are investigating. | The heavy load on the metadata server may have impacted the Slurm controller and caused the Slurm daemon to fail on these nodes. Investigation is ongoing. |
Resolved | Service Alert | 2024-04-25 14:00 | 2024-04-25 16:00 | Connectivity to ARCHER2 may have a short outage but no impact is expected | We do not expect any user impact but if there is an issue it will be a short connectivity outage | Changing power supply for the JANET CIENA unit |
Resolved | Service Alert | 2024-04-24 08:00 | 2024-04-24 13:30 | Emails from Service Desk | We believe that emails being sent from the ARCHER2 Service Desk are being delayed downstream, causing them not to be received promptly. We are working to resolve. | |
Resolved | Service Alert | 2024-04-22 08:00 | 2024-04-22 12:00 | Connectivity to ARCHER2 may have a short outage but no impact is expected | We do not expect any user impact but if there is an issue it will be a short connectivity outage | Changing power supply for the JANET CIENA unit |
Resolved | Service Alert | 2024-04-15 14:00 | 2024-04-15 16:00 | ARCHER2 rundeck ticketing server | May be a delay in processing new user requests via SAFE | Physical moving of the server hosting the rundeck ticketing system |
Resolved | Service Alert | 2024-04-15 15:30 | 2024-04-15 16:40 | ARCHER2 login node | Users cannot currently connect to ARCHER2 | Physical moving of the server hosting the ARCHER2 ldap server |
Resolved | Service Alert | 2024-04-15 10:00 | 2024-04-15 10:30 | Outage to DNS server which will impact ARCHER2 and ARCHER2 SAFE | Users can still connect to the service but may be unable to access external websites (eg GitLab) | Migration of a server in preparation for the wider power work affecting the site the following week |
Resolved | Service Alert | 2024-04-11 10:00 | 2024-04-11 10:40 | ARCHER2 rundeck ticketing server | May be a delay in processing new user requests via SAFE | Migration of the rundeck ticketing system |
Resolved | Service Alert | 2024-04-09 10:00 | 2024-04-09 11:00 | ARCHER2 Slurm scheduler | The ARCHER2 Slurm controller will be restarted this morning. Running jobs will continue to run, but Slurm commands will be unavailable for a few minutes. | Adjustment of a scheduling parameter |
Resolved | Service Alert | 2024-04-06 21:37 | 2024-04-06 23:45 | ARCHER2 work4 (fs4) file system | Partial loss of access to work4 (fs4) for a short while | HPE Support are investigating root cause. |
Resolved | Service Alert | 2024-04-23 12:00 | 2024-04-24 15:30 | ARCHER2 work (fs3) file system | Slow response when accessing data on the file system. Update 24th April: we are continuing to investigate and our on-site HPE support team have escalated the issue. Darshan IO monitoring has been enabled for all jobs to help identify the issue. | Extreme load on metadata server. |
Resolved | Service Alert | 2024-03-27 11:45 GMT | 2024-03-28 11:45 GMT | All parallel jobs launched using srun | All parallel jobs launched using `srun` will have their IO profile captured by the Darshan IO profiling tool. In rare cases this may cause jobs to fail or impact performance. Users can disable Darshan by adding the line `module remove darshan` before they use `srun` in their job submission scripts (see the example script after this table). | Capturing data on the IO use on ARCHER2 to improve the service. |
Resolved | Service Alert | 2024-03-24 23:00 | 2024-03-25 11:00 | Most ARCHER2 compute nodes | Users will not be able to run jobs on most of the ARCHER2 compute nodes. Jobs running on compute nodes at the time of the incident will also have failed. | Some compute nodes temporarily lost power and are in the process of being brought back into service. |
Resolved | Service Alert | 2024-03-21 11:15 | 2024-03-21 18:30 | RDFaaS filesystem | Users may experience issues accessing files on the RDFaaS which includes /epsrc and /general file systems. | The e1000 which hosts RDFaaS is experiencing issues. |
Resolved | Service Alert | 2024-03-12 09:30 | 2024-03-13 15:30 | ARCHER2 GPU nodes | The ARCHER2 GPU nodes are reserved on Tuesday 12-03-2024 from 09:30 to 17:00 and on Wednesday 13-03-2024 from 09:30 to 15:30. | The GPU nodes are being used for a training course. Normal access will be restored at 15:30 on Wednesday when the course ends. |
Resolved | Service Alert | 2024-03-08 09:00 | 2024-03-12 18:30 | Compute nodes | We are currently using a rolling-reboot to update the compute nodes on ARCHER2. This will not impact running work but once jobs finish, compute nodes will be rebooted and then be returned to service with the new updated software. Serial work is unaffected. | Updates to ARCHER2 compute nodes |
Resolved | Service Alert | 2024-02-23 12:30 | 2024-02-23 16:40 | Some ARCHER2 compute nodes after a power outage | New jobs had been stopped and some nodes were down. New jobs are now running and almost all nodes are back in service. | Some compute nodes temporarily lost power |
Resolved | Service Alert | 2024-02-15 14:50 GMT | 2024-02-15 15:50 | ARCHER2 work3 (fs3) file system | Very slow response when accessing data on the file system. | Extreme load on metadata server. |
Resolved | Service Alert | 2024-01-30 09:50 GMT | 2024-01-31 09:50 GMT | All parallel jobs launched using srun | All parallel jobs launched using `srun` will have their IO profile captured by the Darshan IO profiling tool. In rare cases this may cause jobs to fail or impact performance. Users can disable Darshan by adding the line `module remove darshan` before they use `srun` in their job submission scripts. | Capturing data on the IO use on ARCHER2 to improve the service. |
Resolved | Service Alert | 2024-01-29 08:00 GMT | 2024-01-29 14:30 | ARCHER2 work3 (fs3) file system | Very slow response or timeout errors when accessing data on the file system. Running jobs using fs3 will have been killed. | Our monitoring has detected issues accessing data on the file system and we are investigating |
Resolved | Service Alert | 2024-01-16 09:00 GMT | 2024-01-19 | Compute nodes | We are currently using a rolling-reboot to update the compute nodes on ARCHER2. This will not impact running work but once jobs finish, compute nodes will be rebooted and then be returned to service with the new updated software. Serial work is unaffected. | Updates to ARCHER2 compute nodes |
Resolved | Service Alert | 2024-01-11 10:00 | 2024-01-11 11:00 | Slurm | Users may see short interruptions to Slurm functionality (e.g. `sbatch`, `squeue` commands). If you experience issues please wait a couple of minutes and try again. | Slurm software is being updated on the system |
Resolved | Service Alert | 2024-01-15 11:00 | 2024-01-15 15:00 | Outage to Slurm scheduler | Users will be able to connect to ARCHER2 and access their data, but Slurm will be unavailable during this work. Running jobs will continue but users will not be able to submit new jobs. Users will be notified when Slurm is available from the login nodes. | Slurm software is being updated to integrate the GPU nodes |
Resolved | Service Alert | 2024-01-07 11:30 GMT | 2024-01-08 09:30 GMT | ARCHER2 work3 (fs3) file system | Some users may see slower than normal file system response | HPE engineers have detected slow response time from the work3 file system and are investigating |
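
The two Darshan alerts above (2024-03-27 and 2024-01-30) note that users can opt out of IO profiling by removing the module before `srun` is called. A minimal sketch of how that looks in a job submission script, assuming hypothetical values for the job name, budget code, node count, and executable:

```bash
#!/bin/bash
#SBATCH --job-name=example        # hypothetical job name
#SBATCH --nodes=2
#SBATCH --time=01:00:00
#SBATCH --account=t01             # hypothetical budget code
#SBATCH --partition=standard
#SBATCH --qos=standard

# Opt out of the Darshan IO profiling described in the alerts above
module remove darshan

srun ./my_app                     # hypothetical executable
```
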
2023
Status | Type | Start | End | Scope | User Impact | Reason |
---|---|---|---|---|---|---|
Resolved | Service Alert | 2023-12-12 12:00 | 2023-12-21 12:00 | ARCHER2 file system | Some users report experiencing slower than normal file system response | We are investigating and monitoring the issue. |
Resolved | Service Alert | 2023-10-19 07:45 | 2023-10-19 10:30 | 3 Cabinets of Compute Nodes | There was a power incident which caused a power outage to three ARCHER2 cabinets. Power has been restored to the cabinets but some user jobs will have failed. Jobs should not have been charged. | Severe weather in the local area |
Resolved | Service Alert | 2023-12-06 09:00 | 2023-12-06 10:00 | ARCHER2 login nodes | Users will now connect using an SSH key and passphrase and a time-based one-time password | Enhanced security |
Resolved | Service Alert | 2023-11-22 10:00 | 2023-11-22 10:10 | /work Lustre file systems | The change should take minutes and should not impact users | Change in Lustre configuration |
Resolved | Service Alert | 2023-11-06 09:20 | 2023-11-06 11:40 | Compute nodes | All running jobs have failed and no new jobs can start on the compute nodes. | An external power event has caused all ARCHER2 compute nodes to be unavailable |
Resolved | Service Alert | 2023-10-30 10:00 | 2023-11-01 10:00 | Slurm | Users may see short interruptions to Slurm functionality (e.g. `sbatch`, `squeue` commands) | Slurm software is being updated on the system |
Resolved | Service Alert | 2023-09-28 17:00 | 2023-09-29 16:00 | ARCHER2 rolling reboot | We are currently using a rolling-reboot to update some of the nodes on ARCHER2. Whilst this is ongoing, existing running work will continue but some new work will not be started. Serial work is unaffected. | Updates to some ARCHER2 nodes |
Resolved | Service Alert | 2023-09-01 09:00 | 2023-10-05 | ARCHER2 work (Lustre) file systems | Users may see errors such as "input/output" errors when accessing data on Lustre file systems | A patch was installed to address the known Lustre bug which may have caused these issues |
Resolved | Service Alert | 2023-08-10 09:00 | 2023-08-10 12:35 | Work file system 3 (fs3) | Commands on fs3 work file system will hang or become unresponsive | No new jobs will start until HPE have resolved the issue. |
Resolved | Service Alert | 2023-06-15 15:00 | 2023-07-15 22:00 | ARCHER2 compute nodes | Jobs may fail intermittently with memory issues (e.g. OOM, hanging with no output, segfault) | A kernel memory leak affected compute nodes with reduced memory being available on nodes over time. HPE have issued a patched kernel for the issue which has been tested by the CSE team and has demonstrated clear improvements in the memory usage. The patch has been applied to all compute nodes. |
Resolved | Service Alert | 2023-06-13 14:45 | 2023-06-14 14:40 | VASP5 module out of memory errors | The CSE team is aware that the VASP5 module is returning "Out of memory" errors with some kinds of simulations. | Module has been rebuilt and issue is resolved. |
Resolved | Service Alert | 2023-06-12 12:55 | 2023-06-12 17:30 | Work file system 3 (fs3) | Commands on fs3 work file system will hang or become unresponsive | Issue was identified as a user job and also an issue with an OST (Object Storage Target). OST was switched out and user asked to remove jobs for further investigation. An issue with the fs3 Lustre file system. Other file systems are working as normal. |
Resolved | Service Alert | 2023-04-18 01:30 | 2023-04-18 12:52 | Login nodes, compute nodes, Lustre file systems | No access to ARCHER2, jobs running at the time of the issue will have failed, no new jobs allowed to start | We are investigating an issue with the Slingshot interconnect on ARCHER2 which has caused compute nodes and Lustre file systems to lose connectivity |
Resolved | Service Alert | 2023-04-06 11:45 | 2023-04-06 16:45 | Slurm scheduler | Users may see issues submitting jobs to Slurm, with the behaviour of jobs and with issuing Slurm commands | We are investigating an issue with the Slurm scheduler |
Resolved | Service Alert | 2023-03-28 09:45 | 2023-03-28 19:25 | Login nodes. Compute nodes | Update 19:25 BST 28 Mar 2023: compute nodes have been returned to service and the reservation has been removed so jobs will now run (30 nodes missing and will hopefully be returned to service tomorrow morning). Update 18:45 BST 28 Mar 2023: login access is available and compute nodes are in the process of being brought back into service; jobs can be submitted and will start once the compute nodes are available. Original impact: new login sessions have been blocked, existing login sessions may become unresponsive, all new jobs on the compute nodes have been prevented from starting, and current running work may fail or run slowly. | We are investigating issues with instability on the ARCHER2 backend cluster |
Resolved | Service alert | 2023-02-14 10:15 GMT | 2023-02-14 13:30 GMT | /work file system (fs2), possible intermittent issues on login/compute nodes | Projects with directories on the fs2 /work file system will not be allowed to run jobs (as the resources may just be wasted) and may see that some data is inaccessible. Users may see occasional issues on login/compute nodes as they try to access the file system. You can check which file system your work directory is on by navigating to the location and using the command `readlink -f .` (see the example after this table). | Heavy I/O caused by a user's jobs. CSE team will work with the user. |
Resolved | Partial | 2023-02-07 18:40 | 2023-02-08 | Four ARCHER2 cabinets currently unavailable. The remainder of ARCHER2 is continuing to run as normal. | All jobs running on the affected cabinets will have failed. These should not have been charged. | Four ARCHER2 cabinets experienced a power interruption. Cabinets returned to service. |
Resolved | Service alert | 2023-01-23 11:30 GMT | 2023-01-23 15:15 GMT | /work file system (fs3), possible intermittent issues on login/compute nodes | Projects with directories on the fs3 /work file system will not be allowed to run jobs (as the resources may just be wasted) and may see that some data is inaccessible. Users may see occasional issues on login/compute nodes as they try to access the failed OSS. You can check which file system your work directory is on by navigating to the location and using the command `readlink -f .`. | Issue is now resolved following a Lustre OSS failure in the fs3 /work file system |
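
Several alerts above advise checking which work file system a directory is on by navigating to it and running `readlink -f .`. A quick illustration with hypothetical project and user names; the resolved path contains the file system identifier:

```bash
# Move to your work directory (hypothetical project "t01" and username)
cd /work/t01/t01/username

# Resolve the real path of the current directory
readlink -f .
# Output containing the file system name (e.g. "fs2") shows which
# /work file system the directory lives on
```
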
2022
Status | Type | Start | End | Scope | User Impact | Reason |
---|---|---|---|---|---|---|
Resolved | Partial | 2022-12-14 09:40 | 2022-12-14 15:30 | Around 2000 nodes are unavailable. | User jobs may take longer to run. Users can still connect to ARCHER2 and submit jobs. | Full service is now available following a power and possible network issue. |
Resolved | Service change | 2022-12-12 09:45 GMT | 2023-12-12 09:45 GMT | Compute nodes | Users may see changes in application performance | The default CPU frequency for parallel jobs started using `srun` has been changed to 2.0 GHz to improve the energy efficiency of ARCHER2. We recommend that users test the energy efficiency of their applications and set the CPU frequency appropriately (see the example after this table). |
Resolved | Service alert | 2022-12-06 09:00 GMT | 2022-12-06 12:00 GMT | /work file system (fs3) | Work to integrate the additional ARCHER2 Lustre file system may result in disruption to one of the existing work file systems: fs3. This will impact any projects who have work storage on fs3. If the work caused any impact to running jobs please contact the service desk with a note of the affected job IDs. | |
Resolved | Service alert | 2022-11-14 09:00 GMT | 2022-12-02 09:00 GMT | ARCHER2 Compute Nodes | Scottish Power Energy Networks (SPEN) have announced a three week at-risk period for the power input to the ACF building. We are hopeful that there will be no user impact but want to share the alert with users. If there is a power blip, we anticipate there will not be any impact to the login nodes but expect there may be an interruption to the compute nodes. Further details will be provided if there are any issues. | |
Resolved | Service alert | 2022-11-13 10:00 GMT | 2022-11-14 09:00 GMT | SAFE website | Users will get a security warning when trying to access SAFE website; some web browsers (e.g. Chrome) will not connect to SAFE website; ARCHER2 load plot on status page will not work | The website certificate has expired |
Resolved | Service alert | 2022-10-24 12:00 | 2022-10-24 18:00 | Administration nodes | Users may experience brief interruptions to node response and delays in Slurm commands | We are updating the system to resolve a security vulnerability |
Resolved | Issue | 2022-10-05 09:00 | 2022-10-17 13:15 | Slingshot interconnect | Some jobs may fail with MPI errors (e.g. "No route to host", "pmi_init failure"), the larger the number of nodes used in the job, the more likely users are to hit the issue. A reboot of Slingshot is planned for Mon 17 Oct 2022 - see maintenance table below for details. | A number of active optical cable links between groups in the interconnect topology are not working |
Resolved | Unplanned | 2022-10-11 12:30 | 2022-10-11 13:00 | DNS issues across the University of Edinburgh (UoE) | Users were unable to connect to ARCHER2 or SAFE. | UoE DNS and network issue. |
Resolved | Partial | 2022-09-21 09:40 | 2022-09-21 11:15 | An issue affected the high-speed network for some nodes. | Communications for some nodes may have been affected. | HPE have reset the affected switches and the issue is now fixed. |
Resolved | At-Risk | 2022-09-12 09:00 | 2022-10-07 17:00 | Groups of four cabinets will be removed from service. | Reduced number of compute nodes available to users | HPE are carrying out essential work on the ARCHER2 system |
Resolved | Partial | 2022-08-31 09:00 | 2022-08-31 11:00 | No access to login nodes, data and SAFE. Running jobs will not be affected and new jobs will start. | Up to 2 hours loss of connection to ARCHER2 login nodes, data and SAFE access | Essential updates to the network configuration. Users will be notified when full service is resumed. |
Resolved | At-risk | 2022-08-24 09:00 | 2022-08-24 13:00 | This maintenance session has now been downgraded to an 'at-risk' session. We do not expect this work to have any impact on user service. | Users will be able to connect to ARCHER2, access data, submit and run jobs. | Essential updates to the network configuration. |
Resolved | Service Alert | 2022-08-11 12:30 | 2022-08-12 11:30 | Compute nodes | Increased queue times and reduced node availability | All nodes now returned to service. Due to high temperatures in the Edinburgh area and to ease the load on the cooling system some nodes were removed from service. Users could connect to ARCHER2, access data and submit jobs to the batch system. |
Resolved | Service Alert | 2022-07-19 14:00 | 2022-07-19 18:30 | Full service | Login access not available, no jobs allowed to start, running jobs will have failed | An internal DNS failure on the system has stopped components communicating |
Resolved | Service Alert | 2022-07-18 14:00 | 2022-07-20 10:00 | Compute nodes | Increased queue times and reduced node availability | Reduced number of compute nodes available as high temperatures in the Edinburgh area are creating cooling issues. Around 4700 compute nodes are currently available and we are continuing to monitor. Full service was restored at 10:00 on 20th July. |
Resolved | Issue | 2022-07-14 22:00 | 2022-07-15 13:50 | Full ARCHER2 system | No access to system | A significant power outage affecting large areas of the Lothians caused ARCHER2 to shut down around 10pm on 14th July. Our team were on site and worked to restore service as soon as possible. |
Resolved | At-Risk | 2022-07-06 09:00 | 2022-07-06 16:00 | /home filesystem and ARCHER2 nodes | No user impact expected | Test new configuration for /home filesystem for connection to PUMA |
Resolved | At-Risk | 2022-06-29 10:00 | 2022-06-29 12:00 | Four cabinets will be removed from service and then returned to service | Reduced number of compute nodes available to users | Successfully tested the phased approach for future planned work |
Resolved | Service Alert | 2022-06-15 20:30 | 2022-06-16 16:30 | Compute nodes in cabinets 21-23 | 734 compute nodes were unavailable to users | Power issue with cabinets 21-23 |
Resolved | At-Risk | 2022-06-09 09:00 | 2022-09-22 16:00 | Compute nodes | No user impact is expected as there is redundancy built into the system | Essential electrical work which will include the cables which feed the Power Distribution Units (PDUs) on ARCHER2 |
Resolved | At-Risk | 2022-06-08 09:00 | 2022-06-13 16:00 | Installation of new Programming Environment (22.04) | No user impact is expected. The new PE will not be the default module so users will need to load it if they wish to use the latest version. | New updated PE is available with several improvements and bug fixes. |
Resolved | Issue | 2022-05-25 11:30 | 2022-05-25 12:15 | Issue with Slurm accounting DB. | Some users are seeing issues with submitting jobs or using sacct commands to query accounting records. | HPE systems have resolved this issue. |
Resolved | Issue | 2022-05-24 14:40 | 2022-05-24 20:40 | Login nodes, compute nodes, running jobs | Users may experience issues with interactive access on login nodes. Running jobs may have failed. No new jobs will be allowed to start. | Investigations on the underlying issue are ongoing. |
Resolved | Issue | 2022-05-19 09:00 | 2022-05-20 16:00 | A hardware issue with one of the ARCHER2 worker nodes has caused a number of the compute nodes to go into 'completing' mode. These compute nodes require a reboot. | Some compute nodes are unavailable as reboots are performed. Some user jobs will not complete fully and others may fail. These jobs should not be charged but please contact support@archer2.ac.uk if you want to confirm whether a refund is required. | Hardware issue with a worker node. |
Resolved | Issue | 2022-05-09 09:00 | 2022-05-23 12:00 | The Slurm batch scheduler will be upgraded. In total the work will take around a week but user impact is expected to be limited to 6 hours when users will not be able to submit new jobs and new jobs will not start | Wednesday 11th May 10:00-17:15 (users notified when work was completed) and Monday 23rd May 10:00-12:00. Running jobs will not be impacted but users will not be able to submit new jobs and new jobs will not start. | Updating the Slurm software. |
Resolved | Issue | 2022-04-19 10:00 | 2022-07-19 10:00 | Work file systems | Users may see issues with slow disk I/O and slow response on login nodes. | There is a heavy load on the /work filesystems. HPE are investigating the cause and we will update as we get more information. |
Resolved | Issue | 2022-03-30 10:00 | 2022-03-30 11:00 | The Slurm batch scheduler will be updated with a new fair share policy. | Running jobs will not be impacted but users will not be able to submit new jobs. If users are impacted, they should wait and then resubmit the job once the work has completed. | Updating the Slurm configuration. |
Resolved | Issue | 2022-03-30 11:00 | 2022-03-30 14:20 | Slurm scheduler | Users may see issues with submitting jobs: `Batch job submission failed: Requested node configuration is not available` and `srun: error: Unable to create step for job 1351277: More processors requested than permitted` which will see jobs not submit or fail at runtime. | Part of this morning's update has been rolled back and we believe this issue is thus now resolved. |
Resolved | Issue | 2022-03-24 15:00:00 +0000 | 2022-03-24 15:45:00 +0000 | 67 nodes within cabinets 20-23 | Reduced number of compute nodes being available which may cause longer queue times | An issue has occurred while maintenance work was taking place on cabinets 20-23 |
Resolved | Issue | 2022-03-02 10:00:00 +0000 | 2022-03-02 10:15:00 +0000 | The DNS Server will be moved to a new server. | The outgoing connections from ARCHER2 may be affected for up to five minutes. | Updating of the DNS Server network. |
Resolved | Issue | 2022-02-22 09:30:00 +0000 | 2022-02-23 10:00:00 +0000 | One ARCHER2 cabinet removed from service. | A cabinet of compute nodes is unavailable. | A replacement PDU is required for an ARCHER2 cabinet. Once replaced, the cabinet will be returned to service. |
Resolved | Downtime | 2022-02-21 15:00:00 +0000 | 2022-02-22 09:30:00 +0000 | Work (/work) Lustre file systems. Compute nodes. | Access to ARCHER2 has been disabled so users will not be able to log on. Running jobs may have failed. No new user jobs will be allowed to start. | Work file systems performance issues. A large number of compute nodes are unavailable. |
Resolved | Issue | 2022-02-04 12:05:00 +0000 | 2022-02-04 12:34:00 +0000 | Login nodes | Home file system unavailable, which consequently prevented or significantly slowed logins | Network connectivity between the frontend and HomeFS was lost |
Resolved | Issue | 2022-02-03 14:30:00 +0000 | 2022-02-03 15:00:00 +0000 | Login nodes | Users may be unable to connect to the login nodes | Planned reboot of switch caused loss of connection to login nodes |
Resolved | Downtime | 2022-02-02 11:30:00 +0000 | 2022-02-02 12:05:00 +0000 | Login issue | Temporary issue with users unable to connect to login nodes | Short outage on the LDAP authentication server while maintenance work took place |
Resolved | Issue | 2022-01-27 14:30:00 +0000 | 2022-01-27 15:50:00 +0000 | Login and data analysis nodes | Loss of connection to ARCHER2 for any logged in users, an inability to login for users that were not connected, and jobs running on the data analysis nodes ("serial") partition failed. | Login and data analysis nodes needed to be reconfigured |
Resolved | Issue | 2022-01-24 17:30:00 +0000 | 2022-01-25 14:30:00 +0000 | RDFaaS | Users cannot access their data in /epsrc or /general on the RDFaaS from ARCHER2 | Systems team are investigating intermittent loss of connection between ARCHER2 and RDFaaS |
Resolved | Issue | 2022-01-20 11:40:00 +0000 | 2022-01-20 15:20:00 +0000 | Login nodes | Login node 4 was the only login node available - users could use login4.archer2.ac.uk directly to access the system | Systems team are investigating |
Resolved | Notice | 2022-01-10 12:00:00 +0000 | 2022-01-10 12:00:00 +0000 | 4-cabinet system | Users can no longer use the ARCHER2 4-cabinet system or access /work data on the 4-cabinet system | The ARCHER2 4-cabinet system was removed from service as planned |
Resolved | Issue | 2022-01-09 10:00:00 +0000 | 2022-01-10 10:30:00 +0000 | Login and Data Analysis nodes, SAFE | Outgoing network access from ARCHER2 systems to external sites was not working. SAFE response was slow or degraded. | DNS issue at datacentre |
Resolved | Issue | 2021-12-31 10:00:00 +0000 | 2022-01-10 10:00:00 +0000 | Compute nodes | Running jobs on down nodes failed, reduced number of compute nodes available | A number of compute nodes were unavailable due to a hardware issue |
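
The 2022-12-12 service change above lowers the default CPU frequency for `srun`-launched jobs to 2.0 GHz and recommends setting the frequency explicitly after testing energy efficiency. A minimal sketch using standard Slurm frequency controls; the 2.25 GHz value and executable name are examples, not recommendations from the alert:

```bash
# Request a specific CPU frequency (in kHz) for a single job step
srun --cpu-freq=2250000 ./my_app

# Or set it once for all srun calls in a job script via Slurm's
# input environment variable
export SLURM_CPU_FREQ_REQ=2250000
srun ./my_app
```
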
2021
Status | Type | Start | End | Scope | User Impact | Reason |
---|---|---|---|---|---|---|
Resolved | Issue | 2021-12-20 13:00:00 +0000 | 2021-12-20 14:12:00 +0000 | 4-cabinet service Slurm scheduler | Response from Slurm scheduler was degraded, running jobs unaffected | Slurm scheduler was experiencing issues |
Resolved | Issue | 2021-12-16 04:05:00 +0000 | 2021-12-16 09:41:00 +0000 | Slurm scheduler unavailable, outgoing connections failed | Slurm commands did not work, outgoing connections did not work, running jobs continued without issue | Spine switch became unresponsive |
Resolved | Issue | 2021-11-14 08:00:00 +0000 | 2021-11-15 09:40:00 +0000 | 4-cabinet system monitoring | Load chart on website missing some historical data | A login node issue caused data to not be collected |
Resolved | Issue | 2021-11-05 08:00:00 +0000 | 2021-11-05 11:30:00 +0000 | 4-cabinet system compute nodes | A large number of compute nodes were unavailable for jobs | A power incident in the Edinburgh area caused a number of cabinets to lose power |
Resolved | Issue | 2021-11-03 14:45:00 +0000 | 2021-11-03 14:45:00 +0000 | 4-cabinet system login access | The login-4c.archer2.ac.uk address is unreachable so logins via this address will fail; users can use the address 193.62.216.1 instead (see the example after this table) | A short network outage at the University of Edinburgh caused issues with resolving the ARCHER2 login host names |
Resolved | Issue | 2021-11-01 09:30:00 +0000 | 2021-11-01 11:30:00 +0000 | 4-cabinet system compute nodes | There were 285 nodes down so queue times may be longer | A user job hit a known bug and brought down 256 compute nodes |
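
For name-resolution incidents like the 2021-11-03 entry above, logging in by IP address bypasses the failing hostname lookup. A one-line example with a hypothetical key path and username:

```bash
# Connect directly by IP when login-4c.archer2.ac.uk does not resolve
ssh -i ~/.ssh/archer2_key username@193.62.216.1
```
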