2024
Status | Type | Start | End | Quarter | Scope | User Impact | Reason |
---|---|---|---|---|---|---|---|
Completed | Full | 2024-05-08 09:00 | 2024-05-08 21:00 | Q2 2024 | Full ARCHER2 System | Users will not be able to connect to the login nodes, jobs will not run and users will be unable to access data during this maintenance | Replacement of operating system certificates |
Completed | Slurm | 2024-05-01 09:00 | 2024-05-01 10:35 | Q2 2024 | Slurm maintenance | Running jobs will continue to run, but Slurm commands will be unavailable for a few minutes when the controller restarts. | Required maintenance |
Completed | Partial | 2024-03-07 09:00 | 2024-03-07 12:30 | Q1 2024 | RDFaaS /epsrc and /general file systems | Users will not be able to access data on /epsrc and /general during this maintenance | Replacement of Power Supply Unit (PSU) on the RDFaaS (E1000) |
Completed | Partial | 2024-02-07 09:00 | 2024-02-12 14:30 | Q1 2024 | RDFaaS /epsrc and /general file systems | Users will not be able to access data on /epsrc and /general during this maintenance | Updating the software on the RDFaaS (E1000) |
Completed | Partial | 2024-01-09 12:00 GMT | 2024-01-18 12:00 GMT | Q1 2024 | ARCHER2 | Users will be able to connect to ARCHER2 and access their data. Jobs will run but there will be several periods when users will be unable to submit jobs and new user jobs will not start. If you experience issues, please wait a few minutes and then try to submit the job again. | Integrating the GPU nodes into ARCHER2 |
Completed | Full | 2024-01-08 09:00 | 2024-01-09 12:00 | Q1 2024 | ARCHER2 | Users unable to connect to ARCHER2, existing queued jobs able to run from 20:00 GMT on 8 Jan 2024, users will not have access to data on ARCHER2. | Integrating the GPU nodes into ARCHER2 |
2023
Status | Type | Start | End | Quarter | Scope | User Impact | Reason |
---|---|---|---|---|---|---|---|
Completed | Partial | 18 September 2023 09:00 | 22 September 2023 11:55 | Q3 2023 | ARCHER2 | No login access. No access to any data on the system. Jobs will continue to run, and queued jobs will be started as usual. Serial QoS will not be available. The SAFE will be available during the outage, but functions that need the connection to ARCHER2, such as password resets and new account creation, will be affected. | Upgrade of network |
Completed | Partial | 23 August 2023 10:00 | 23 August 2023 10:50 | Q3 2023 | ARCHER2 | ARCHER2 users unable to submit new jobs | For a few minutes, users will be unable to submit jobs whilst a rollout of a Slurm configuration change takes place. This change will provide increased resilience and issue monitoring. |
Completed | Full | 19 May 2023 14:00 | 12 June 2023 12:00 | Q2 2023 | ARCHER2 | ARCHER2 unavailable to users | Major software upgrade of ARCHER2. Full details in the ARCHER2 documentation |
2022
Status | Type | Start | End | Quarter | Scope | User Impact | Reason |
---|---|---|---|---|---|---|---|
Emergency | Full | 2022-10-17 09:00 | 2022-10-17 13:15 | 2022_q4 | ARCHER2 | ARCHER2 unavailable to users | Slingshot interconnect reboot to allow the return of failed links which are causing job failures |
Not Required | Full | 2022-03-30 | 2022-03-30 | 2022_q1 | Scheduled maintenance | Not Required | Not Required |
Completed: RFC0093 | Partial : Login and Serial Nodes | 2022-01-26 10:00 | 2022-01-26 12:20 | 2022_q1 | ARCHER2 Login and Serial Nodes | Users will be unable to connect to ARCHER2 and no jobs will run on the serial nodes | To attach the ARCHER2 /home filesystem to a new network at the Advanced Computing Facility data centre |
2021
Status | Type | Start | End | Quarter | Scope | User Impact | Reason |
---|---|---|---|---|---|---|---|
Completed | Full | 2021-10-26 09:00 | 2021-10-26 17:00 | 2021_q4 | ARCHER2 4Cabinet | Users will be unable to run jobs and the /work filesystem will not be available | Reboot of the High Speed Network (HSN). The River (support) rack was moved to a new protected power supply. The 4cabinet filesystem move to a protected power supply will be completed next week when additional power supplies are available. |
Completed | Partial: RDFaaS | 2021-10-21 09:00 | 2021-10-21 17:00 | 2021_q4 | RDFaaS: /epsrc and /general | Users will be unable to access their files on the RDFaaS i.e. the /epsrc and /general filesystems. | Software upgrade of E1000 System which hosts the RDFaaS |
Completed | Partial: RDFaaS | 2021-10-18 09:00 | 2021-10-18 17:00 | 2021_q4 | RDFaaS: /epsrc and /general | Users will be unable to access their files on the RDFaaS i.e. /epsrc and /general filesystems. | Upgrade and reconfiguration of high speed switches |
Completed | Partial: Compute Nodes | 2021-10-01 09:52 | 2021-10-01 15:44 | 2021_q4 | ARCHER2 4Cabinet: Compute Nodes | Users will be able to connect to User Access Nodes and will be able to submit jobs to the compute nodes. The queued jobs will start once the compute nodes are returned to service. | A power issue at a substation local to the Advanced Computing Facility (ACF). |
Completed | Partial: Compute Nodes | 2021-09-30 08:30 | 2021-09-30 11:30 | 2021_q3 | ARCHER2 4Cabinet: Compute Nodes | Users will be able to connect to User Access Nodes and will be able to submit jobs to the compute nodes. The queued jobs will start once the compute nodes are returned to service. | A switch within the 4 Cabinet Service requires a reboot |
Completed | Full (took place within unplanned outage) | 2021-09-15 09:00 | 2021-09-15 16:00 | 2021_q3 | ARCHER2 4Cabinet: User Access and Compute Nodes | Users will not be able to connect. Jobs can be queued and will start once the service returns | Apply a fix for Singularity Issue. |
Completed | Unplanned Full | 2021-09-14 11:00 | 2021-09-15 16:00 | 2021_q3 | ARCHER2 4Cabinet: User Access and Compute Nodes | Users will not be able to connect. | Power Issues within the Edinburgh area |
Completed | At-risk | 2021-09-07 | 2021-09-09 | 2021_q3 | ARCHER2 4Cabinet: User Access and Compute Nodes | Momentary interruptions to connections to UANs | Allow our HPC Systems team to move the ARCHER2 4 cabinet system to a new Network at the Advanced Computing Facility (ACF) |
Completed | Full | 2021-08-25 14:00 | 2021-08-26 11:15 | 2021_q3 | ARCHER2 4Cabinet: User Access and Compute Nodes | No access to the UANs, and the queues will start to drain on the compute nodes from Monday 23rd August at 14:00 | Allow HPE Systems team to apply an essential security patch to the ARCHER2 4 cabinet system |
Completed | At-risk | 2021-08-25 | 2021-08-25 | 2021_q3 | ARCHER2 4Cabinet: User Access and Compute Nodes | Momentary loss of external network traffic to the User Access Nodes (UAN) on ARCHER2 | Allow our HPC Systems team to move the ARCHER2 4 cabinet system to a new Network at the Advanced Computing Facility (ACF) |
Completed | At-risk | 2021-08-18 10:00 | 2021-08-18 15:00 | 2021_q3 | ARCHER2 4Cabinet: User Access and Compute Nodes | Connection to the User Access Nodes (UAN) on ARCHER2 may be lost. File transfers may be affected. Jobs running on the compute nodes will not be impacted | Allow our HPC Systems team to move the ARCHER2 4 cabinet system to a new Network at the Advanced Computing Facility (ACF) |
Completed | Unplanned | 2021-07-28 12:00 | 2021-07-28 15:00 | 2021_q3 | ARCHER2 4Cabinet: Compute and file system | Running jobs failed, no jobs able to start, filesystem unavailable | Power issue at ACF |
Completed | Unplanned | 2021-07-20 | 2021-07-22 | 2021_q3 | ARCHER2 4Cabinet: User Access and Compute Nodes | Prevented new jobs from starting on the system to reduce the impact on users. Some running jobs may also have crashed as a result of this issue, but any jobs currently running should be unaffected | Issue with the interconnect on the ARCHER2 service that causes some new jobs to fail on MPI initialisation |
Completed | Full | 2021-06-29 06:00 | 2021-06-29 23:00 | 2021_q2 | ARCHER2 4Cabinet: Full system | System unavailable to users | Essential power work at the ACF |
Completed | Full | 2021-04-28 09:00 | 2021-04-28 11:00 | 2021_q2 | ARCHER2 4Cabinet: Full system | Users were able to access data and the User Access Nodes (UANs) throughout the maintenance session. Installing this PE required a reboot of the compute nodes. | A new version of the HPE Cray Programming Environment was installed to address memory leaks that were affecting a significant number of users and to help users prepare for the main ARCHER2 system. |
Completed | Full | 2021-03-18 09:00 | 2021-03-18 12:30 | 2021_q1 | ARCHER2 4Cabinet | System unavailable to users | High Speed Network (HSN) rebooted to allow the return of failed links which were causing job failures |
Completed | Full | 2021-02-07 21:15 | 2021-02-18 14:00 | 2021_q1 | ARCHER2 4Cabinet | System unavailable to users | Updated the system to software v1.3.3, which included patches for the critical ‘sudo’ vulnerability |
Completed | Unplanned | 2021-02-07 21:15 | 2021-02-09 09:30 | 2021_q1 | ARCHER2 4Cabinet | System unavailable to users | Power outage affecting SE Scotland |
Completed | Full | 2021-02-04 08:00 | 2021-02-04 14:00 | 2021_q1 | ARCHER2 4Cabinet | System unavailable to users | High Speed Network (HSN) rebooted to allow the return of failed links which were causing job failures |