The server has been stable and serving sites without issue since 9:10 AM Pacific. The new replacement disks appear to be working well.
We took the disks offline because of the significant risk of running the cluster on a single well-performing drive. To prevent data corruption and possible downtime during peak daytime hours, we scheduled the work for an off-peak, low-traffic period.
The cloning took approximately 1 hour longer than expected because the fourth drive refused to clone in a timely fashion. At approximately 8:40 AM Pacific, we made the decision to restart the cluster with the 3 good disks.
The fourth disk will be replaced during another off-peak period; that will only require a server reboot, which means less than 2 minutes of downtime.
We will list that as maintenance when it's scheduled.
We are now testing each site to ensure it is loading without issue. More details to follow.
The cluster is now responding on the 3 drives. Sites are loading. More details to follow.
The data centre is going to reboot the cluster with the 3 cloned drives. The 4th drive was refusing to clone in a timely fashion. We will post an update here once we have more information to pass along.
The last drive clone is currently stuck around 45%. A new ETA will be posted once we get an update from the data centre.
The last drive is now being cloned. The disks should be back online within 1 hour.
Over the last 24 hours, we have noticed higher-than-normal I/O wait times from the disks on the vancouver30 cluster. We use RAID for redundancy, but our testing shows that 3 of the 4 drives are likely to fail. We have never seen this before, so we are treating it as an emergency. We will be taking all of the drives offline to be cloned to brand-new drives. This will cause downtime, but it is required to prevent data corruption. We expect that downtime to be 2-3 hours; based on that, we hope to have the new disks back online by 8 AM ET.
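For readers curious what "higher than normal I/O wait" means in practice: on Linux, the kernel tracks how much CPU time is spent idle while waiting on disk I/O. The sketch below is purely illustrative (it is not our monitoring stack) and assumes a Linux host with /proc/stat available; it samples the aggregate CPU counters twice and reports the share of time spent in iowait. A persistently high share is a common early symptom of struggling drives.

```python
import time

def cpu_fields():
    # First line of /proc/stat looks like:
    # "cpu  user nice system idle iowait irq softirq ..."
    with open("/proc/stat") as f:
        parts = f.readline().split()
    return [int(x) for x in parts[1:]]

def iowait_share(interval=1.0):
    # Sample twice and compute the fraction of elapsed CPU time
    # spent in iowait (field index 4 of the "cpu" line).
    before = cpu_fields()
    time.sleep(interval)
    after = cpu_fields()
    delta = [b - a for a, b in zip(before, after)]
    total = sum(delta)
    return delta[4] / total if total else 0.0

if __name__ == "__main__":
    print(f"iowait share over 1s: {iowait_share():.1%}")
```

On a healthy cluster this share stays in the low single digits; sustained double-digit values across multiple samples are what prompted the deeper drive testing described above.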
Clients who use our DNS will have their sites served from a failover server during the main cluster's emergency maintenance.