An accidental wipe command brings down a critical production database server
Feb 17, 2020
A Korea-based managed service provider was making configuration changes to its client’s NetApp system when an engineer mistakenly ran a ‘dd’ command against some LUNs, wiping data that belonged to the end user’s production Sybase server.
Without access to the data, the managed service provider risked losing its contract with the client and faced potential liability costs.
The client had a NetApp FAS8060 system containing 161 x 900GB SAS HDDs, arranged into two separate aggregates (68 drives + 93 drives). The customer was presenting 3 x 468GB FC LUNs from each aggregate out to a Sybase server. The 6 total LUNs were combined into a single Disk Pool, with three logical volumes carved out of the Pool. An incorrect ‘dd’ command had written zeroes to approximately 45GB of one of the logical volumes, and this volume was no longer visible to the Sybase server.
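To illustrate the kind of damage described above, the following is a minimal Python sketch that scans a raw image file and reports the byte ranges that contain only zeroes. It is not the tooling used in this recovery. The function name `zeroed_extents`, the 1 MiB scan granularity, and the idea of working on a flat image file are all assumptions made for the example.

```python
BLOCK_SIZE = 1024 * 1024  # 1 MiB scan granularity (assumed, not from the case study)


def zeroed_extents(path, block_size=BLOCK_SIZE):
    """Return (offset, length) byte ranges made up entirely of zero-filled blocks.

    Detection works at block granularity: a block counts as zeroed only if
    every byte in it is zero, so partially overwritten blocks at the edges
    of a damaged region are not reported.
    """
    extents = []
    run_start = None  # offset where the current run of zero blocks began
    offset = 0
    with open(path, "rb") as image:
        while True:
            block = image.read(block_size)
            if not block:
                break
            if block == bytes(len(block)):  # bytes(n) is n zero bytes
                if run_start is None:
                    run_start = offset
            elif run_start is not None:
                # A non-zero block ends the current run.
                extents.append((run_start, offset - run_start))
                run_start = None
            offset += len(block)
    if run_start is not None:
        # The file ended while still inside a run of zero blocks.
        extents.append((run_start, offset - run_start))
    return extents
```

Calling `zeroed_extents("lun.img")` on a hypothetical image file returns a list such as `[(offset, length), ...]`. A zero-filled range is not proof of damage on its own, because unallocated space can legitimately contain zeroes, so a result like this would only be a starting point for closer inspection.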
During the original consultation, our engineer instructed the customer to take the aggregates offline to avoid any further overwrite damage. The aggregates were taken offline within 12 hours of the original data loss event. The client presented all 161 HDDs from both aggregates to a single Windows machine and connected it to Ontrack’s RDR (Remote Data Recovery) server. Initial inspection showed that both aggregates were named “aggr0,” which prevented our engineers from rebuilding them automatically. The drives were instead sorted into aggregate groups, and the aggregates were rebuilt manually. Our engineers were able to rebuild each aggregate to a point in time as close as possible to, but before, the ‘dd’ damage, with the two aggregates rebuilt to points in time within two minutes of each other.
Because the Sybase server used the logical volumes as raw storage, our engineer could not extract or examine the data inside them. All six LUNs were therefore extracted as flat files to external storage, and NetApp support helped present these LUNs back to the Sybase server. The recovered logical volumes passed integrity checks on the Sybase server, and the client confirmed that everything was working properly. The end user’s database server was brought back online within a few days of the failure with no loss of data.