Case study: dd command causes vital data to disappear
Over the years, the workplace has evolved in many areas, none more so than the IT environment. Today, many companies have had to implement highly complex systems to keep up with advances in technology. Administrators and employees must now be familiar with a variety of commands across a variety of systems, so it is no surprise that mistakes are made. Most of these mistakes are minor, but unfortunately, some fall into the major bracket. A prime example is the case of a Korean managed service provider in May 2018, where incorrect use of the dd command caused vital data to disappear.
So what happened?
Many of you will have heard of the infamous UNIX ‘dd’ command. For those who have not, the command’s primary purpose is to convert and copy files.
The main issue with ‘dd’ is that, given the wrong parameters, the same command can also overwrite files, partitions, or even entire hard disks. This can happen if the user swaps the parameters ‘if’ (input file) and ‘of’ (output file), with results that live up to the nicknames given to the abbreviation ‘dd’: it is often joked to stand for ‘destroy disk’ or ‘delete data’ rather than the original ‘duplicate data’.
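The mix-up described above can be sketched safely with ordinary files standing in for the disk and the backup image (no real devices are touched; the device names in the comments are hypothetical):

```shell
# Safe demonstration with ordinary files. In the real-world mistake,
# SRC would be a disk such as /dev/sda1 and DST a backup image --
# swapping them overwrites the disk instead of backing it up.
SRC="$(mktemp)"; DST="$(mktemp)"
printf 'live data' > "$SRC"

# Correct direction: 'if=' names the source, 'of=' the destination.
dd if="$SRC" of="$DST" bs=1M 2>/dev/null

# Swapped direction would copy the (empty or stale) destination back
# over the source -- on a real disk, the destructive mistake:
# dd if="$DST" of="$SRC" bs=1M    # reversed: overwrites the source!

cat "$DST"   # -> live data
```

Because dd gives no warning and asks for no confirmation, the reversed form runs just as quietly as the correct one.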
In this case, the ‘dd’ command was executed incorrectly while the managed service provider was changing the configuration settings of a client’s NetApp system.
What was the problem?
As a result of the incorrect ‘dd’ command, the provider faced a critical situation: the command had deleted important data required by one of its client’s connected Sybase production systems.
The customer’s NetApp system had a total of 161 SAS hard disks of 900 GB each, arranged in two separate aggregates (68 and 93 disks respectively). From each of the two aggregates, three 468 GB LUNs were made available to the Sybase server. All six LUNs were combined into a single data pool, from which three logical volumes were carved out and presented to Sybase. After the ‘dd’ command, one of the logical volumes, holding about 45 GB of data, was "zeroed" and no longer addressable or available to the Sybase server.
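One quick way to confirm that a volume really has been "zeroed", as happened here, is to compare its contents against /dev/zero. A minimal sketch, with a temporary file standing in for the affected volume device:

```shell
# Create a stand-in "volume" that has been overwritten with zeros,
# as the affected logical volume was (1 MiB for the demonstration).
VOL="$(mktemp)"
dd if=/dev/zero of="$VOL" bs=1024 count=1024 2>/dev/null

# Compare the first MiB against /dev/zero; cmp exits 0 if identical.
if cmp -s -n 1048576 "$VOL" /dev/zero; then
    echo "volume reads back as all zeros"
fi
```

An all-zero readback like this is what makes the data unaddressable to the application, even though the underlying blocks may still be recoverable.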
What did we do?
Unwilling to risk losing a valued customer and aware of potential liability, the Korean provider turned to us for help. Upon hearing the situation, our engineering team advised the customer to shut down the system to prevent further data movement.
A total of 12 hours passed before the customer was finally able to shut down the system and prevent any additional data loss. To ensure no more time was lost, the Korean provider selected our remote data recovery solution. The customer attached the 161 hard disks from both aggregates to a Windows computer, which was then connected over the Internet to our Remote Data Recovery (RDR) server via its secure client software. The RDR option gave them the best mix of speed, security, and flexibility for addressing the urgency and sensitive nature of the situation with their client’s data.
Our data recovery process
Although our research and development team is constantly developing new proprietary tools for unique data recovery projects, our team quickly determined that the two aggregates could not be rebuilt "automatically" by anything we had already developed. Additional engineers would therefore be needed to carry out a manual recovery.
The drives were sorted according to the two aggregate groups. Our engineers were able to rebuild the units and reconstruct the two aggregates as close as possible to the original point at which the data was lost.
Since the original data on the logical volumes had been presented to the Sybase system as raw storage, our engineers could not verify individual files at the end of the recovery. All six LUNs were therefore extracted as flat files onto an external storage disk and made available to the customer.
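Extracting a LUN as a flat file is itself a plain block copy. A minimal sketch of the idea, with an ordinary file standing in for the reconstructed LUN (the real source would be a block device and the destination a file on external storage):

```shell
# Stand-in for the reconstructed LUN; in practice this would be a
# block device, and IMG would live on the external storage disk.
LUN="$(mktemp)"
printf 'raw sybase blocks' > "$LUN"
IMG="$(mktemp)"

# conv=noerror,sync keeps the copy going past read errors, padding
# unreadable blocks with zeros so offsets in the image stay aligned
# with the source device.
dd if="$LUN" of="$IMG" bs=4M conv=noerror,sync 2>/dev/null
```

Note that conv=sync pads each input block to the block size, so the resulting image is block-aligned rather than byte-exact in length; that alignment is what lets the consuming system address the raw storage at its original offsets.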
To re-establish the six LUNs on the NetApp system and reconnect them to the Sybase server, NetApp support was consulted. The recovered logical volumes passed the integrity checks on the Sybase server, and the client confirmed that everything was working as expected. The end customer's database server was back online only a few days after the failure, without any data loss!