VMware and NFS – Disconnected, Lost and Saved?

by David Logue Monday, March 11, 2013

Recently VMware® and NetApp® identified an issue that affects multiple customers. Customers running ESXi 5 can experience issues with NetApp NFS data volumes under high load. Vaughn Stewart detailed the “naughty” behavior and the proposed resolutions from NetApp and VMware in his blog at http://virtualstorageguy.com/. Specifically he noted:

A NFS datastore disconnect issue displays the following behaviors:

  • NFS datastores are displayed as greyed out and unavailable in vCenter Server or the vSphere client
  • Virtual Machines (VMs) on these datastores may hang during these times
  • NFS datastores often reappear after a few minutes, which allows VMs to return to normal operation
  • This issue is most often seen after ESXi 5 is introduced into an environment

This issue is documented in VMware KB 2016122 and NetApp Bug 321428.

The proposed resolutions are great advice, but what can you do if you missed the notice (or were not able to implement the fix) and your virtual machine is hung, its virtual disks are corrupted, or, in the worst case, the NFS datastore stopped responding and never came back?

If the issue is the datastore, the first step is to stop using the aggregate, not just the affected volume. Because NetApp uses a copy-on-write style file system, with its user and system snapshots kept at the aggregate level, the Data ONTAP® OS writes to the disks all of the time, and any additional data written can exacerbate corruption on the volume.

The next step should be to contact support.  Both NetApp and VMware support can assist with investigation into the data loss event.  If support determines that there is corruption or data loss that cannot be corrected or overcome, the next call needs to be to a qualified data recovery company.  Ask your support team for recommendations.  When consulting with a data recovery specialist, make sure to inquire about their experience with NetApp and VMware.

If the issue is the virtual disk, the first step is to stop using the virtual disk. The next step should be to make a copy of the virtual disk to a new datastore (also copy off all of the snapshots and config files).  Avoid making any changes to the virtual disk or the datastore it resides on before the copy is finished.  Once the copy is complete, contact support.  From there, if needed, seek a qualified data recovery company that specializes in enterprise class storage and has specific experience with NetApp and VMware.
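The copy step above can be sketched in Python. This is only an illustration under stated assumptions, not a VMware or NetApp procedure: it assumes both the affected VM directory and the new datastore are reachable as ordinary mounted directories on the machine running the script, and the function names (`copy_and_hash`, `sha256_of`, `copy_vm_directory`) are hypothetical. It opens the source files for reading only, reads each one a single time, copies everything in the VM directory (virtual disks, snapshot files and config files alike), and then checks each copy against a SHA-256 checksum.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time so large disks fit in memory


def copy_and_hash(source_file, target_file):
    """Copy one file and return the SHA-256 of the bytes read from the source.

    The source is opened read-only and read exactly once.
    """
    digest = hashlib.sha256()
    with open(source_file, "rb") as src, open(target_file, "wb") as dst:
        for chunk in iter(lambda: src.read(CHUNK_SIZE), b""):
            digest.update(chunk)
            dst.write(chunk)
    return digest.hexdigest()


def sha256_of(path):
    """Return the SHA-256 hex digest of an existing file."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_vm_directory(source_dir, target_dir):
    """Copy every file under source_dir to target_dir and verify each copy.

    Returns a dict mapping each relative file path to True when the copy's
    checksum matches the data read from the source, False otherwise.
    """
    source_dir = Path(source_dir)
    target_dir = Path(target_dir)
    results = {}
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        target_file = target_dir / relative
        target_file.parent.mkdir(parents=True, exist_ok=True)
        source_digest = copy_and_hash(source_file, target_file)
        results[str(relative)] = source_digest == sha256_of(target_file)
    return results
```

A call such as `copy_vm_directory("/mnt/old_datastore/myvm", "/mnt/new_datastore/myvm")` (both paths are placeholders) returns a per-file report; any entry that is `False` is worth mentioning when you contact support. Hashing the source while it is being copied, rather than in a second pass, keeps reads against the troubled datastore to a minimum.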