09.03.2015 Views

VSAN-Troubleshooting-Reference-Manual


Diagnostics and Troubleshooting Reference Manual – Virtual SAN

2014-08-24T17:00:30.912Z cpu33:7027198)WARNING: LSOM: LSOMEventNotify:4570: VSAN device 52378176-a9da-7bce-0526-cdf1d863b3b5 is under permanent error.

2014-08-24T17:00:30.912Z cpu33:7027198)WARNING: LSOM: RCVmfsIoCompletion:99: Throttled: VMFS IO failed. Wake up 0x4136af9a69c0 with status Maximum kernel-level retries exceeded

2014-08-24T17:00:30.912Z cpu33:7027198)WARNING: LSOM: RCDrainAfterBERead:5070: Changing the status of child state from Success to Maximum kernel-level retries exceeded
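When scanning a longer vmkernel log for this condition, the failed device's UUID can be pulled out of the LSOM warning. The following is a minimal sketch, assuming the message wording and UUID shape match the single sample quoted above; the function name `failed_devices` is hypothetical and not part of any VMware tool.

```python
import re

# Pattern for the LSOM "permanent error" warning quoted in the excerpt.
# The UUID shape (8-4-4-4-12 hex digits) is inferred from that one sample.
PERMANENT_ERROR = re.compile(
    r"VSAN device (?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}) "
    r"is under permanent error"
)


def failed_devices(log_lines):
    """Return UUIDs reported under permanent error, in order, without duplicates."""
    seen = []
    for line in log_lines:
        match = PERMANENT_ERROR.search(line)
        if match and match.group("uuid") not in seen:
            seen.append(match.group("uuid"))
    return seen


sample = [
    "2014-08-24T17:00:30.912Z cpu33:7027198)WARNING: LSOM: LSOMEventNotify:4570: "
    "VSAN device 52378176-a9da-7bce-0526-cdf1d863b3b5 is under permanent error.",
    "2014-08-24T17:00:49.320Z cpu36:33542)megaraid_sas: Reset successful.",
]
print(failed_devices(sample))
```

The pattern expects each log entry on a single line, so entries that were wrapped during extraction need to be rejoined first.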

Eventually, the firmware reset on the controller completed, but by this time it was too late and Virtual SAN had already marked the disks as failed:

2014-08-24T17:00:49.279Z cpu21:33542)megasas: FW now in Ready state

2014-08-24T17:00:49.299Z cpu21:33542)megasas:IOC Init cmd success

2014-08-24T17:00:49.320Z cpu36:33542)megaraid_sas: Reset successful.

When a controller ‘wedges’ like this, Virtual SAN will retry I/O for a finite amount of time. In this case, it took a full 24 seconds (24,000 ms) for the adapter to come back online after resetting. This was too long for Virtual SAN: the maximum retries threshold was exceeded, which in turn led Virtual SAN to mark the disks as DEGRADED.
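Timing claims like this can be checked directly against the log timestamps. The sketch below, which uses hypothetical helper names, measures the gap between two lines quoted on this page: the LSOM permanent-error warning and the final "Reset successful" message. It does not reproduce the full 24-second reset window, because the line marking the start of the reset is not part of this excerpt.

```python
from datetime import datetime


def parse_vmkernel_timestamp(line):
    """Parse the leading UTC timestamp of a vmkernel log line."""
    stamp = line.split(" ", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")


def seconds_between(earlier_line, later_line):
    """Elapsed seconds between the timestamps of two log lines."""
    delta = parse_vmkernel_timestamp(later_line) - parse_vmkernel_timestamp(earlier_line)
    return delta.total_seconds()


# Both lines are taken from the log excerpts on this page.
marked_failed = (
    "2014-08-24T17:00:30.912Z cpu33:7027198)WARNING: LSOM: LSOMEventNotify:4570: "
    "VSAN device 52378176-a9da-7bce-0526-cdf1d863b3b5 is under permanent error."
)
reset_done = "2014-08-24T17:00:49.320Z cpu36:33542)megaraid_sas: Reset successful."

gap = seconds_between(marked_failed, reset_done)
print(f"reset completed {gap:.3f} s after the device was marked failed")

# Unit check for the figure quoted in the text: 24 seconds in milliseconds.
print(24 * 1000, "ms")
```

For these two lines the gap is about 18.4 seconds, which shows that the disks had already been failed well before the controller reported that it was ready again.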

Virtual SAN is responding as designed here. The problem is that the firmware crashed. This particular issue was resolved by using the recommended versions of the MegaRAID driver and firmware, as listed in the VMware Compatibility Guide.

Storage controller replacement<br />

In general, a replacement controller should be the same make and model as the original. Administrators should not swap a pass-through controller for a RAID 0 controller, or vice versa.

Expectations when a drive is reporting errors<br />

In this scenario, a disk drive is reporting errors due to bad blocks. If a read I/O accessing component data on behalf of a virtual machine fails in this manner, Virtual SAN checks the other replicas of the component to satisfy the read. If another mirror can satisfy the read, Virtual SAN attempts to write the good data onto the disk that reported the bad read. If this procedure succeeds, the disk does not enter a DEGRADED state. Note that this is not the behavior when the read error occurs while accessing Virtual SAN metadata. If a read of metadata fails, or if any write fails, Virtual SAN marks all components on the disk as DEGRADED. It treats the disk as failed, and the data on it is no longer usable. Upon entering the DEGRADED state, Virtual SAN restores I/O flow immediately (by taking the bad component out of the active set of the affected object) and tries to re-protect the object by creating a new replica of the component somewhere else in the cluster.
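The outcomes described above can be summarized as a small decision function. This is only an illustrative model of the stated behavior, not Virtual SAN's implementation; the names `Outcome` and `classify_io_error` are hypothetical. The text does not say what happens when no other replica can satisfy a failed data read, so that case is reported as not covered rather than guessed.

```python
from enum import Enum


class Outcome(Enum):
    REPAIRED = "good data rewritten; disk does not enter DEGRADED"
    DEGRADED = "all components on the disk marked DEGRADED"
    NOT_COVERED = "case not described in the text"


def classify_io_error(is_write, is_metadata=False, replica_read_ok=False, rewrite_ok=False):
    """Map a failed I/O to the outcome the text describes (illustrative only)."""
    # Any failed write, or a failed read of metadata, degrades the disk's components.
    if is_write or is_metadata:
        return Outcome.DEGRADED
    # Failed read of component data: another replica must satisfy the read first.
    if not replica_read_ok:
        return Outcome.NOT_COVERED
    # The repair is itself a write, so a failed repair falls under "any write fails".
    return Outcome.REPAIRED if rewrite_ok else Outcome.DEGRADED


print(classify_io_error(is_write=False, replica_read_ok=True, rewrite_ok=True).name)
print(classify_io_error(is_write=False, is_metadata=True).name)
print(classify_io_error(is_write=True).name)
```

Treating a failed repair write as DEGRADED follows from the rule that any write failure degrades the disk, since the text does not spell out that case separately.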

VMWARE STORAGE BU DOCUMENTATION / 162
