VMware Lost path redundancy to storage with EqualLogic Arrays / Initiator disconnected from target during login.

We were getting some transient alarms in VMware similar to:

Lost path redundancy to storage device naa.64ed2ad569ca2e943e0a550600002009. Path vmhba37:C2:T0:L0 is down. Affected datastores: DATASTORENAME.

But SAN HQ, the VMware host logs, and the Dell LASO information didn’t shed much light on the situation. The host logs basically told us the same thing the alarms did: a path was dropping and coming back up.

tl;dr Working As Expected – ESX/ESXi hosts randomly drop and reconnect iSCSI connections to an EqualLogic array (2004432)

We had recently switched one of our vSphere installations from NAS to an EQL PS6100 and had also upgraded some hosts to 5.5 within the week. But, again, everything was clean and there wasn’t actually any noticeable degradation of performance.

Dell Multipath Extension Module

Now, I had used the Dell image for ESXi 5.5 (VMware-VMvisor-Installer-5.5.0-1331820.x86_64-Dell_Customized_A00.iso) to perform the upgrade. What I didn’t expect, however, was that the included version of Dell MEM would not be up to date. Granted, the version we were running prior was 1.1.2, whereas the current is 1.2.0.

So I checked the MEM version on the hosts themselves (via SSH).

The VIB dell-eql-routed-psp was on 1.1.1.
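
A minimal sketch of that kind of check, assuming the usual `esxcli software vib list` layout with the VIB name in the first column and the version in the second. The helper name `vib_version` is hypothetical and only there for illustration; it is not part of esxcli or the Dell tooling.

```shell
# Print the version column for one VIB from `esxcli software vib list` output.
# Assumes column 1 is the VIB name and column 2 is the version.
# vib_version is a hypothetical helper name, not an esxcli feature.
vib_version() {
  awk -v name="$1" '$1 == name { print $2 }'
}

# On the host (over SSH), something like:
#   esxcli software vib list | vib_version dell-eql-routed-psp
```

A plain `esxcli software vib list | grep -i eql` shows the same information along with any other EqualLogic VIBs that are installed.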

After some reading, I found people claiming this issue was alleviated by updating the MEM VIB on the hosts as well as updating the firmware on the SAN arrays. Pretty standard, low-hanging-fruit advice, but worth a shot on our non-production vSphere cluster. (Our new EQL PS6100 was still on firmware v6.x, so that was upgraded to v7.x.)

Results

So far, our vSphere cluster that has both the VIB and the SAN arrays upgraded no longer sees this path degradation issue. It’s a pretty good comparison to our other vSphere cluster, which is still seeing these alarms.

Remember to reboot the ESXi hosts after installing the new VIB and verify the VIB version afterwards.
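
One way to script that post-reboot check, as a sketch only: it reads `esxcli software vib list` output on stdin and succeeds when the named VIB's version starts with the prefix you expect. The function name `vib_is_version` is hypothetical, and the expected version is whatever your MEM release actually ships.

```shell
# Succeed if the named VIB's version (column 2 of `esxcli software vib list`)
# starts with the expected prefix. Reads the listing from stdin.
# vib_is_version is a hypothetical helper name; $1 = VIB name, $2 = version prefix.
vib_is_version() {
  ver=$(awk -v name="$1" '$1 == name { print $2 }')
  case "$ver" in
    "$2"*) [ -n "$ver" ] ;;   # prefix matches and the VIB was actually found
    *) return 1 ;;            # VIB missing or on a different version
  esac
}

# On the host, something like:
#   esxcli software vib list | vib_is_version dell-eql-routed-psp 1.2 && echo "MEM VIB updated"
```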

Don’t Forget LoginTimeout and DelayedAck

Unless you want your degraded paths to turn into lost paths, you should also make sure you follow the advice from Dell / EqualLogic and VMware.

Symptoms:

You will see generic lost connection errors in vSphere as well as generic errors in hostd.log:

2014-09-24T08:03:22.931Z [29140B70 info 'Hostsvc.VmkVprobSource'] VmkVprobSource::Post event: (vim.event.EventEx) {
-->    dynamicType = <unset>,
-->    key = 707785232,
-->    chainId = 707785232,
-->    createdTime = "1970-01-01T00:00:00Z",
-->    userName = "",
-->    datacenter = (vim.event.DatacenterEventArgument) null,
-->    computeResource = (vim.event.ComputeResourceEventArgument) null,
-->    host = (vim.event.HostEventArgument) {
-->       dynamicType = <unset>,
-->       name = "HOSTNAME",
-->       host = 'vim.HostSystem:ha-host',
-->    },
-->    vm = (vim.event.VmEventArgument) null,
-->    ds = (vim.event.DatastoreEventArgument) null,
-->    net = (vim.event.NetworkEventArgument) null,
-->    dvs = (vim.event.DvsEventArgument) null,
-->    fullFormattedMessage = <unset>,
-->    changeTag = <unset>,
-->    eventTypeId = "esx.problem.scsi.device.state.permanentloss",
-->    severity = <unset>,
-->    message = <unset>,
-->    arguments = (vmodl.KeyAnyValue) [
-->       (vmodl.KeyAnyValue) {
-->          dynamicType = <unset>,
-->          key = "1",
-->          value = "naa.xxxxxxxxxxxxxxxxxxx",
-->       },
-->       (vmodl.KeyAnyValue) {
-->          dynamicType = <unset>,
-->          key = "2",
-->          value = ""VMFS_NAME"",
-->       }
-->    ],
-->    objectId = "ha-eventmgr",
-->    objectType = "vim.HostSystem",
-->    objectName = <unset>,
-->    fault = (vmodl.MethodFault) null,
--> }

and in EQL Group manager you’ll see something like this around the same time:

 Error    9/24/2014 2:38:22 AM  PS6100  7.4.3 | 7.4.23  iSCSI login to target ‘x.x.x.x:3260, iqn.2001-05.com.equallogic:x-xxxxx-HOSTNAME’ from initiator ‘X.X.X.X:30972, iqn.1998-01.com.vmware:HOSTNAME-XXXXX’ failed for the following reason: | Initiator disconnected from target during login.

Resolution:

Make sure to read the Deployment Considerations in the EqualLogic PDF. Aside from the iSCSI connection count, consider the following sections:

Deployment Considerations: iSCSI Login Timeout on vSphere 5.x
The default value of 5 seconds for iSCSI logins on vSphere 5.x is too short in some circumstances. For example: In a large configuration where the number of iSCSI sessions to the array is close to the limit of 1024 per pool. If a severe network disruption were to occur, such as the loss of a network switch, a large number of iSCSI sessions will need to be reestablished. With such a large number of logins occurring, some logins will not be completely processed within the 5 second default timeout period.
Dell therefore recommends applying patch ESXi500-201112001 and increasing the ESXi 5.0 iSCSI Login Timeout to 60 seconds to provide the maximum amount of time for such large numbers of logins to occur.

If the patch is installed prior to installing the EqualLogic MEM, the MEM installer will automatically set the iSCSI Login Timeout to the Dell recommended value of 60 seconds.
The iSCSI Login Timeout value can also be set using esxcli with the following syntax:
esxcli iscsi adapter param set --adapter=vmhba --key=LoginTimeout --value=60
See VMware KB 2009330 for additional information.

[…]

Disabling Delayed ACK: Delayed ACK is a TCP/IP method of allowing segment acknowledgements to piggyback on each other or other data passed over a connection with the goal of reducing IO overhead. One side effect of delayed ACK is that if the pipeline isn’t filled, acknowledgement of data will be delayed. In SANHQ this can be seen as higher latency during lower I/O periods. Latency is measured from the time the data is sent to when the acknowledgement is received. Since we are talking about disk I/O any increase in latency can result in poorer performance. Additional information can be found in VMware KB 1002598.
Note: While iSCSI Login Timeout is considered a best practice, it is also considered a requirement, and therefore will always be set to 60 seconds during installation.

You’ll need to reboot the host for the changes to take effect.
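
The LoginTimeout change quoted above can be applied per adapter along these lines. This is a sketch under stated assumptions: it assumes `esxcli iscsi adapter list` prints the adapter name in column 1 and the driver in column 2, and that the software iSCSI adapter uses the `iscsi_vmk` driver. The function names `sw_iscsi_adapters` and `set_login_timeout` are hypothetical. Check the adapter list on your own hosts before running anything like this.

```shell
# List software iSCSI adapter names from `esxcli iscsi adapter list` output (stdin).
# Assumes column 1 is the adapter (e.g. vmhba37) and column 2 is the driver,
# with iscsi_vmk being the software iSCSI driver. Hypothetical helper name.
sw_iscsi_adapters() {
  awk '$2 == "iscsi_vmk" { print $1 }'
}

# Set LoginTimeout to 60 on each software iSCSI adapter, then print the value back.
# Only meaningful on an ESXi host where esxcli is available.
set_login_timeout() {
  for hba in $(esxcli iscsi adapter list | sw_iscsi_adapters); do
    esxcli iscsi adapter param set --adapter="$hba" --key=LoginTimeout --value=60
    esxcli iscsi adapter param get --adapter="$hba" | grep LoginTimeout
  done
}

# On the host: set_login_timeout
```

Reading the parameter back with `param get` confirms the new value took before you schedule the reboot.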
