ESXi management network issues when using EtherChannel and NIC teaming

Posted on March 5, 2011 in vSphere

ESXi behavior with NIC trunking

Sometimes very challenging problems will arise.  Things that make you scratch your head, want to hurl your coffee cup, or just have a nice cold adult beverage.  Customers can change a project’s requirements mid-way through, a vendor’s storage array code upgrade can go awry, or a two can creep into the ones and zeros.

In this section, we present examples of those crazy situations with the hopes of helping out our fellow engineers in the field before they become as frustrated as we have!

Recently, while working with a customer, the request was for a new cluster of ESXi 4.1 hosts.  They would be using just two onboard NICs for the vmkernel and virtual machine traffic.  These two NICs would feed into a pair of Cisco Nexus 5020s, using virtual port channel (vPC).

Because of the use of vPC, the virtual switch load balancing needs to be set to Route based on IP hash for the NIC teaming policy.  Okay, no sweat!  After installing ESXi and completing the initial configuration on the hosts, it was time to add the second NIC to vSwitch0 and plug it in.  (Note: this configuration is all being done directly on the hosts, as no vCenter Server has been built yet.)  After adding the second adapter to the active adapters section of vSwitch0 and changing the NIC teaming policy to IP hash, we plugged in the second cable.
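For anyone who prefers to script that step, here is a minimal PowerCLI sketch of the same change, connecting straight to the host since there is no vCenter yet.  The hostname, credentials, and second NIC name (vmnic1) are placeholders rather than what we actually used; treat it as a sketch, not a prescription.

    # Connect directly to the host; no vCenter Server exists yet (placeholder credentials)
    Connect-VIServer -Server esxi01.example.local -User root -Password 'changeme'

    $vmhost = Get-VMHost -Name esxi01.example.local
    $vsw    = Get-VirtualSwitch -VMHost $vmhost -Name vSwitch0

    # Add the second onboard NIC (assumed here to be vmnic1) as an uplink on vSwitch0
    Set-VirtualSwitch -VirtualSwitch $vsw -Nic vmnic0, vmnic1

    # Set the vSwitch NIC teaming policy to Route based on IP hash,
    # which is what the static EtherChannel / vPC on the physical switches expects
    Get-NicTeamingPolicy -VirtualSwitch $vsw |
        Set-NicTeamingPolicy -LoadBalancingPolicy LoadBalanceIP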

The host immediately lost its connection from our vSphere Client and became completely unreachable.  No ping, no nothing!  This was most puzzling indeed:  we unplugged the second cable and the host started to ping again.  We thought maybe there was something wrong with the NIC itself, and so set up a separate NIC to take its place.  This had the same result, so we then turned our attention to the switch.  After discussing the current configuration with the network engineer, we felt that his configuration was correct.  The configuration (and more!) can be found in the white paper put out by Cisco and VMware: “Deploying 10 Gigabit Ethernet on VMware vSphere 4.0 with Cisco Nexus 1000V and VMware vNetwork Standard and Distributed Switches – Version 1.0”.  This doc has been very helpful during the implementation of this project.

So!  With the network being deemed not the problem, and wearing a sheepish smile on my face after the network guy commented “it’s always the network, isn’t it?”, I returned to the host.  I then tried setting up both NICs on a non-Nexus switch that is being used for out-of-band management, and they worked just fine using Route based on originating virtual port ID for NIC teaming.  So at that point, I fired up the googalizer and did some checking.  I came across this KB article from VMware:

VMware KB 1022751:  NIC teaming using EtherChannel leads to intermittent network connectivity in ESXi

Details:

When trying to team NICs using EtherChannel, the network connectivity is disrupted on an ESXi host. This issue occurs because NIC teaming properties do not propagate to the Management Network portgroup in ESXi.
When you configure the ESXi host for NIC teaming by setting the Load Balancing to Route based on IP hash, this configuration is not propagated to Management Network portgroup.

So, based on this very helpful information, I followed the instructions listed in the KB and had great success.  Now my ESXi hosts are talking on both NICs via IP Hash and life is good.
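The KB walks through making that change in the vSphere Client, but the same idea can be expressed in PowerCLI for anyone scripting it.  This is only a sketch under the same assumptions as above (hostname and portgroup name may differ in your environment; the default management portgroup on ESXi is called “Management Network”):

    # The vSwitch-level teaming policy does not propagate to the Management Network
    # portgroup on its own, so set the portgroup's own policy to IP hash as well
    $vmhost = Get-VMHost -Name esxi01.example.local
    $vsw    = Get-VirtualSwitch -VMHost $vmhost -Name vSwitch0
    $mgmtPg = Get-VirtualPortGroup -VirtualSwitch $vsw -Name 'Management Network'

    Get-NicTeamingPolicy -VirtualPortGroup $mgmtPg |
        Set-NicTeamingPolicy -LoadBalancingPolicy LoadBalanceIP

    # Verify that both the vSwitch and the portgroup now report LoadBalanceIP
    Get-NicTeamingPolicy -VirtualSwitch $vsw | Select-Object LoadBalancingPolicy
    Get-NicTeamingPolicy -VirtualPortGroup $mgmtPg | Select-Object LoadBalancingPolicy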

1 Comment

  1. I don’t think this is relegated to just the management network. I had this exact issue yesterday, bringing a prod host’s VMs down. I’ve had two cases of intermittent network drops on EtherChanneled ESXi 4.1 hosts in the last month.

