
When Adding Physical NICs, dvUplinks Fail Due to Inadequate Driver Heap Size

As part of our efforts to continuously grow and mature our network at CoverMyMeds, we recently implemented new network segments for our main office location.  The servers in the VMware environment at our office are older, rack-mounted systems, so we decided to dedicate two of the eight physical NICs on each host to support and separate VM traffic on these new segments.

Typically in a dvSwitch environment, you would create distributed port groups that map to each new VLAN, ensure that the back-end switching configuration sends the proper 802.1Q VLAN tags over those trunk ports, and then assign physical NICs (VMNICs) to dvUplinks. At that point, you connect a VM to the distributed port group and off you go.
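For reference, the uplink side of that can also be driven from the ESXi shell. Here's a rough sketch; the dvSwitch name "DSwitch-Office" and the uplink port ID 16 are hypothetical placeholders, and the port groups themselves still get created in vCenter:

# Confirm the host sees the dvSwitch and note its uplink port IDs
esxcli network vswitch dvs vmware list
# Attach a physical NIC to a free dvUplink port on the dvSwitch
esxcfg-vswitch -P vmnic6 -V 16 DSwitch-Office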

…Not so fast.

As I added the VMNICs to the dvUplinks, something strange happened. What should have been a three-second task in vCenter hung. And hung. And hung. Finally, after about 10 minutes, a timeout error appeared, and shortly thereafter the host disconnected from vCenter. This was especially puzzling because the VMs running on the machine stayed alive and reachable on the network, and even the management interface of the affected host returned pings just fine. It was as if the management agents on the host had simply become unresponsive. (Lesson learned: put your server in maintenance mode before trying this sort of operation.) A reboot brought the server back into vCenter's view, but I was still unable to add the VMNICs I needed to provide connectivity to my new VLANs.
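Both the precaution and the recovery can be handled from the host's local shell when vCenter has lost sight of it; these are stock ESXi commands:

# Put the host into maintenance mode before touching uplinks (the lesson above)
vim-cmd hostsvc/maintenance_mode_enter
# If the management agents hang, restarting them beats a full reboot when it works
/etc/init.d/hostd restart
/etc/init.d/vpxa restart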

Investigating with VMware support, we found the following entries in the vmkernel.log file:

WARNING: Heap: 2796: Heap vmklnx_bnx2 (68469696/68476928): Maximum allowed growth (8192) too small for size (69632)

followed shortly by

WARNING: Heap: 3058: Heap_Align(vmklnx_bnx2, 65536/65536 bytes, 8 align) failed.

and

<3>bnx2 0000:05:00.0: vmnic6: Cannot allocate firmware buffer for uncompression.
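These warnings land in the standard log location on ESXi 5.x, so a quick search will tell you whether a host is affected:

# Search the live vmkernel log for the bnx2 heap warnings
grep -i 'vmklnx_bnx2' /var/log/vmkernel.log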

It turns out this is a known issue affecting versions of vSphere up to and including 5.5. The bnx2 driver simply ran out of heap space to perform the operation in question.
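For the curious, ESXi's vsish utility can show per-heap usage before a driver falls over. The exact node names vary by build, so treat this as a sketch rather than gospel:

# List driver heaps; the bnx2 heap appears as vmklnx_bnx2 plus an identifier
vsish -e ls /system/heaps | grep bnx2
# Dump usage stats for that heap, substituting the node name from the listing
vsish -e get /system/heaps/<heap-node>/stats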

Thankfully, the fix turned out to be simple to implement. A single command (esxcfg-module -s "heap_initial=4194304 heap_max=129368064" bnx2) increased the maximum heap allocation size, and a host reboot made the change take effect. Running esxcfg-module -g bnx2 after the reboot confirmed that the new settings persisted, and the dvUplink addition worked as expected.
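Putting it all together, the whole remediation is three steps at the host shell, using the heap values VMware support provided for our case:

# Raise the initial and maximum heap sizes for the bnx2 module
esxcfg-module -s "heap_initial=4194304 heap_max=129368064" bnx2
# Reboot so the driver reloads with the new limits
reboot
# After the host comes back, verify the options persisted
esxcfg-module -g bnx2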

A word of caution: tinkering with driver settings in this fashion can have extremely unpredictable results, so it should only be done on an as-needed basis to resolve this particular issue. Or in your test environment, of course.
