VMware Guests Lose Network Connectivity
Closed     Case # 10002     Affiliated Job:  New Trier Township District 2031
Opened:  Tuesday, December 15, 2009     Closed:  Tuesday, February 9, 2010
Total Hit Count:  25890     Last Hit:  Tuesday, April 23, 2024 12:31:24 AM
Unique Hit Count:  6410     Last Unique Hit:  Tuesday, April 23, 2024 12:31:24 AM
Case Type(s):  Server, Vendor Support, Network
Case Note(s):  All cases are posted for review purposes only. Any implementations should be performed at your own risk.

Problem:
Four Dell PowerEdge R710 servers running vSphere 4.0 U1 arranged as a clustered environment, using the embedded 1 Gb 4-port Broadcom 5709 NICs as a single vswitch NIC team connected directly to the Cisco core, with switch ports trunked across VLANs. Shared EMC CX4-120 SAN storage is connected through dual Fibre Channel via Dell QLogic QLE2462 controllers joined by a Cisco 9124 Fibre Channel switch running v4.1.3a.
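
For reference, the NIC team and Fibre Channel paths described above can be verified from the ESX service console. A minimal sketch (output fields vary by build):
-   esxcfg-nics -l (lists each vmnic with its driver, link state and speed)
-   esxcfg-vswitch -l (lists the vswitch, its port groups and uplink assignments)
-   esxcfg-mpath -l (lists the Fibre Channel paths to the CX4-120 LUNs)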

At random times the guests, and sometimes even the service console, lose network connectivity on the affiliated vmnic: they cannot be accessed remotely across the network, nor can the console of an affected guest submit outbound network traffic. If the service console is accessible, the affected guests may be vMotioned or the host may be placed into maintenance mode - once a guest is moved to another host in the cluster, network access to it is immediately restored. If the service console is on the affected vmnic, disabling the port on the Cisco core will cause a migration to an alternate vmnic on the vswitch, returning network connectivity to the service console and allowing the affected host to be placed into maintenance mode.
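
The core-switch side of that workaround, sketched in the CatOS syntax referenced later in this case (the [Blade #]/[Port #] values are placeholders for the port the affected vmnic uplinks to):
-   set port disable [Blade #]/[Port #] (forces the vswitch to fail the service console over to an alternate vmnic)
-   set port enable [Blade #]/[Port #] (re-enables the port once the host is in maintenance mode)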

Action(s) Performed:
Total Action(s): 1
Action #: 10033     Recorded: 2/9/2010 12:20:36 PM     Type: Vendor Support     Hit(s): 3654     User: contact@danieljchu.com
Last Hit: Tuesday, April 23, 2024 12:31:15 AM

Upgraded to the latest (1/5) VMware releases using VMware Update Manager. Also updated all firmware for each of the following components in our Dell PowerEdge R710 servers:
-   BIOS 1.3.6
-   PERC 6/i 6.2.0-0013
-   iDRAC 1.30
-   Broadcom 5709 5.0.11
-   QLogic QLE2462 2.02
-   OpenManage 6.2 (latest)
-   Broadcom bnx2 drivers 1.9.26c
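
To confirm the host picked up the new build and bundles after patching, a quick sketch from the service console (bulletin names vary by release):
-   vmware -v (reports the ESX version and build number)
-   esxupdate query (lists the bulletins installed on the host)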


Diagnosed the issue while it was occurring, with VMware support, EMC and Dell all reviewing it together:
-   The vmnic with the issue affects all assigned guests, and the service console if it is assigned to that vmnic
-   All other guests on the other vmnics remain unaffected
-   Affected guests run a mixture of OS and virtual hardware versions, including Windows 2003/2008, 32 & 64-bit, and virtual hardware versions 4 & 7 - all have the latest version of VMware Tools installed
-   As long as the service console is assigned to an unaffected vmnic, we can connect to the affected guests through console access via the vSphere Client and log in using the local administrative account
-   The vSphere Client, the affected guests and the Cisco switch all report no network communication errors and appear online - disconnecting the Ethernet cable or taking down the switch port does correctly show the port as down in all areas, but once reconnected the issue persists on the affected vmnic
-   Selecting the affected host in the vSphere Client, Configuration tab, Networking hardware, and clicking the blue speech-bubble icon next to the affected vmnic opens a pop-up window in which no CDP details are reported, while all other vmnics report information (a service console approximation is sketched after this list)
-   Switching switch ports, applying the updates above, and running "set port disable [Blade #]/[Port #]" on the core do not help
-   EMC reviewed the SAN configuration and could not find any problems with performance or configuration, and other guests using the same LUNs on unaffected vmnics respond normally
-   A reboot is the only way to bring the affected vmnic back online
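
The CDP check performed in the vSphere Client can be approximated from the service console. A hedged sketch - the -n (network) option is standard, but the grep pattern is an assumption about the output layout:
-   esxcfg-info -n | grep -i -A 10 cdp (dumps network info and filters for the CDP summary; an affected vmnic should show no neighbor details)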


Useful commands in diagnosis:
-   cat /proc/scsi/qla2xxx/4 (reports firmware/driver version information for the QLE2462 controller)
-   cat /proc/vmware/version (reports the VMkernel build and the versions of the drivers loaded on the host)
-   Install updated Broadcom NIC drivers (from the inbox 1.6.9 to 1.9.26):
   o   Copy the driver disk.iso to /tmp on the host (e.g. via WinSCP)
   o   SSH to the host
   o   If the mount path does not exist, create it: mkdir /mnt/disk
   o   mount -o loop /tmp/disk.iso /mnt/disk
   o   cd /mnt/disk
   o   ls -l (to see a detailed listing of the bundle files)
   o   esxupdate --bundle=<offline bundle>.zip update (substitute the offline bundle filename shown by ls)
-   ifconfig vmnic# up (or "down")
-   esxtop (press "n" once after it loads to display the vmnic assigned to each guest & the service console)
-   srvadmin-services.sh restart (restarts the OpenManage services)
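
Related to the workaround above, the uplinks behind the service console can also be shuffled from the console itself. A minimal sketch, assuming vSwitch0 carries the service console and vmnic0 is the affected adapter:
-   esxcfg-vswitch -l (confirm which vmnics are uplinked to vSwitch0)
-   esxcfg-vswitch -U vmnic0 vSwitch0 (unlink the affected vmnic from the vswitch)
-   esxcfg-vswitch -L vmnic1 vSwitch0 (link a healthy vmnic in its place)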

Resolution:
After discussions with VMware support, multiple VMware log transfers, an EMC SAN review with SPCollects & emcgrab.sh log transfers, and Dell overall system reviews, VMware came to the conclusion that the Broadcom NICs were at fault in all four of our servers and suggested a motherboard replacement by Dell. As we began to follow up with Dell on this, VMware got back to us with a change of diagnosis: the U1 release of vSphere 4.0 introduced a problem, acknowledged worldwide and affecting many customers, with Dell PowerEdge R710 & R900 servers equipped with the embedded Broadcom 5709 quad-port NICs.

"Thank you for your Support Request.

I involved VMware escalation engineer into this SR and we found that same issue is already faced at different customers worldwide with DELL PowerEdge servers which has 4 port Broadcom LOM. We found that the issue happens after putting continuous stress on the bnx2 NIC for sometime which is also true in your environment. VMware engineering has already worked the solution with the product vendors and it is expected to be released with ESX 4.0 Update 2.

Please let me know as how do you want us to proceed on this as of now as U2 is due for release sometime in the end of second quarter 2010 (Dates are tentative and may change if needed).

Looking forward to hearing from you.
"

10/15/2010 Update: Not long after we filed these complaints with VMware, EMC & Dell, Dell provided us with 4 Intel quad-port gigabit NICs; these have operated perfectly, further isolating the issue to the embedded Broadcom NICs. U2 was released to us in beta; however, we were told that by installing the U2 beta we would have to rebuild the servers in order to deploy the final release, so we decided to continue using the Intel NICs until the official release. As of last week we are now operating 4.1; however, we have not reverted back to the Broadcom NICs simply because we have these Intel NICs and feel no urgency to migrate back. We have been told that these issues were resolved as of U2.


