
Blade Firmware update with no down time.

 
chuckk281

Andrew was looking for best practices on how to update firmware and not lose productivity on the blades using the Virtual Connect Support Utility (VCSU):

 

***********************************************************

Please send an example of a "no outage" configuration that will ensure no outage during a VC firmware update. The docs suggest a 1 to 10 minute outage. Maybe this means the update "can" cause, rather than "will" cause, a 1 to 10 minute outage, and maybe it applies only to older firmware versions. It is still not clear. Who has done a firmware update in production? What was your experience?

 

Does a correctly configured active/active and/or active/standby configuration, with the correct OS network teaming and upstream switch configuration, result in no downtime during a FW update?

 

I had sent out the wrong link - here is the correct URL:

http://h20000.www2.hp.com/bc/docs/support/SupportManual/c02049593/c02049593.pdf?jumpid=reg_R1002_USEN

 

Here are a couple of excerpts:

The VCSU provides a command-line scriptable method to update the VC firmware. It also contains logic that minimizes any network and fabric outages caused by the update process under certain VC configurations. In some cases, the VCSU can eliminate all outages during the firmware update process if correct redundancy has been installed. VCSU must be executed from a Windows or Linux (requires version 1.40 or later of VCSU) workstation or server.

 

Finally, the VC and other networking infrastructure should be updated. These devices take large amounts of time to update and can cause networking and server outages so planning must be done to accommodate large update windows (approximately 1-2 hours for all updates).

 

It is a great document - just need clarification....

***********************************************************************

Alex provided a detailed example:

****************************************************************

I was working with a customer this week and was able to prove to them that they can update firmware in a 2-enclosure multi-enclosure domain while staying live in production. They had a 4-server Linux cluster and a 2-server VMware cluster spread between the enclosures. First, we had to verify their stacking configuration to ensure that VCSU wouldn’t leave one of the servers isolated from the network. To ensure that, you have to have a straight-down stacking link configuration, not a loop. Proper stacking guidelines are listed on pages 14-16 of this document:

http://h20195.www2.hp.com/v2/getdocument.aspx?docname=4AA1-1555ENW.pdf

 

  1. In this particular customer’s case, configuring their network access as Active/Active was key to success.
  2. Insert a time delay into the VCSU 1.5.0 firmware update with “-we”. The number of minutes to wait depends on your config, but in my case 5 minutes did the trick. With the delay, the update proceeds as follows:
    1. VCSU will stage firmware and install on all modules
    2. VCSU will activate (reset) backup side modules first
    3. VCSU will wait X minutes
    4. VCSU will reset all Primary side modules except for the Primary module itself
    5. VCSU will failover Primary to the Backup side
    6. VCSU will reset former Primary module
  3. Several issues discovered around VCSU are being documented into a CA right now: in serial mode of activation, VCSU will not fail over the Primary VCM to the backup, and in the normal mode of operation VCSU can overlap module activations, causing a loss of network connectivity. Hence the need for a time delay.
  4. Verify the customer’s profile definitions and make sure that NICs on the A-side are not mapped to networks on the B-side. In this particular customer’s config, they had one server mapped to the wrong-side networks.
  5. Look at the VMware host and see whether Beacon Probing is enabled rather than Link State. Link State is the method preferred by both HP and VMware, and it requires enabling SmartLink on the VC networks.
  6. If the VC Flex-10 modules are attached to a Cisco core, enable PortFast on the Cisco switch ports connected to VC.
    1. http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01386629/c01386629.pdf
    2. http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA0-4515ENW (page 28, Q12)
  7. Look at the HA option on ESX. That is what will try to determine when to move the guests or put them into isolation and power them down. The default timeout is 15 seconds. This particular customer increased the timeout to 60 seconds.
  8. One more important thing that I forgot to mention, but I am sure you already knew: the ESX Broadcom drivers and boot code have to be at the right rev to work with SmartLink and Flex-10.
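
To make the activation order easier to reason about, here is a rough Python sketch of the sequence Alex describes (stage, reset the backup side, wait, reset the rest of the primary side, fail over, reset the former Primary). It is only a model for planning, not VCSU itself and not how VCSU is implemented. The `Module` class, the bay numbers, the side assignment, and both function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Module:
    bay: int                      # hypothetical interconnect bay number
    side: str                     # "primary" or "backup" (assumed labels)
    is_primary_vcm: bool = False  # True for the module running the Primary VCM


def activation_plan(modules, wait_minutes):
    """Build the ordered steps from the post as (description, bays reset)."""
    backup = [m.bay for m in modules if m.side == "backup"]
    primary_others = [
        m.bay for m in modules if m.side == "primary" and not m.is_primary_vcm
    ]
    primary_vcm = [m.bay for m in modules if m.is_primary_vcm]
    return [
        ("stage and install firmware on all modules", []),
        ("reset backup-side modules", backup),
        (f"wait {wait_minutes} minutes", []),
        ("reset primary-side modules except the Primary module", primary_others),
        ("fail over Primary to the backup side", []),
        ("reset former Primary module", primary_vcm),
    ]


def both_sides_reset_in_one_step(modules, plan):
    """Return True if any single step resets modules on both sides at once."""
    side_of = {m.bay: m.side for m in modules}
    return any(len({side_of[b] for b in bays}) > 1 for _, bays in plan)


# Assumed layout: two modules per side, Primary VCM in bay 1.
modules = [
    Module(bay=1, side="primary", is_primary_vcm=True),
    Module(bay=2, side="backup"),
    Module(bay=3, side="primary"),
    Module(bay=4, side="backup"),
]

plan = activation_plan(modules, wait_minutes=5)
for number, (description, bays) in enumerate(plan, start=1):
    print(number, description, bays)
```

The point of the check in `both_sides_reset_in_one_step` is the same as the time delay: a server with one NIC on each side only stays connected if the two sides are never down together. The model does not capture how long a module takes to come back, which is what the 5 minute wait is covering for in practice.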

 

If you have more questions, please let me know.

************************************************************************

 

Does this help you and give you confidence that you can create a system that can continue to function even when updating firmware? Have you implemented this? Let me know please.