Disk Arrays

MSA Controller Failover

Patrick Terlisten
Honored Contributor

MSA Controller Failover

Hello everybody,

I have a very mysterious problem at one of my customers. The customer has seven BL30p blade servers and one DL360 in a boot-from-SAN configuration.

- seven BL30p, redundant FC-HBAs (dual FC Mezzanine Card)
- one DL360, single FC-HBA (FCA2214)
- two M-Series 2/12 FC-Switches
- fibre-attached MSL6030
- MSA1000, redundant controllers. Each controller is connected to one M-Series switch. There is no ISL between the switches.

Zoning:

Switch 1

Name - Type - Host
Default - WWNN Zone - MSA1000 Controller 2
Default - WWNN Zone - Blade Bay 10, HBA 2
Default - WWNN Zone - Blade Bay 1, HBA 2
Default - WWNN Zone - Blade Bay 9, HBA 2
Default - WWNN Zone - Blade Bay 2, HBA 2
Default - WWNN Zone - Blade Bay 3, HBA 2
Default - WWNN Zone - Blade Bay 11, HBA 2
Default - WWNN Zone - Blade Bay 4, HBA 2

Switch 2

Name - Type - Host
Default - WWNN Zone - MSA1000 Controller 1
Default - WWNN Zone - DL360 MAX-DPLY1
Default - WWNN Zone - Blade Bay 1, HBA 1
Default - WWNN Zone - Blade Bay 9, HBA 1
Default - WWNN Zone - Blade Bay 2, HBA 1
Default - WWNN Zone - Blade Bay 10, HBA 1
Default - WWNN Zone - Blade Bay 3, HBA 1
Default - WWNN Zone - Blade Bay 11, HBA 1
Default - WWNN Zone - Blade Bay 4, HBA 1
Backup - WWNN Zone - MSL6030 embedded Fibre-Router
Backup - WWNN Zone - DL360 MAX-DPLY1

Each blade server is connected to both fabrics. The DL360 is connected to only one switch.
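
To make the path redundancy explicit, here is a minimal sketch that models the two fabrics from the zoning tables above and lists every host that sits in only one of them. The host labels are shortened from the tables, the MSA controllers and the MSL6030 router are left out, and the function name `single_path_hosts` is made up for this illustration.

```python
# Fabric membership taken from the zoning tables in the post.
# Host labels are shortened; storage and tape devices are omitted.
blade_bays = [1, 2, 3, 4, 9, 10, 11]

fabrics = {
    "switch1": {f"Blade Bay {b}" for b in blade_bays},
    "switch2": {f"Blade Bay {b}" for b in blade_bays} | {"DL360 MAX-DPLY1"},
}


def single_path_hosts(fabrics):
    """Return hosts that appear in exactly one fabric, sorted by name."""
    counts = {}
    for members in fabrics.values():
        for host in members:
            counts[host] = counts.get(host, 0) + 1
    return sorted(host for host, n in counts.items() if n == 1)


print(single_path_hosts(fabrics))  # ['DL360 MAX-DPLY1']
```

Any host in that list has no alternate path, so it loses its disks whenever the controller on its fabric is no longer the active one.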

The blades and the DL360 boot from the MSA1000.

For the last two months the customer has had the problem that the MSA controller fails over after a couple of days. The whole system ran for over a year without problems. The blades are okay after a failover (SecurePath is installed on the blade servers). The DL360 goes down (this is expected: the controller has failed over and the server has lost its disks). The only question: why does the MSA controller fail over?! There is no logical reason for a failover. The controllers are okay, there is no hardware failure. I checked cables and GBICs, no problem. I know that if one blade server detects a link failure or something like that, it will initiate a controller failover. The other blades will detect this failover, and SecurePath will switch from the preferred to the alternate path.

Any ideas? Can the DL360 cause the problem, due to a defective cable or GBIC? I changed the cable and GBIC on Saturday, but on Sunday the controller failed over again. What else can cause a controller failover?

Thanks in advance.

Regards,
Patrick

7 REPLIES
Metadata
Valued Contributor

Re: MSA Controller Failover

Hi Patrick,

I assume you have checked the PCI slot ordering for the HBAs ;) I have seen one issue with BL30p blades that sounds the same. When a blade booted Windows 2003, you would see the controller on the MSA fail over to the redundant controller. This would only happen during Windows boot time. If you failed the controller back, it would work fine. As a test, can you fail the MSA controller back and then reboot a blade? If the controller fails over, then most likely the QLogic driver is causing the failover. I have been told this will be fixed in the next release of the QLogic HBA driver.

Tom
Patrick Terlisten
Honored Contributor

Re: MSA Controller Failover

Hello Tom,

thanks for your reply. What I've seen: the MSA controller doesn't fail over during the Windows boot process. I only noticed some events in the event log, but these are caused by the loop initialization, and only blade servers in the same sleeve are affected. But this doesn't cause a controller failover, at least in my case. On Saturday I updated three of the seven blade servers with PSP 7.40b, but I don't know if a new HBA driver was included.

But thanks again for your reply. I will update the HBA drivers.

Any other ideas?

Regards,
Patrick
Metadata
Valued Contributor

Re: MSA Controller Failover

Hi Patrick,

Do not use the driver from PSP 7.40b; please use one of these drivers instead:

Scsiport driver for Win 2000 and Win 2003
http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=329290&prodSeriesId=1120361&prodNameId=421599&swEnvOID=1005&swLang=8&mode=2&taskId=135&swItem=co-37098-2

Storport driver for Windows 2003 only

http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=329290&prodSeriesId=1120361&prodNameId=421599&swEnvOID=1005&swLang=8&mode=2&taskId=135&swItem=co-36956-2

I have not come across any other failover issues with blades and MSAs. If I hear something I will update this thread.

Tom
Patrick Terlisten
Honored Contributor

Re: MSA Controller Failover

Hi Tom,

thanks for that hint. I will try to update the HBA driver. Should I install the Storport driver for W2K3 or the Scsiport driver? The customer has W2K and W2K3.

Regards,
Patrick
Metadata
Valued Contributor

Re: MSA Controller Failover

I would go for the Storport driver on the Win2K3 systems; you should "in theory" get better performance with it. But I have not seen any massive performance gains from using the Storport driver with MSA1500s.

Tom
Patrick Terlisten
Honored Contributor

Re: MSA Controller Failover

Hi Tom,

thanks. I will try the Storport driver. :)

Regards,
Patrick
Patrick Terlisten
Honored Contributor

Re: MSA Controller Failover

Hello,

I found the solution to my problem. The customer told me that they had already changed the FC cables. Okay, they changed some cables, from controller A and B to the switches.

I found out that only two blade servers (same sleeve) reported an event (error) that the connection to the active controller had failed. The other blades reported only a warning at the same time. I changed the FC cable from the blade enclosure to the first FC switch and the problem went away. It was "only" a problem with a cable. :(
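
The way the fault was narrowed down can be sketched roughly as follows: take the hosts that logged an error (not just a warning) and look at which components their paths have in common. Everything in this sketch is hypothetical: the host names, the severity strings, the component labels and the function `suspect_components` are invented for illustration and do not come from the real event logs or from any HP tool.

```python
# Hypothetical events collected from the hosts at the time of a failover.
events = [
    ("blade-1", "error"),
    ("blade-2", "error"),
    ("blade-3", "warning"),
    ("blade-4", "warning"),
]

# Hypothetical map of each host to the components on its path
# to the active controller.
path_components = {
    "blade-1": {"sleeve-A", "enclosure-uplink"},
    "blade-2": {"sleeve-A", "enclosure-uplink"},
    "blade-3": {"sleeve-B", "enclosure-uplink"},
    "blade-4": {"sleeve-B", "enclosure-uplink"},
}


def suspect_components(events, path_components):
    """Return the components shared by every host that logged an error."""
    error_hosts = [host for host, severity in events if severity == "error"]
    if not error_hosts:
        return []
    shared = set(path_components[error_hosts[0]])
    for host in error_hosts[1:]:
        shared &= path_components[host]
    return sorted(shared)


print(suspect_components(events, path_components))
# ['enclosure-uplink', 'sleeve-A']
```

The result is only a short list of candidates to check by hand (here the sleeve and the uplink cable), not a diagnosis.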

Thanks for the help.

Regards,
Patrick