
Eric K. Miller
Advisor

MSA1500cs stability

I'm trying to find "anyone" to speak to that has had "success" with the MSA1500cs in an ESX cluster environment (like, say, more than 30 days going by without a reboot of the MSA controllers).

We have a number of MSA1500cs units and all of them have to be rebooted every so often. The MSA1500cs simply stops responding on the fiber channel network. The fiber channel switch reports "Loss of Sync". Disabling/Re-enabling the fiber channel port does "no" good... only a reboot of the MSA1500cs controller solves the problem temporarily (for 10 to 20 days).

ESX 3.0.2 is being used, all firmware updates have been applied to the FC cards and switches, and firmware 5.20 is running on the MSA1500cs. We have tickets open with VMware and HP, but nobody seems to have a solution to our problems, even with everything being certified. Hardware has been replaced, various settings have been changed, etc. with no success.

I would love to discuss your environment if you have had success. Please email me at emiller at genesishosting.com.

Thanks!

Eric
Patrick Terlisten
Honored Contributor

Re: MSA1500cs stability

Hello Eric,

Have you tried upgrading to firmware release v7.00? I know of some installations with ESX and the MSA1500cs, and they have had no problems so far.

Best regards,
Patrick
Eric K. Miller
Advisor

Re: MSA1500cs stability

Thanks for your quick reply! No, we haven't gone to 7.00 yet. We were planning this, but were afraid to spend any more money on MSA equipment before having an answer to the problem we're having. This has been an ongoing problem for 2+ years, so we're kinda getting to the point where we suspect that "guessing" at firmware updates, hardware replacements, etc. isn't the answer and that there is truly a hardware design issue at fault.

Can an upgrade from 5.20 to 7.00 be done on a single controller/single FC I/O unit? If not, we have a spare that we can pull the controller/FC I/O units from and put them in one of our MSA1500cs shelves and do the upgrade (without the second FC connection since we don't have 2 fiber channel cards in all hosts "yet").

Is this possible with 7.00 to test without spending any more money on fiber channel cards, etc?

Eric
Eric K. Miller
Advisor

Re: MSA1500cs stability

I went ahead and upgraded one of our non-critical MSA1500cs shelves to 7.00 firmware, even though it has a single controller and single FC I/O module. Seems to work just fine, just as 5.20 did.

I'll do some stress testing with it. It's got 2 MSA20s attached that we can destroy and play with.

I'll keep you posted. Unfortunately, it usually takes weeks for the unit to fail.

Is there really much difference between the 5.20 and 7.00 firmware other than the active/passive versus active/active capabilities?

Eric
Eric K. Miller
Advisor

Re: MSA1500cs stability

As I suspected, the 7.00 upgrade requires a re-signaturing of datastores in ESX, which is quite a pain from what I've seen. I knew this was a possibility, but since this was a test environment, I wasn't too worried.

Anyone know of an easy way to resignature without having to basically spend days disconnecting and reconnecting every VM connected to datastores that needed resignaturing?

This was one article I found, but it's certainly not a pleasant experience:
http://www.shocknetwork.com/forum/vmware-discussion-f20/how-resignature-vmfs3-volumes-that-are-not-snapshots-t138.html

Eric
John Kufrovich
Honored Contributor

Re: MSA1500cs stability

Eric,
We (HP) and VMware are looking into the problem. In some instances, upgrading to ESX 3.5 fixes the problem.

Some have reported that disabling the HP agents addressed their problem, more specifically cmahostd. Disabling that agent will get you some temporary relief. That agent is the data collection daemon, so you will lose functionality.

jk
Eric K. Miller
Advisor

Re: MSA1500cs stability

Thanks John!

At the time of the transition from 5.20 to 7.00, we had one ESX 3.5 host and three ESX 3.0.2 hosts. However, we started with one of the ESX 3.0.2 hosts when doing a "rescan", so maybe this caused the problem. Maybe if we started with the ESX 3.5 host, it could have been avoided.

All 4 hosts are updated to 3.5 now.

As an update to our stability issue, I have 3 VMs, each on a different host, running Iometer with random I/O and 100 outstanding I/Os per target each, which completely fills the 32-I/O queue of the fiber channel controller on each of the hosts. The MSA1500cs is showing 96 or more executing tasks (it goes higher whenever the 4th host has a few things to do). Needless to say, this is a ridiculous amount of I/O requests, but that's kind of the point of the test... to see if I can overwhelm either Windows, ESX, or the MSA1500cs. So far, I can't.
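The 96 figure follows from the queue limits described above. As a rough sketch of that arithmetic (the function name and parameters are hypothetical, and it assumes each host's controller simply caps outstanding I/O at its queue depth):

```python
def executing_tasks(hosts: int, requested_per_host: int, queue_depth: int) -> int:
    """Estimate tasks in flight at the array.

    Assumption: each host can have at most `queue_depth` I/Os outstanding,
    no matter how many the benchmark inside the VM asks for.
    """
    per_host = min(requested_per_host, queue_depth)
    return hosts * per_host


# Three loaded hosts, 100 requested I/Os each, 32-deep controller queue.
print(executing_tasks(hosts=3, requested_per_host=100, queue_depth=32))  # 96
```

Anything the 4th host issues is added on top of this, which matches the "96 or more" reading.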

The fiber channel controllers are FCA2214s, and the fiber channel switch ports are locked at 2Gbps and locked to type F-Port.

I started with each VM on its own LUN, but decided to really push it and try to Storage vMotion 2 of the 3 VMs onto the 3rd VM's LUN, so all 3 are on the same LUN (so I can cause a lot of SCSI reservations).

So far, one VM has finished, and the other is 47% complete and still going. The Storage vMotion is happening "while" Iometer is running, so it's really pushing the heck out of the MSA1500cs (1,700 or so I/Os per second). The MSA1500cs is connected to 2 fully populated MSA20s, but the LUN is made of 10 250GB disks in RAID 1+0, all in the same MSA20.
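To put the 1,700 I/Os per second in per-disk terms, here is a back-of-the-envelope sketch. The read/write mix was not measured in this test, so `read_fraction` is purely an assumption, as is the idea that the load spreads evenly over the 10 disks and that the controller cache absorbs nothing. The only RAID 1+0 fact used is that each host write lands on two disks (both halves of a mirror) while a read is served by one.

```python
def raid10_iops_per_disk(host_iops: float, read_fraction: float, disks: int) -> float:
    """Approximate back-end I/Os per second seen by each disk in RAID 1+0.

    Assumptions: even distribution across disks, no cache hits,
    one disk I/O per read and two per write (mirrored pair).
    """
    reads = host_iops * read_fraction
    writes = host_iops - reads
    backend_iops = reads + 2 * writes
    return backend_iops / disks


# Bracket the unknown mix: all reads, half and half, all writes.
for mix in (1.0, 0.5, 0.0):
    print(mix, raid10_iops_per_disk(1700, mix, 10))
```

Under these assumptions each disk would see somewhere between 170 and 340 I/Os per second, depending on how write-heavy the combined Iometer and Storage vMotion load really is.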

This is with the 7.00 firmware with a single MSA1500cs controller and single FC I/O card.

I'll keep you posted on whether I break the MSA1500cs to the point where I need to reboot it.

Eric
Eric K. Miller
Advisor

Re: MSA1500cs stability

An update to this issue.

We found the solution to this issue. Locking the fiber channel switch ports to 2Gbps and to type F-Port (not allowing any other type to be detected) was the answer.

Both 5.20 and 7.00 firmware versions are working on the same switch with NO errors. All ESX hosts are stable and NO errors in /var/log/vmkwarning.

All ports on the ESX hosts are still set to auto-negotiate, as is the MSA1500cs (I don't believe there is a way to force its speed to 2Gbps).

Thankfully this issue has been solved. It's definitely been a pain.

To test the environment, multiple VMs on various hosts were running Iometer, bombarding one of our MSA1500cs units with I/O while Storage vMotioning with ESX 3.5. No problems whatsoever. The Storage vMotion was a tad slow (for obvious reasons), but after a couple of hours, the VMs were actually moved, even with all of the Iometer activity. Quite impressive.

My faith in the MSA series has been upgraded significantly.

Eric
Eric K. Miller
Advisor

Re: MSA1500cs stability

I forgot to mention that the 7.00 firmware was "not" the solution. ESX reported a large number of errors and various issues until the ports were locked on the fiber channel switch.

Eric
http://www.genesishosting.com/
Eric K. Miller
Advisor

Re: MSA1500cs stability

Well, after our last attempt to lock the speed of the fiber channel switch port and force it to an F-Port, we thought we had solved our stability problems with the MSA1500cs. Unfortunately, the fix lasted only 64 days, and the unit locked up last night. The CLI responded, but "show perf" showed little or no activity, and "show tasks" showed connections from 5 of our ESX hosts but no tasks running. A power off/on brought it back to life.

I created a forum here to discuss the issues with the MSA1500cs. I would really appreciate some input from anyone having issues with the unit to see if there are real solutions to the real problems people face with this unit:

http://www.msa1500cs.com/

Thanks!

Eric