
MSA 1000 Performance Problem

 
Jason Keane
Occasional Advisor

MSA 1000 Performance Problem

Hi All,

I wonder if you could shed some light on a write performance problem we are having with a new MSA 1000 unit. HP support consultants/engineers have been looking at this for the last week and I am beginning to lose patience with the lack of progress.

The configuration is as follows:

3 x DL380 G4 Servers running Windows 2000 SP4 with 2GB RAM and mirrored 72GB 10k disks on the internal array.

Each server contains an HP FC2214 2Gb fiber card with the latest firmware and drivers (as per the support site).

These three cards connect to a Brocade switch installed internally in the MSA 1000.

The MSA 1000 is configured with 256MB cache in a 50/50 read/write split. The firmware is version 4.32.

There are 10 x 146GB 10k U320 disks in the MSA and these are divided into two RAID5 sets (4 disks each) and one RAID1 set (the other two disks).

When using Explorer to copy a file from the internal RAID or a SAN disk to a SAN disk we get approximately 10MBytes/sec throughput. Using IOmeter to write we also get approximately 10MB/s. Using IOmeter to read the disks we get throughput of about 130MB/s.
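For anyone wanting to reproduce these numbers without IOmeter, here is a rough cross-platform sketch that times a large sequential write (with a final fsync so the data actually reaches the device) and then a read, reporting MB/s. The file name and sizes are just placeholders; point TEST_PATH at the volume under test. Note the read will often come from cache, which mirrors the high read numbers seen here.

```python
import os
import time

TEST_PATH = "testfile.bin"  # hypothetical; point at the SAN volume under test
CHUNK = 64 * 1024           # 64K blocks, matching Explorer's copy size
TOTAL = 64 * 1024 * 1024    # 64MB test file (use more for a steadier number)

def write_throughput(path, total=TOTAL, chunk=CHUNK):
    """Sequentially write `total` bytes in `chunk`-sized blocks; return MB/s."""
    buf = os.urandom(chunk)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(total // chunk):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # ensure the data actually hit the device
    return total / (time.perf_counter() - start) / 1e6

def read_throughput(path, chunk=CHUNK):
    """Sequentially read the whole file back; return MB/s (often cached)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            total += len(data)
    return total / (time.perf_counter() - start) / 1e6

if __name__ == "__main__":
    w = write_throughput(TEST_PATH)
    r = read_throughput(TEST_PATH)
    print(f"write: {w:.1f} MB/s, read: {r:.1f} MB/s")
    os.remove(TEST_PATH)
```

Running this from each server simultaneously would also let you reproduce the aggregate-throughput observation further down the thread.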

To troubleshoot the problem the following things have been tried:

1. The Brocade switch has been replaced
2. The MSA controller has been replaced
3. The QLogic cards were replaced with an Emulex card
4. The cache split has been altered (0% read / 100% write)
5. A pre-release of the MSA controller firmware (version 4.4) has been tested
6. All volumes have been defragged
7. The internal array controllers (on the DL380s) have been upgraded
8. The Brocade switch has been removed and a direct fiber link to the MSA has been tested (still only 10MB/s)
9. A Dell server has been attached (still only 10MB/s)
10. Cache has been enabled and disabled (no difference) using dskcache.exe

Now for the real spanner in the works. If we start a write from one server we get 10MB/s throughput as monitored on the Brocade switch. If I start another write job on each of the other two servers, my total throughput on the switch is three times the maximum output of a single server (i.e. 10MB/s from each server, so I can write at 30MB/s!). This is verified by the port throughput performance graph on the switch.

So it looks as if the MSA is happy to write at 30MB/s, but each server seems to be limited to outputting 10MB/s. Again, I can read at 130MB/s!

Can anybody shed some light on this? Or do I just return the whole kit as defective?

Thanks for any help,
Jason.







28 REPLIES
Bostjan Kosi
Trusted Contributor

Re: MSA 1000 Performance Problem

Hi,

Have you maybe implemented zoning to separate the servers from each other at the hardware level? I would suggest checking some queues using Performance Monitor to establish exactly where the bottleneck is. Check the I/O queues; if the queue is full then there could be something in the HBA driver/settings. You'll probably find there is something within the server (OS) stopping your bits and bytes from flying. You say you also tried Emulex, and the results were the same? Is Secure Path installed? If you still have the Emulex cards, try increasing the queue depth and setting the tprlo parameters...

If the system is not yet in production, you could try creating one array with two RAID 1 volumes inside it, just to try the performance...

I hope I have given you some untried options... I must agree with you, this is strange.

rgds

Bostjan
Nothing is impossible for those that don't have to do it themselves!
Glenn N Wuenstel
Valued Contributor

Re: MSA 1000 Performance Problem

Jason,
Have you checked to make sure that there is no pending activity on the arrays? You can see this by going into ACU and looking for messages. If the array hasn't finished initialization then you will see a performance hit.

Glenn
KurtG
Regular Advisor

Re: MSA 1000 Performance Problem

Do all the LUNs you expect the same performance from use the same number of physical disks?

If you can post the "More Info" text from all the LUNs, this might help.

KurtG
Jason Keane
Occasional Advisor

Re: MSA 1000 Performance Problem

Hi Folks,

Thanks for the offering of help so far. Here's some info for you

1. We've checked the queues, etc. with Performance Monitor and all seems within normal limits.

2. The controller is not busy doing anything, as we are testing on a RAID 0 (single disk).

But now for the spanner... we upgraded one of the servers to Windows 2003. After the upgrade it was still slow (about 12MB/s). We then turned on the "Enable Advanced Performance" option for the MSA in Device Manager (Disk drives -> Policies) and we could write at about 70-80MB/s! Happy days. However, we need the equivalent of this option for Windows 2000. We are running SP4, and the hotfix described here is installed (by default with SP4):

http://support.microsoft.com/default.aspx?scid=kb;en-us;811392

When we use the dskcache.exe program to turn on the cache it says it's enabled, but there's no performance increase. It really looks now like the write cache is not enabled. We have opened a call with MS to see if there's something there.

Surely I am not the only one to see this? Are people just running with a slow SAN and not noticing?

Ta
Jason.

Re: MSA 1000 Performance Problem

Hi Jason,

We're having a similar issue with our MSA 1000 SAN attached to HP ProLiant DL380s.
- MSA firmware 4.32; 512 MB cache (2 controllers)
- 14 HDD 73GB 15K rpm.
- QLogic QLA23xx FCA

I've noticed little to no performance improvement when running an Oracle database striped over the 14 disks compared to one on just a local disk. So far, I've tried different RAID scenarios and cache settings. I'll check out the results with IOmeter.

Denys

Re: MSA 1000 Performance Problem

update
John Kufrovich
Honored Contributor

Re: MSA 1000 Performance Problem

Denys,
What is your RAID level on that 14-disk array? What is the stripe size? You mentioned Oracle; what is your block size?




Re: MSA 1000 Performance Problem

John,

I'm running a batch job that generates about 30GB of redo in 2 hours. The I/O profile, when I use dedicated disks for undo, redo, tables, index, temp, etc., is 100% busy on redo and undo. Total I/O is more than 60% write I/O.

In the Statspack report, log file sync, log buffer and db file waits account for 99% of the wait time, with a CPU/wait ratio of 30/70. Within the waits it's about 50/50 (redo write waits vs db file read waits).

One of the nuttiest configurations I tried was to stripe RAID 0 over 14 disks for redo only, to get the log file sync waits down. This did not improve things much. The application commit rate is very high, but this cannot be altered.

I have not yet tried different block sizes to reduce the db file waits. It's at 8K.

Remarkably enough, the execution time of the batch is about the same for the MSA1000 SAN (14 disks in different RAID1+0 configs) as when using 2 local disks on the DL380 (Windows 2003 SE).
KurtG
Regular Advisor

Re: MSA 1000 Performance Problem

Are you sure the LUN(s) span all 14 disks? I have seen situations where an array previously expanded with more disks did not have its LUNs reconfigured to span all the new physical disks.

KurtG
John Kufrovich
Honored Contributor

Re: MSA 1000 Performance Problem

Denys,
Can you provide the HBA registry parameters?

HKLM -> System -> CurrentControlSet -> Services

Locate either HP2300 and/or ql2300.
I need everything under Parameters -> Device.


Jason Keane
Occasional Advisor

Re: MSA 1000 Performance Problem

Hi Folks,

Just another update for you. I believe I have isolated the problem exactly, and I don't think it's related to RAID set, spindle count, etc. FYI, we have done all tests on RAID 0, 1 and 5 with the same problem.

To explain in detail: it appears that prior to Windows 2000 SP3 (i.e. RTM, SP0 and SP1) the MS code had a bug that allowed file data to be held in a controller write cache (with or without battery-backed write cache, or BBWC). This became a problem when power failed and you didn't have battery-backed controller cache, as you lost data. To resolve the data-loss issue in SP3, MS fixed the bug by updating the disk.sys driver to send the SCSI command WRITE + FLUSH for every file write command. This SCSI command has the effect of forcing the controller to write its cache immediately to disk, thus bypassing any write caching that might go on - even if you have BBWC. As a result you get really poor write performance, as you are writing directly to the disks, bypassing the write cache. The read cache is always enabled, as no data can be lost in a disk read.

The net effect of this is that a read from the MSA will have high throughput (as it's from the cache) whereas a write will have poor throughput. Testing with IOmeter will prove the read performance, as you can do a pure read, and IOmeter can be forced to filter the FLUSH command from the SCSI sequence to demonstrate high write throughput. The problem with IOmeter in this instance, though, is that it does not function the way Windows I/O happens (e.g. using Explorer, Exchange, SQL, etc.). Therefore it can give inaccurate results as to the overall true system performance.

Microsoft posted an SP3 patch, included in SP4, to resolve this issue and enable write caching using dskcache.exe. The problem is that either the MS or the HP driver ignores the WRITE command and automatically adds a FLUSH command. By putting in a FLUSH command filter (using a small kernel-mode driver) the high write throughput can be obtained. However, this configuration is not supported by either MS or HP.
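The cost of a flush forced after every write can be illustrated cross-platform with a small sketch, using os.fsync as a rough stand-in for the SCSI FLUSH / synchronize-cache behavior described above (file name hypothetical; this only approximates the Windows 2000 driver behavior, it does not reproduce it):

```python
import os
import time

def timed_writes(path, n=200, chunk=64 * 1024, flush_each=False):
    """Write n chunks of `chunk` bytes; optionally force a device flush
    after every write, mimicking the per-write FLUSH behavior described
    above. Returns observed throughput in MB/s."""
    buf = b"\0" * chunk
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(n):
            f.write(buf)
            if flush_each:
                f.flush()
                os.fsync(f.fileno())  # stand-in for the SCSI FLUSH command
    return (n * chunk) / (time.perf_counter() - start) / 1e6

if __name__ == "__main__":
    cached = timed_writes("flush_test.bin", flush_each=False)
    flushed = timed_writes("flush_test.bin", flush_each=True)
    print(f"cached: {cached:.1f} MB/s, flush-per-write: {flushed:.1f} MB/s")
    os.remove("flush_test.bin")
```

On rotating disks the flush-per-write figure is typically an order of magnitude lower, which matches the cached-read-fast / flushed-write-slow pattern reported in this thread.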

Upon escalation of the issue to both vendors (MS + HP), each is adamant that it's the other vendor's issue. To be fair to MS, at least they are still troubleshooting it; HP are of the opinion it's purely an MS issue - and maybe they are right.

This problem was resolved using Windows 2003 and the "Enable Advanced Performance" option, which effectively filters out the FLUSH command but in an MS-approved code base (i.e. you have support).

I don't believe checking LUNs, RAID sizes, types, etc. will resolve the issue. It's simply a software issue with Windows 2000 SP3 and SP4.

I hope this clarifies things for everyone.

Cheers,
Jason.



John Kufrovich
Honored Contributor

Re: MSA 1000 Performance Problem

Jason,

There was an issue with older MSA firmware and MS setting the FUA bit. The 4.xx MSA firmware ignores the FUA bit, because we have BBWC.




Jason Keane
Occasional Advisor

Re: MSA 1000 Performance Problem

Hi John,

We are running firmware 4.32 on the controller. However, if it ignores the FUA bit, how come on Windows 2003 there is no performance increase until "Enable Advanced Performance" is selected? My understanding of this option is that it turns off the flush command - is this right?

Thanks,
Jason.
John Kufrovich
Honored Contributor

Re: MSA 1000 Performance Problem

I wish I had insight into the MS SCSI driver. Perhaps that is the reason they changed the SCSI driver for 2003.

John Kufrovich
Honored Contributor

Re: MSA 1000 Performance Problem

If you have access to the CLI, look at your Windows profile. You will see we ignore the FUA (Force Unit Access) bit. If you want to test something out, change your host profile to the degraded Windows profile, where we accept the FUA.

Jason Keane
Occasional Advisor

Re: MSA 1000 Performance Problem

Hi John,

I can connect up a cable to the MSA and test from there. When you mention the degraded Windows profile, I assume this is a profile setting within the CLI of the controller?

Ta,
Jason.
John Kufrovich
Honored Contributor

Re: MSA 1000 Performance Problem

Right.
You can also set it via ACU: SSP -> select host profile.

Re: MSA 1000 Performance Problem

John,

Thanks for your input. Attached is the registry export for HP2300 and QL2300.
John Kufrovich
Honored Contributor

Re: MSA 1000 Performance Problem

Jason,
In your first post, you mention you had pre-release FW. Where did you get it?

Since you are running Windows, what does your Perfmon look like?
Perfmon -> Physical Disk -> select the LUN you are writing to. Do as you stated above. Is your graph a flat line? Describe what you see or provide a snap.





John Kufrovich
Honored Contributor

Re: MSA 1000 Performance Problem

Denys,
Thanks for sending the information.
Everything looks fine.

Let's look at something else. It will require a reboot of the server. After POST, you will see a message for the QLogic card and have the option of entering the firmware utility.
Ctrl-Q; select it. Locate the Execution Throttle parameter. Tell me what it is.

Can you describe your SAN? How many servers are connected to the MSA, any clustering, HBAs per server, MSA LUN count and MSA LUNs per server?

Jason Keane
Occasional Advisor

Re: MSA 1000 Performance Problem

John,

We got the pre-release firmware from HP themselves. We had an HP storage consultant onsite who was able to locate it from his own resources and perform the upgrade.

We currently have two MSA controllers, one with 4.32 and another with 4.42 (pre-release). These are not installed together; the 4.42 one is simply a spare part that was shipped in because it was believed the controller was the issue.

With regard to disabling/enabling the FUA bit, we see no performance increase with Windows 2000, as expected, but rather unexpectedly we see no performance decrease with Windows 2003 on either setting!

We ran a number of Perfmon tests on the LUN; which counters would be best to look at?

Not to overstep the line on Denys's question/suggestion, but we have tried throttle values of 15 (default), 60, 90 and the max of 255, with only a small change in speeds (i.e. maybe from 10MB/s to 10.4MB/s).

Cheers,
Jason.
John Kufrovich
Honored Contributor

Re: MSA 1000 Performance Problem

Jason,
Let's compare numbers.
Create a 4GB test file. At a cmd prompt type:
fsutil file createnew PATH:\filename.txt 4294967000

I'm not sure if Windows 2000 has fsutil. If not, you can copy it from an XP or 2003 machine.

Windows Explorer copies in 64K blocks.
Here are my results
My configuration
380G3, emulex hba
The server has an internal RAID 5 LUN made of 5 disks.
MSA RAID5 LUN, 4 disk = ~18MB/s
MSA RAID5 LUN, 6 disk = ~27MB/s
MSA RAID5 LUN, 10 disk = ~49MB/s

I copied the file from the internal RAID 5 to each of the MSA LUNs. I also copied back to the internal LUN several times. Even though my 380G3 has a 5i (U160), I only got around 15MB/s.

Looking at the above data, for approximation purposes you can figure ~5MB/s per disk in RAID 5 configurations, up to a point.
Looking at the above data, for proximation purposes you can figure ~5MB/s per disk in RAID 5 configurations, up to a point.


kelly trosper
Occasional Advisor

Re: MSA 1000 Performance Problem

Jason,

You wrote:
Testing with IOMETER will prove the read performance as you can do a pure read and IOMETER can be forced to filter the FLUSH command from the SCSI sequence to demonstrate high write throughput.

How do you filter the FLUSH command?

I'm seeing some poor write performance to my MSA 1000...

KT
Jason Keane
Occasional Advisor

Re: MSA 1000 Performance Problem

> I'm seeing some poor write performance to my MSA 1000...

Ahhh... I'd nearly forgotten about this post.

For those interested: in the end we had to migrate to Windows 2003, as HP argued it was an MS problem and MS argued it was a hardware issue. That didn't really work for me, but now on 2003 the system is flying along (it is a software issue, and the whole debate about spindle speed, etc. is only for squeezing the last bit out of the disks).

With regard to IOmeter and testing write performance, we had to apply a "patch" to the MSA driver before IOmeter would give the high throughput. The HP engineer supplied this patch, so I'll see if I still have it.

Cheers,
Jason.