Iain Hamilton_7
Occasional Advisor

MSA1000/MSA30 performance problems

We have a MSA1000 with MSA30 and SCSI 15K disks set up as ADG.

We have a Dual Processor DL380 running Windows 2003 SP1 attached to the SAN. It is only acting as a glorified file server for 2 IIS server machines that are dishing JPG files for a web site, it performs no other roles.

When the IIS servers come under any kind of load we experience a strange problem. Throughput from the SAN seems to "choke" for a few seconds, falling to zero; it then runs OK for a while before choking again.

Looking at Performance Monitor, we never seem to get better than 1.8 MB/sec for Bytes Total/sec. The figure that seems strangest is Memory > Pages/sec, which averages 260 and goes as high as 1000. The choking seems to correspond to peaks in this counter. There is plenty of free RAM; the server has 2 GB installed.

Can anyone offer some insight/help?

Thanks

Iain Hamilton
John Kufrovich
Honored Contributor

Re: MSA1000/MSA30 performance problems

Need more information:
Which HBAs?
MSA FW version?

Monitor "Physical Disk" -> "Current Disk Queue Length". What is the average?

One more item: what is your network connection utilization? Your network could be the bottleneck.

Iain Hamilton_7
Occasional Advisor

Re: MSA1000/MSA30 performance problems

John,

I will find that info out. The SAN Hardware is only about two months old so I think the firmware etc is fairly up to date. The person who has been dealing with this is on vacation so I have picked it up.

I attach a WordPad file that shows the problem.

As you can see the performance drops rather horribly which makes me doubt that it's a networking problem.

Thanks

Iain
John Kufrovich
Honored Contributor

Re: MSA1000/MSA30 performance problems

If your MSA is that new, then the FW is most likely 4.32. I still need the HBA version.
Emulex or QLogic? You can get this info from Device Manager, under SCSI devices.

Your QL numbers look good. If the QL numbers were high, I would suspect something in the storage path.

If your network utilization is low then you're not feeding the MSA fast enough. Check this bottleneck first.



Iain Hamilton_7
Occasional Advisor

Re: MSA1000/MSA30 performance problems

John,

Thanks for the reply. It's a QLogic controller, the Driver Version is 9.0.0.13. Not sure if there is a firmware version or how to find that out.

Under Policies the Optimize for performance radio button is checked but greyed out (disabled). The Enable Write Caching check box is enabled but not checked.

Under Windows>Performance>Advanced it's set to prioritize Background Services and to favour programs for memory usage (I assume this is wrong and should be system cache), and the page file size is 4 GB (on 18.2 GB 15K SCSI mirrored disks).

Using the HP array config utility, the MSA firmware version number is 4.32. It's set up for 50% read and 50% write cache (128 MB for each).

Any ideas about the catastrophic fall to zero for throughput ?

Thanks for the help.

Iain
John Kufrovich
Honored Contributor

Re: MSA1000/MSA30 performance problems

Looking at the pages/sec, you are definitely thrashing your drives.

Did you keep the pagefile local or did you redirect to the MSA?

If your JPGs are small then your MB/s will be small but your IOPS will be high.

What does your network utilization look like? Just monitor it from Task Manager.
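
To put rough numbers on the MB/s versus IOPS point, here is a minimal sketch. The 1.8 MB/s figure is the one reported earlier in the thread; the average file sizes are purely hypothetical, since the thread does not state them, and the helper names are made up for illustration.

```python
def throughput_mb_per_s(iops: float, avg_io_kb: float) -> float:
    """Throughput in MB/s for a request rate and average request size (1 MB = 1024 KB)."""
    return iops * avg_io_kb / 1024.0


def iops_for_throughput(mb_per_s: float, avg_io_kb: float) -> float:
    """Request rate needed to reach a given throughput at a given average request size."""
    return mb_per_s * 1024.0 / avg_io_kb


if __name__ == "__main__":
    observed_mb_s = 1.8  # peak Bytes Total/sec reported in the thread
    # Hypothetical average JPG sizes in KB; the real sizes are not given.
    for size_kb in (4, 8, 32):
        needed = iops_for_throughput(observed_mb_s, size_kb)
        print(f"{size_kb:>3} KB files: {needed:.0f} requests/s for {observed_mb_s} MB/s")
```

The smaller the files, the more requests per second a given MB/s figure represents, which is why a low throughput number on its own does not show that the disks are idle.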
Iain Hamilton_7
Occasional Advisor

Re: MSA1000/MSA30 performance problems

John,

Thanks for the reply

Attached is an RTF (Zipped) with some graphs.

Task Manager shows Network Utilization running at a max of 17.5%.

From Performance Monitor you can see that disk reads to the SAN fall catastrophically to zero; pages/sec, queue length and network bytes/second follow. Disk writes continue through the catastrophe period.

The read requests are coming from IIS 6 on another Win 2K3 machine using pass-through authentication. Do you think that the system is hitting some sort of ceiling and bottling out until something clears? We have set the system up following the advice in the Microsoft TechNet article "Deploying and Configuring Internet Information Services (IIS) 6.0 with Remotely Stored Content on UNC Servers and UNC Devices".

I really don't think it's a network bandwidth issue; the way it falls to zero and the very high pages/sec value really concern me. The swap file is 2 x physical RAM and it's on a local drive. There is about 1.5 GB of free RAM too. Quite why it's swapping so much is a complete mystery to me.

We changed the cache ratio from 50/50 to 80% read 20% write and the problem seems to have got worse (difficult to be sure - it may be the site is busier).

Our image files are small and we think we probably get a low cache hit rate so we are going to go 0% Read 100% write and see what that does.

But it does feel as though we are groping around in the dark.

Thanks for any help.

Iain
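
A rough way to pin down the stall periods described above is to export the Performance Monitor counters and scan them for runs of zero reads. The sketch below assumes the samples have already been loaded into a plain list, one value per sampling interval. The function name `find_stalls` and the sample numbers are made up for illustration and are not taken from the attached graphs.

```python
from typing import List, Tuple


def find_stalls(reads_per_sec: List[float],
                min_len: int = 2,
                threshold: float = 0.0) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) sample indices of runs where the read
    rate stays at or below `threshold` for at least `min_len` samples."""
    stalls = []
    start = None
    for i, value in enumerate(reads_per_sec):
        if value <= threshold:
            if start is None:
                start = i  # a possible stall begins here
        else:
            if start is not None and i - start >= min_len:
                stalls.append((start, i - 1))
            start = None
    # Close a stall that runs to the end of the capture.
    if start is not None and len(reads_per_sec) - start >= min_len:
        stalls.append((start, len(reads_per_sec) - 1))
    return stalls


if __name__ == "__main__":
    # Hypothetical Disk Reads/sec samples, one per second.
    samples = [120, 130, 0, 0, 0, 125, 0, 118, 0, 0]
    for start, end in find_stalls(samples):
        print(f"reads at zero from sample {start} to {end}")
```

Running the same scan over pages/sec, queue length and network bytes/sec would show whether those counters really drop after the reads do, as the graphs suggest.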

John Kufrovich
Honored Contributor

Re: MSA1000/MSA30 performance problems

Interesting,
You have two blue lines; I'm assuming the active blue one is pages/sec. Notice how pages/sec mirrors your avg. QL. Can you separate the avg. QL between the internal drive where your pagefile resides and the suspected MSA LUN?

I suspect your internal array configuration is the culprit. How is it configured?


Iain Hamilton_7
Occasional Advisor

Re: MSA1000/MSA30 performance problems

The Disk Queue length is all on drive F (the SAN). Separating it out, there is very, very little activity (if any) to/from drive C. This makes the pages/sec value very strange, as the swap file is on drive C.

I have attached a PerfMon graph. During the catastrophe period the disk writes seem to continue but the disk reads fall to zero. Nothing seems to be going in or out of the server. I would have expected the write queue length to go down, but there is a spike that I can't explain. The pages/sec value has me stumped too.


The disks are set up as ADG RAID with currently no read cache (we have tried various settings that don't seem to make a difference). We have changed the surface scan delay to 20.

This is the "more information" for the MSA controller:

Bus Interface Fibre Channel
Arbitrated Loop Physical Address Domain : 1, Area : 0, Port : 0

RAID ADG status Enabled
Host Controller PCI Based Fibre Channel Adapter
Host Controller Location Slot 3
Host Controller Adapter ID 210000E08B19E6C6
Arbitrated Loop Physical Address (HOST) Domain : 1, Area : 2, Port : 0
RAID Array ID 8A2CJN71V076
RAID Array Serial Number 8A2CJN71V076
Firmware Version 4.32
Rebuild Priority Low
Expand Priority Low
Current Surface Scan Delay 20 sec
Number of Arrays 3
Number of Logical Drives 3
Number of Physical Drives 13
Physical Drives Attached to Port 1 Box 1 : Bay 1, 72.8 GB (Parallel SCSI)
Box 1 : Bay 2, 72.8 GB (Parallel SCSI)
Box 1 : Bay 3, 72.8 GB (Parallel SCSI)
Box 1 : Bay 4, 72.8 GB (Parallel SCSI)
Box 1 : Bay 5, 72.8 GB (Parallel SCSI)
Box 1 : Bay 6, 72.8 GB (Parallel SCSI)
Box 1 : Bay 7, 72.8 GB (Parallel SCSI)

Physical Drives Attached to Port 2 Box 1 : Bay 8, 72.8 GB (Parallel SCSI)
Box 1 : Bay 9, 72.8 GB (Parallel SCSI)
Box 1 : Bay 10, 72.8 GB (Parallel SCSI)
Box 1 : Bay 11, 72.8 GB (Parallel SCSI)
Box 1 : Bay 12, 72.8 GB (Parallel SCSI)
Box 1 : Bay 13, 72.8 GB (Parallel SCSI)

All Physical Drives Assigned Yes

Array Accelerator
Present Yes
Cache Status Enabled
Accelerator Ratio 0% Read /100% Write
Read Cache Size 0
Write Cache Size 256 MB
Battery Pack Count 1
Battery Status OK

This is for the Logical Drive
Controller MSA1000 Controller
Controller WWN 500805F300074951
Controller Serial Number P56350GX3QH062
Bus Interface Fibre Channel
Arbitrated Loop Physical Address Domain : 1, Area : 0, Port : 0

Array A
Array Type Parallel SCSI
Number of Logical Drives 1

Logical Drive 1
Size 277857 MB
Fault Tolerance RAID ADG
Heads 255
Sectors per Track 32
Cylinders 65535
Stripe Size 16 KB
Status OK
Failed Physical Drives None
Array Accelerator Enabled
Selective Storage Presentation Status Enabled
Host Controllers Having Access 210000E08B19E6C6(IIS CS04)


Thanks


Iain Hamilton


John Kufrovich
Honored Contributor

Re: MSA1000/MSA30 performance problems

That last one explains everything.

When an application starts, Windows reserves memory for that application. For any extra memory that is required, the OS will use virtual memory first. Once the virtual memory is used, the OS will attempt to use physical memory. In your case, you're reading data off the MSA LUN and it is paging it for your application.

You have several options.
Depending on your CPU utilization, you could disable virtual memory. This will force the OS to use your available memory. The only negative: if you blue screen, you lose the memory dump. You should notice an overall speed improvement.

Or,
There are some utilities that actually tweak the OS into utilizing more memory.

Finally,
Linux. With Linux you can customize your memory to application requirements. Linux attempts to use all available memory first, then attempts to use virtual memory.


Let me add,
You are not queue-depth bound, but you can increase the queue depth (QD) of the HBA. This will open the pipe to the MSA, though you may not see any benefit. QLogic driver parameters take precedence over the QLogic NVRAM settings. You will have to locate ql2300 in the registry. Look under Parameters -> devices, then the driver parameters. Edit and increase the QD value. If no QD value is there, you will have to reboot the server and go into the QLogic NVRAM: after POST, the QLogic prompt will appear; hit Ctrl-Q. Locate "Execution Throttle", edit this value and reboot. I believe the default QLogic QD value is 32, and looking at your perfmon numbers you aren't touching this limit.

One negative of having too large a QD: the MSA could issue a busy back to the OS. Windows will wait two seconds before reissuing the command. Two seconds can seem like an eternity in the micro- or millisecond world.


John Kufrovich
Honored Contributor

Re: MSA1000/MSA30 performance problems

Regarding what I mentioned above about disabling VM: you could try setting a fixed size instead of letting Windows decide for you. Some applications could complain if there is no VM. If you are just using it as a file server, then you shouldn't have any issues.

Iain Hamilton_7
Occasional Advisor

Re: MSA1000/MSA30 performance problems

John,

I'm not sure I understand what is going on and why disabling Virtual Memory would help.

By disabling Virtual Memory I assume you mean setting the swap file to a very low value like 10 MB.

Can you explain that a bit more?

Also, I've looked in the registry and there is no QueueDepth value. When I change the value under "Execution Throttle", should I increase it or decrease it?

Thanks

Iain Hamilton
Michael Boots_1
Advisor

Re: MSA1000/MSA30 performance problems

Iain ... it sounds like you've been working on my system!!

We have the EXACT same problem .. well, I'll qualify that and say that it sounds very similar ;) .

We have a BL20pG2 (Windows 2000 SP4) connecting through a rebadged Brocade to the MSA1000. Similar settings on the MSA controller as your original post. The MSA has 2 separate arrays: 1 x RAID ADG (10 x 146GB 19K) and 1 x RAID 1+0 (4 x 72GB 15K). The second is accessed by a separate system which runs Windows 2003 and does not have the same issues.

We initially had zoning problems in our FC environment, which have since been fixed. Immediately after resolving the zoning we had a gap of a few weeks where the issue didn't happen, but it has come back in full force.

This is happening on our standard Windows file server.

During times of peak load (backups, large groups of folder permission resets) we get EventID:51 spamming the event logs, and have pauses where all I/O to the server appears to freeze.

There is a LOT of paging going on even though the server has 2GB of RAM and always at least 1.4GB free.

Running CheckDisk on the volume fails every time. (We have not let it run with the auto-fix errors option turned on for fear that it would "fix" everything.)

We have made an assumption that the zoning problems have caused some sort of file system corruption and will be holding a shotgun to its head next weekend, blowing it all away, rebuilding the file server (Win2K unfortunately), and restoring all the data from tape. HP have been involved in the process, and at the moment are in agreement with this theory. We will also upgrade the firmware of the MSA, and the support pack on the Win2K file server.

I'll let you know how we go..... good luck with yours.


Here is the "more information":
--------------------------------
Controller MSA1000 Controller
Controller WWN 5008xxxxxxxxxxxx
Controller Serial Number xxxxxxxxxxxxxx
Bus Interface Fibre Channel
Arbitrated Loop Physical Address Domain : 1, Area : 23, Port : 0

RAID ADG status Enabled
Host Controller PCI Based Fibre Channel Adapter
Host Controller Location Slot 0
Host Controller Adapter ID 5008xxxxxxxxxxxx
Arbitrated Loop Physical Address (HOST) Domain : 1, Area : 21, Port : 0
RAID Array ID SGxxxxxxxx
RAID Array Serial Number SGxxxxxxxx
Firmware Version 4.32
Rebuild Priority Low
Expand Priority Low
Current Surface Scan Delay 3 sec
Number of Arrays 2
Number of Logical Drives 2
Number of Physical Drives 14
Physical Drives Attached to Port 1 Box 1 : Bay 1, 146.8 GB (Parallel SCSI)
Box 1 : Bay 2, 146.8 GB (Parallel SCSI)
Box 1 : Bay 3, 146.8 GB (Parallel SCSI)
Box 1 : Bay 4, 146.8 GB (Parallel SCSI)
Box 1 : Bay 5, 146.8 GB (Parallel SCSI)
Box 1 : Bay 6, 72.8 GB (Parallel SCSI)
Box 1 : Bay 7, 72.8 GB (Parallel SCSI)

Physical Drives Attached to Port 2 Box 1 : Bay 8, 146.8 GB (Parallel SCSI)
Box 1 : Bay 9, 146.8 GB (Parallel SCSI)
Box 1 : Bay 10, 146.8 GB (Parallel SCSI)
Box 1 : Bay 11, 146.8 GB (Parallel SCSI)
Box 1 : Bay 12, 146.8 GB (Parallel SCSI)
Box 1 : Bay 13, 72.8 GB (Parallel SCSI)
Box 1 : Bay 14, 72.8 GB (Parallel SCSI)

All Physical Drives Assigned Yes

Array Accelerator
Present Yes
Cache Status Enabled
Accelerator Ratio 50% Read /50% Write
Read Cache Size 256 MB
Write Cache Size 256 MB
Battery Pack Count 2
Battery Status OK
--------------------------------

Do you know Prov0?
John Kufrovich
Honored Contributor

Re: MSA1000/MSA30 performance problems

As I stated earlier, windows will always do paging vs using up available memory.

There are some performance monitoring capabilities in v4.48 FW on the MSA1000. Can I get you to upgrade to the latest FW.

Iain Hamilton_7
Occasional Advisor

Re: MSA1000/MSA30 performance problems

We have an escalated incident with Microsoft running at the moment. I will keep you informed of the progress.
John Kufrovich
Honored Contributor

Re: MSA1000/MSA30 performance problems

Microsoft does have an issue with the classpnp.sys driver, but I think it's only applicable to Windows 2003.
Ard Blenke
New Member

Re: MSA1000/MSA30 performance problems

We are facing similar problems: the servers connected to our MSA1000 stop responding for some time (5~15 secs). The network activity goes down to zero.
I would like to know what the resolution to your problem was.
The servers function as file servers and an Exchange server. The file server stores the users' roaming profiles.
Thanks
Ard Blenke
Michael Boots_1
Advisor

Re: MSA1000/MSA30 performance problems

In our instance, we put it down to 3 different causes:

- zoning had not been configured correctly
- once the zoning had been corrected we continued to have the problem. Doing some file system checks, we also worked out that we had some file system corruption. We blew away the entire data partition and restored from tape; this seemed to fix the problem. We believe that the zoning problems could have caused the file system corruption
- a contributing factor to our file server performance (but not the SCSI lockups), which was also causing some problems on our mail server, was a faulty patch lead to one of our domain controllers, which served as the primary DNS and WINS server for our domain.

Good luck
Do you know Prov0?
Iain Hamilton_7
Occasional Advisor

Re: MSA1000/MSA30 performance problems

In the end our problem proved to be a Windows Server issue. Basically, because we were accessing static content (images) very, very frequently (2000/second), Windows was hanging whilst it periodically updated the last accessed times for all those files !!
We disabled this in the registry:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem

Add Value

Value Name: NtfsDisableLastAccessUpdate
Data Type : REG_DWORD
Data : 1

reboot

That seemed to do it.
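
For anyone who wants to apply the same change to several servers, one option is to generate a .reg file rather than editing by hand. This is only a sketch: the key path, value name and data are the ones given above, while the helper name `build_reg_file` is made up for illustration. The script only builds and prints the text; importing the result (and rebooting) is left to the administrator. On systems that ship `fsutil`, `fsutil behavior set disablelastaccess 1` should, as far as I know, set the same flag.

```python
def build_reg_file(key_path: str, value_name: str, dword: int) -> str:
    """Build the text of a .reg file that sets one REG_DWORD value."""
    if not 0 <= dword <= 0xFFFFFFFF:
        raise ValueError("a REG_DWORD must fit in 32 bits")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key_path}]",
        f'"{value_name}"=dword:{dword:08x}',  # DWORDs are written as 8 hex digits
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    text = build_reg_file(
        r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem",
        "NtfsDisableLastAccessUpdate",
        1,
    )
    print(text)
```

Setting the value back to 0 (or deleting it) would restore last-access updates if anything turns out to depend on them.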
Ard Blenke
New Member

Re: MSA1000/MSA30 performance problems

I guess our problem is related to that as well. Last week I found out that one user was creating a heavy load on one server by converting large TIFF files into JPG files (total files over 100 GB). Both the source and destination were on the server connected to the SAN. However, the problems occurred on the second server connected to the SAN: it would very frequently simply stop responding for 10 or 20 seconds while the first server seemed to carry on happily processing the large data sets.
This I do not understand. It seems like the load balancing of the servers connected to the SAN is not working properly. If the load is too high I would expect all servers to become slow.
Ard Blenke
New Member

Re: MSA1000/MSA30 performance problems

After some discussions with our supplier and HP we finally got some good support from a technical engineer. Basically it comes down to the fact that the MSA1000 doesn't perform well with large array configurations such as ours (19 disks in one array).
This is a surprise to me as I always understood that more disks in an array will improve the performance.

For best performance in the MSA1000 you should create multiple arrays with a maximum of 4 disks, where each disk is on a separate SCSI bus. This should give a throughput of about 45MB/s per array. Increasing the number of disks in the array will slow it down. In our case it was down to 15MB/s, which then had to be shared over multiple servers. If you assign each server an array then each server has 45MB/s instead of sharing 15MB/s.

Would have been nice to know before we started. We've had HP engineers looking at this problem before and they never mentioned this.
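
The difference between the two layouts described above can be put into a quick back-of-the-envelope calculation. The 45 MB/s and 15 MB/s figures are the ones quoted by the engineer; the number of servers (3) is an assumption, since the thread does not say exactly how many share the array, and an even split between servers is also assumed.

```python
def per_server_mb_s(array_mb_s: float, servers_sharing: int) -> float:
    """Throughput available to each server if an array's throughput is split evenly."""
    if servers_sharing < 1:
        raise ValueError("at least one server must use the array")
    return array_mb_s / servers_sharing


if __name__ == "__main__":
    servers = 3  # hypothetical count, not stated in the thread

    shared = per_server_mb_s(15.0, servers)   # one large array shared by all servers
    dedicated = per_server_mb_s(45.0, 1)      # one small array per server

    print(f"shared large array : {shared:.1f} MB/s per server")
    print(f"dedicated 4-disk   : {dedicated:.1f} MB/s per server")
    print(f"ratio              : {dedicated / shared:.0f}x")
```

In practice the load is unlikely to split evenly, so this only shows the order of magnitude of the gap rather than a measured result.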