Re: scsi queue depth and hpux

slydmin · ‎07-09-2008

We have a HPUX B11.11 PA RISC 64 server (hp9000 rp 7420).
We recently moved our storage from NetApp to an XP SAN (10K). We are facing some I/O issues, possibly related to large LUNs and OS tunable kernel parameters: scsi_max_qdepth, scsi_maxphys

We are seeing a high number of blocked processes with vmstat (vmstat 1). The disks have a very low average queue (sar -d 1 3) and a good enough response time (15 ms on avg). Most LUNs are 100% busy, though, which could spell trouble.

Reading HP's forums and manuals, I found that we can tune the kernel parameter scsi_max_qdepth, which by default is 8. I changed it to 128 to see if we could relieve at least some of the blocked processes.

However, we have not seen any change in the blocked processes at all.

Would you be able to comment on this problem?

Thanks,
-S

References:
http://publib.boulder.ibm.com/infocenter/dsichelp/ds8000ic/index.jsp?topic=/com.ibm.storage.ssic.help.doc/f2c_aghpfqudep_laxtvw.html

http://forums12.itrc.hp.com/service/forums/questionanswer.do?admit=109447627+1215615388033+28353475&threadId=634542

http://forums11.itrc.hp.com/service/forums/questionanswer.do?threadId=1057014

http://docs.hp.com/en/B3921-60631/scsi_max_qdepth.5.html

Tim Nelson · ‎07-09-2008

Well IMHO 15ms average for such a high powered array kinda stinks.

The vmstat "b" column does not necessarily mean disk ( it counts processes blocked on anything other than CPU: paging, memory io, disk, semaphores, etc..etc.. )

What do your disk queues look like ?

How many HBAs ?

Bottleneck in the SAN ?

What does the array performance look like from the array's perspective ? If there is a performance issue on the array you can tune the OS until you are blue and never get anywhere.

TwoProc · ‎07-09-2008

There are theories out there... which many people will counter all day long with each other. There are talented forum members here who use and are quite happy with the "large lun" theory.

I don't - on my 10K XP's I use the "lotsa lun" theory. Reason - on a large lun - you've got only one OS structure (the scsi queue) to push all that I/O down. When you use "lotsa luns", you've got lots of scsi structures to push data down. And with that, I rarely have to increase the scsi queue depth.
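The "lotsa lun" reasoning above can be put into numbers with a rough sketch. It assumes each LUN gets its own queue of scsi_max_qdepth entries (default 8, per the original post) and ignores any limits at the HBA or array port; `max_outstanding` is a hypothetical helper name, not an HP-UX command.

```shell
# Upper bound on I/Os the host can have in flight:
# number of LUNs multiplied by the per-LUN queue depth.
# Assumption: one queue per LUN, no HBA or array-port limit applied.
max_outstanding() {
    echo $(( $1 * $2 ))
}

max_outstanding 1 8      # one large LUN, default depth 8
max_outstanding 10 8     # ten smaller LUNs, default depth 8
max_outstanding 1 128    # one large LUN with depth raised to 128
```

Under these assumptions, ten LUNs at the default depth allow 80 I/Os in flight against 8 for a single LUN, which is why spreading data over more LUNs can have a similar effect to raising the queue depth.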

However, before you begin to think about re-laying that stuff out - what is the application you're running - can it cache more disk I/O for you ? What about the OS buffers? What are they set at? If they are low, it may be feasible to raise them. Also, what about the layout of the disks in the XP? Is it all Raid 5? Some applications don't do so well with R5 for performance. What about the cache in the XP itself? Do you just have the default amount? You can install lots and lots of cache in your XP, and I generally recommend that sysadmins put as much in there as possible. Why? Well, if you didn't think what you were doing was I/O intensive (and therefore sensitive), you wouldn't have bought an XP storage server. Following along that line of thought - if you didn't need throughput, you could have gone cheaper.

So, I think you need to think about *WHY* you're using up so much I/O, and what you can do to reduce it before you decide what you've laid out isn't good enough.

All of that being said, 10K XPs are nothing less than awesome in most any layout, so I'm afraid that messing with it may not yield much gain, as I already feel you're at the point of diminishing returns for incremental configuration changes. So, I'd see if I could reduce I/O through tuning and buffering before I'd go tearing apart the storage configuration. And lastly, did you get any help from HP in laying out the disks in that server at least somewhat optimally for that style of hardware, or did you just cut it up on your own? I'm assuming here that you've already had some best practices applied in the layout of the storage server.

We are the people our parents warned us about --Jimmy Buffett

Michael Steele_2 · ‎07-09-2008

Hi -S:

To check on process wait states system wide use UNIX95 and the -o argument for 'ps'. For example:

UNIX95= ps -ef -o etime,pcpu,vsz,rsz,pid,ppid,comm,state | sort -rn | head

Note state, etime, pcpu, vsz especially. Also, move the arguments around since you're sorting on the first argument only.

Paste in anything you find interesting. I.e, the highest consumers. What I've seen in the past is one or two processes will appear at the top of highest consumers in several areas.

Support Fatherhood - Stop Family Law

slydmin · ‎07-09-2008

@ Tim

We have a single HBA in our server. (fc 1 0/0/4/1/1 fcd CLAIMED INTERFACE HP 2Gb Dual Port PCI/PCI-X Fibre Channel Adapter (Port 2))

The disk queues are only about 0.5 to 1, but on some LUNs can go up as high as 10-15 during peak activity. Like I mentioned in my previous post, the utilization for 5 of these LUNs (datafiles/redologs/undo and temp) is about 100% all the time. Something seems fishy here, doesn't it ?

There are about 4 LUNs that are large on this server: 2x600GB and 2x1TB. The rest are 100 GB LUNs.

@Two Proc
Originally the plan was to get all 100 GB LUNs and use them, however, things did not go according to the plan and we are with what we have at this time. The server in question is running an Oracle database 9i.

Our storage engineer carved the LUNs himself. I am sure hoping he followed all the best practices for doing that. I think we have 4GB of cache on one controller and this is a 2 controller XP 10k we have here. So, I doubt if this is an issue, however, I am not saying it cannot be.

@Michael

I am not sure I follow you when you say "use UNIX95". See, I am a new inductee into the HPUX world and consequently may not understand a whole lot of jargon. I will research it on the web though. Also, the ps on my system does not have an "-o" switch. The man page does not mention it either. Am I missing something?

Thanks for your inputs. BTW did I mention the forums website is really cool? But I hate it when it times out on you after you have typed a loooong never ending response to the posts, and you have to type the whole thing again out of memory.

-S

Michael Steele_2 · ‎07-09-2008

If you copy and paste my command into the command line it will work.

You use UNIX95, which is an environment variable, to activate the -o switch.

Trust me Luke

Support Fatherhood - Stop Family Law

Tim Nelson · ‎07-11-2008

Looks like this is a little bit of everything causing the questionable performance.

1) single pathed server (multiple paths with balancing sw may relieve the stress and/or at least move it to somewhere more visible)

2) small number of large luns ( large luns may impede: one fs stack, one lvol stack, one device stack ( lots of 1s here ) ). Too many small luns are a nightmare as well. Make happy in middle.

3) disk queues ( caused by single path or disk performance at array ? )

Everything together may be adding a little bit to your issue.

Any stats from the array to report ( sorry if I missed them ).

Are these reads or writes ? or % of. Large number of writes filling the 2GB array cache may just disable the cache ( need to review array stats to be sure ).

Other options.
If you cannot make the hw faster then reduce its load by tuning the application. Maybe this is already done ?

As you can see there is an open road that takes a lot of monitor, tweak, monitor.

TwoProc · ‎07-14-2008

re: redologs/undo and temp
Are these in R01, or in R5? If any of this is R5, you're probably getting hurt big time. On the data areas - you may be hitting disk hard because of sorts on disk. Are you seeing lots of sorts on disk? (Ask DBAs). If so, you've probably got some stuff out there that needs tuning - get the DBA team to run statspack to start getting some data targets. You could also use Enterprise Manager Console for an easy way to look at the top running sql on the box, or the top running sessions on the box. All of this will give you an idea of what processes out there are taxing the server.

For hot disks try moving THOSE to R0/1 instead of R5. You'd see evidence of this if your cache hit ratio is low, OR, your I/O's per second across your database is high. If this is the case, once again it is time to start tuning the worst offending stuff in the database. You'd probably also do well with more cache in the database buffer cache areas. However, review tuning targets thoroughly FIRST before making this step.

We are the people our parents warned us about --Jimmy Buffett

Michael Steele_2 · ‎07-14-2008

Can you paste in the results from this command?

vgdisplay -v | grep -i -e 'PV Name' -e 'PV Status'

And also:

strings /etc/lvmtab

Thanks!

Support Fatherhood - Stop Family Law

Hein van den Heuvel · ‎07-14-2008

>> good enough response time (15 ms on avg

That's just a horrible response time.
Random seeks on a single drive very roughly equate to the rotation time. For a 10K rpm drive with no cache assigned that is roughly 1000*60/10000 = 6 ms, and for 15K rpm that would be 1000 (milli) * 60 (seconds) / 15000 (rpm) = 4 ms.
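The rotation-time arithmetic above can be reproduced with a one-line helper. This is only a sketch of the rule of thumb stated in the post; `rotation_ms` is a hypothetical name, not an HP-UX command.

```shell
# Time for one full platter rotation in milliseconds:
# 1000 (ms per second) * 60 (seconds per minute) / rpm.
rotation_ms() {
    awk -v rpm="$1" 'BEGIN { printf "%.1f\n", 1000 * 60 / rpm }'
}

rotation_ms 10000    # 10K rpm drive, prints 6.0
rotation_ms 15000    # 15K rpm drive, prints 4.0
```

Against either figure, a 15 ms average from a cached array is several rotations' worth of time per I/O.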

>> The server in question is running an Oracle database 9i.

So what does Oracle tell you about the reason for the blocked processes? Maybe they are waiting for the client processes to ask for more work (SQL waits)? They would be blocked. Maybe they are waiting for the transactions to commit (logsync)... the processes would be blocked.

For now _please_ forget about the system management tools other than IO/sec and MB/sec.
First and foremost ask Oracle. What are its top wait events, What is the IO response time it is seeing (reads? writes?)?
Later, you may go back to the system tools to help understand, explain and correct what Oracle is reporting.

>> possibly related to large LUNs

Small or large Luns is interesting...eventually.
That's likely a tweak ( less than 10% impact). I suspect there are bigger issues.

>> We have a single HBA in our server.

Now that would worry me a lot from a performance perspective and more so from a sales/design perspective. Who made that decision, based on what? You may need a hose, but you have been sold, or given, a straw to empty a keg!

A 2gb HBA can do 250 MB/sec on a good day.
Just 10 busy discs can deliver that.
You have 200+ disks right? 10x more!
The Data Bandwidth for an XP1000 is 8500 MB/sec... 40x more than a 2gb fibre.
250MB/sec @ 8KB/IO = max 30,000 IO/sec (much less with protocol overhead).
XP1000 cache performance is rated @ 700,000 IO/sec... 20x more.
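The HBA ceiling quoted above is simply bandwidth divided by I/O size. A minimal sketch, assuming 1 MB = 1024 KB and no protocol overhead; `max_iops` is a hypothetical helper name.

```shell
# Upper bound on I/Os per second through a link.
# Arguments: link bandwidth in MB/sec, I/O size in KB.
# Assumption: 1 MB = 1024 KB, protocol overhead ignored.
max_iops() {
    awk -v mb="$1" -v kb="$2" 'BEGIN { printf "%d\n", mb * 1024 / kb }'
}

max_iops 250 8    # 2Gb HBA at 250 MB/sec with 8KB I/Os, prints 32000
```

With 1 MB counted as 1000 KB the result is 31,250; either way it lands close to the 30,000 IO/sec quoted, and real-world figures will be lower once overhead is included.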

Hope this helps some,
Hein van den Heuvel (at gmail dot com)
HvdH Performance Consulting

Hein van den Heuvel · ‎07-14-2008

>> But I hate it when it timeouts on you after you have typed a loooong never ending response to the posts . And have to type the whole thing again out of memory.

1) Before hitting enter... ^A, ^C to "select all" and "copy".
2) The "back" button tends to bring the typed text back as well
3) Always clone a second window to check on a long delay. Oft the reply is in fact there.
4) Best advice, but one I fail to honor myself most of the time, is to type into an alternative tool (Gmail, Word,... with spellchecker), and occasionally paste into the reply window as backup and for the final submit.

Hein.

Vadsys · ‎08-20-2008

Thanks a lot for all the comments made to my OP.

We went through a lot of changes since I first knocked on ITRC's doors.

There was not "one" root problem that was causing the issue we had, but rather several issues that were overlooked or otherwise not addressed.

We completely segregated the storage for this database on the XP SAN from all other services.

Problems that accrued and caused the issue
1. Originally storage was shared with windows,citrix and everything else.
2. RAID-5 on XPSAN
3. Single 2GB HBA for the server
4. Port contention at the XPSAN level (3 databases sharing 2 ports, one was always 88% utilized)
5. No load balancing on hpux, just a fail over for the LUNs added.

First off, we took the pain, and I mean real pain, to allocate disk space for our DB in question on the current XP SAN, move all the data from those disks around and format that newly allocated space RAID 0+1.

Secondly, we isolated the sharing of ports on the SAN fabric, so we were going on one port for the other databases.

These steps helped us improve our performance by about 70% over the previous time!

Steps that should have been done right in the first place, imho, but never too late to be done at any time.

We still saw some contention at the HBA level, however, and there are plans to get another downtime to add more HBA(s) to the system. There is also some kernel level tuning that should be (and eventually will be) done to get more bang for the buck, specifically related to vxfsd, biod etc..you get the gist.

BTW, I cannot login with my old handle/account, so had to create a new one and am afraid I may not be able to close the thread.

Thank you for taking interest in my problem and providing valuable inputs!

Regards,
-S

Jeff N. Graham · ‎09-09-2008

Scary. I'm having a very similar situation. We are running:
HP9000 rp7420 (8 PA-RISC CPUs, 16 GB RAM, 2 separate A6826A HBA cards (4 HBAs total))
HP-UX 11i v1
Oracle 9i RAC (2 nodes/same hardware)
IBM SAN (8x8 array of disks)

We have an ETL process that must update 5 million rows in a 1.5 billion row table (188GB), i.e. lots of I/O. The ETL is well tuned and the table is partitioned fairly well. There is always room for improvement, but statspack shows our SQL to be efficient; by far most of our time is spent on db file sequential read (88% of time, hardly any time on anything else).
                                                                   Avg
                                                     Total Wait   wait    Waits
Event                               Waits   Timeouts   Time (s)   (ms)     /txn
---------------------------- ------------ ---------- ---------- ------ --------
db file sequential read         3,092,363          0     28,319      9  4,507.8

Tablespace
------------------------------
                    Av     Av      Av                    Av     Buffer Av Buf
         Reads Reads/s Rd(ms) Blks/Rd       Writes Writes/s      Waits Wt(ms)
-------------- ------- ------ ------- ------------ -------- ---------- ------
DATA_PRICING_02
       555,304     182    9.2     1.0      286,131       94          0    0.0
DATA_PRICING_01
       483,639     159    8.9     1.0      268,175       88          1    0.0
DATA_PRICING_03
       405,730     133    8.7     1.0      282,873       93     56,048   12.2
DATA_PRICING_04
       306,559     101    8.7     1.0      141,780       47          2    0.0
INDEX_PRICING_02
       433,181     142   10.4     1.0           51        0          0    0.0
INDEX_PRICING_01
       409,071     134   10.2     1.0          367        0          1  180.0
INDEX_PRICING_04
       285,588      94   10.0     1.0           27        0     88,147    6.9
INDEX_PRICING_03
       176,245      58   10.1     1.0        1,009        0          1    0.0
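As a sanity check on the statspack figures above, the average wait and the combined read rate can be recomputed from the numbers in the report. This is only a sketch; `avg_wait_ms` and `total_reads_per_sec` are hypothetical helper names, and the inputs are copied from the two tables.

```shell
# Average wait per event in ms = total wait time (s) * 1000 / number of waits.
avg_wait_ms() {
    awk -v secs="$1" -v waits="$2" 'BEGIN { printf "%.1f\n", secs * 1000 / waits }'
}

# Sum the per-tablespace Reads/s values passed as arguments.
total_reads_per_sec() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { print sum }'
}

avg_wait_ms 28319 3092363                            # db file sequential read, prints 9.2
total_reads_per_sec 182 159 133 101 142 134 94 58    # eight tablespaces, prints 1003
```

The recomputed 9.2 ms agrees with the per-tablespace Av Rd(ms) values of 8.7 to 10.4, and the eight tablespaces listed add up to roughly 1,000 single-block reads per second.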

We originally only had 1 HBA per node connected. We saw at least a 50% increase when we connected fiber to the 2nd HBA cards and installed multipath (not to mention much needed high availability!)

We are still having issues and processes are running behind. Storage team shows that disks are peaking around 30-40% and switches are scratching just 8% utilization.

Included is usual SAR/VMSTAT/GLANCE metrics

Highlights:
sar -Muq 5 50

HP-UX dubhst05 B.11.11 U 9000/800 09/09/08

16:02:54     cpu    %usr    %sys    %wio   %idle
             cpu runq-sz %runocc swpq-sz %swpocc
16:04:19       0      17      10      72       2
               1      12       6      77       5
               2      12       6      80       2
               3      14       4      79       3
               4      11       7      69      12
               5      13       8      77       2
               6      10       8      76       6
               7      11       8      79       2
          system      12       7      76       4
               0     0.0       0
               1     0.0       0
               2     0.0       0
               3     0.0       0
               4     1.0      20
               5     1.0      20
               6     0.0       0
               7     1.0      20
          system     1.0       8     0.0       0

sar -b 5 50

HP-UX dubhst05 B.11.11 U 9000/800 09/09/08

16:02:54 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s
16:03:44       0      23     100       3      14      81       1       1
16:03:49       0     978     100       3      65      95       3       4
16:03:54       0     362     100       2      26      94    3174       3
16:03:59       0      14     100       2      10      81    5015       6
16:04:04       0      48     100       2      19      90    5143     812
16:04:09       0     170     100       3      28      89    3894     215
16:04:14       0      32     100       7      19      63    2686     511

vmstat 5 50
   procs            memory                          page                faults         cpu
 r  b  w      avm     free    re  at  pi  po  fr  de  sr     in      sy     cs  us  sy  id
 2  0  0  1245180  4578774  1144   0   0   0   0   0   0   3935   11206    770   3   4  93
 2  0  0  1245180  4560624   496   6   0   0   0   0   0  16749   51650   6168  24   7  68
 3 24  0  1313511  4560368   163   0   0   0   0   0   0  31123   98335  11080  41  13  46
 3 24  0  1313511  4559388    80   0   0   0   0   0   0  41878  132330  14238  48  17  35
 3 21  0  1241939  4563921    76   9   0   0   0   0   0  37473  117464  13265  31  13  57
 3 21  0  1241939  4566861    25   2   0   0   0   0   0  27940   83425  10735  15   9  76
 2 19  0  1208195  4566749    17   0   0   0   0   0   0  24451   69596  10150  13   8  79
 2 19  0  1208195  4566748    12   6   0   0   0   0   0  20947   57637   9343  14   8  78
 6 21  0  1221251  4566748     4   0   0   0   0   0   0  23511   63185  10861  16   9  75
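To track the blocked-process count over a run rather than reading it row by row, the "b" column can be averaged with awk. A minimal sketch: `avg_blocked` is a hypothetical helper, and the sample rows fed to it here are abbreviated to the first three columns of the output above.

```shell
# Average the "b" (blocked) column of vmstat output read from stdin.
# Header lines are skipped by accepting only rows whose first field is numeric.
avg_blocked() {
    awk '$1 ~ /^[0-9]+$/ { sum += $2; n++ } END { if (n) printf "%.1f\n", sum / n }'
}

# On a live system this would be: vmstat 5 50 | avg_blocked
avg_blocked <<'EOF'
procs
r b w
2 0 0
2 0 0
3 24 0
3 24 0
3 21 0
3 21 0
2 19 0
2 19 0
6 21 0
EOF
```

For the nine samples shown this prints 16.6. As noted earlier in the thread, "b" is not necessarily disk, so the figure is best read alongside %wio and the Oracle wait events.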

We have worked with que_depth and found about 5% increase in going to 32.

I would like opinions on the following and the order of importance (i.e. which should we try first?):

1. Oracle says that a system is I/O bound if sar %wio is consistently above 20 (page 69) http://download.oracle.com/docs/pdf/A97297_01.pdf
Is that true?

2. SAN supports many other servers. Even though they are not 100% utilized, would segregation still be helpful?

3. Would there be a benefit to bringing other HBAs online? Only 1 port on each A6826A card is being used. If we bring on the other 2 that would give us 4 total HBAs per node. Oracle 10G experts say as a rule of thumb that a well balanced system will have 1 HBA for each CPU (Pg: 18-20 ) Does that mean we should have 8 HBAs per node? http://www.oracle.com/global/kr/download/seminar/2008/odd/0805/current_trends_in_database_performance_final.pdf

4. Is there benefit in presenting more LUNS? We have 4 partitions in the table going to 4 tablespaces going to 4 LUNS. Is there value in spreading that out? For the large 1.5B row table we have 4 luns totalling 300Gb.
