
Re: Diagnosing performance issues

 
Richard I Curtis
Frequent Advisor

Diagnosing performance issues

Can anyone point me in the direction of any good docs / resources regarding diagnosing and resolving performance problems on HPUX 11i ?

We had an issue recently where there was no paging, no memory contention, no CPU contention, yet > 300 processes in the waiting state. The whole system became unresponsive for approx 5 minutes with nothing in the syslog etc. I gathered the output from vmstat, swapinfo, getsysinfo, and a ps (with UNIX95 set, showing each process's memory usage) etc but I am not really getting anywhere in getting to the root cause...

Any pointers greatly appreciated !
Dennis Handly
Acclaimed Contributor

Re: Diagnosing performance issues

Have you used the trial version of glance?
Richard I Curtis
Frequent Advisor

Re: Diagnosing performance issues

I have glance but during the problem, I could not run *any* commands - not even a simple "uptime". I did try glance but it just sat there. When the system snapped back to life, glance fired up but there was nothing obvious.

While I do have glance, I have never used it in anger - if there are any docs suggesting what to look for in glance then they would be a real help... my problem at the moment is I don't know what I am looking for, so even if I have glance open, unless something jumps out at me, I am pretty much in the dark.
Steven E. Protter
Exalted Contributor

Re: Diagnosing performance issues

Shalom,

Complete diagnostics:
http://www.hpux.ws/?p=6

Based on what you posted, the system may simply have too many processes running on it. These symptoms point to a process tying the CPU up in I/O. I/O wait can cause this if all the processes are waiting for I/O or for one another.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Richard I Curtis
Frequent Advisor

Re: Diagnosing performance issues

Thanks for the link... most useful. I am looking for any info (useful man pages, docs, whitepapers etc) that will help me interpret this kind of data.

I'm not afraid of reading - just have not yet found anything comprehensive which gives any detail about the values shown and how to interpret / what to look for.
Hein van den Heuvel
Honored Contributor

Re: Diagnosing performance issues

If this is a single event, or even a repeated event with strong on/off behaviour, then I would not look at it as a performance problem but a hard error.

If you were seeing occasional significant slowdowns under load already then yes, treat it as a performance problem with a very steep knee.

>> during the problem, I could not run *any* commands

So how do you know there was no paging/memory contention and so on, as indicated? Some long-term log? Some vmstat running, perchance?

Some long-term log would / will be handy to catch this should it happen again.
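
One cheap form of long-term log is a heartbeat: a small process that wakes at a fixed interval and records how late each wake-up was. The sketch below is only an illustration, assuming a Python 3 interpreter is available on the box; the function name `heartbeat` and its parameters are made up for this example and are not part of any HP-UX tool.

```python
import time

def heartbeat(samples, interval, log, clock=time.time, sleep=time.sleep):
    """Write one timestamped line per tick, noting how late the tick arrived.

    `clock` and `sleep` are injectable so the logic can be checked without
    waiting in real time. All names here are hypothetical.
    """
    previous = clock()
    for _ in range(samples):
        sleep(interval)
        now = clock()
        # Seconds beyond the requested interval that this wake-up took.
        lateness = (now - previous) - interval
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
        log.write("%s late=%.1fs\n" % (stamp, lateness))
        log.flush()  # push each line out promptly so little is lost in a hang
        previous = now
```

A line showing a lateness of several minutes marks a window in which this process could not run or could not write. Keeping one copy of the log on local disk and one on SAN disk (if both exist) may help tell a scheduling stall from a storage stall, though that is a suggestion rather than something tested on this system.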

What was still working?
Sounds like you got terminal echo at least.
Any slaves / daemons which spoke in their reports?

It could be a high priority cpu loop, but
this sounds like a connectivity issue.
Some network or fibre switch going burp.
Was there anyone physically near the system at the time of the problem?
Something powered down accidentally and re-connected? You'd expect errors and/or time-outs, but still...

Check for reboots on all controllers, switches and such.

good luck!

Hein van den Heuvel
HvdH Performance Consulting.

Richard I Curtis
Frequent Advisor

Re: Diagnosing performance issues

>> If this is a single event, or even a repeated event with strong on/off behaviour, then I would not look at it as a performance problem but a hard error.

This issue happened twice within one hour and at no other times, but from looking through sar and vmstat output from just after the system came back to life, I am still struggling to work out what was causing it... We have Patrol on the server, and the data gathered from Patrol only shows the number of processes in a waiting state increasing from 0 up to 300 for the duration of the outages. Nothing else looks out of the ordinary. The one thing that jumps out at me is that the first hang occurred immediately after Oracle had a failed shutdown and the DBAs were investigating, and the second hang occurred after they gave up (when they reported the problem to me) and re-tried sorting Oracle out. I will ask the DBAs for more info as to what they were doing, but I wouldn't have thought Oracle should be able to almost hang the whole server - no matter what it was doing.

>>> during the problem, I could not run *any* commands
>>So how do you know there was no paging/memory contention and so on as indicated. Some long term log? Some vmstat running per chance?

I was able to establish a new logon onto the server, but from that, running commands such as uptime or glance just didn't do anything. Once the slowdown ceased, they all sprang into life.
I did look at measureware stats to conclude that there was no paging etc, *BUT* having just looked a second time, I see I didn't pay enough attention... there is no measureware data for the duration of the two hangs. There is data immediately before and after each lockup, but nothing during.

>>Was there anyone physically near the system at the time of the problem?

No - server was in a locked room with no-one in or near it.

>>Something powered down acidently and re-connected? You'd expect errors and/or time-outs, but still...

I had connectivity into the system as I was able to open a new ssh session as root.
If there was an issue with the SAN disks, or network, I would have expected errors in syslog/cstm/event.log... there is nothing.
Interestingly, syslog has got entries for my ssh connections etc so syslog was clearly working... I am starting to think the issue was only affecting new processes (but then if that was the case, why would ssh still be able to fork and give me a new session?)

>>Check for reboots on all controllers, switches and such.

I will do this when I am next in the office but I am still suspicious it was something local to the server.

Thanks for all the feedback so far guys...
skt_skt
Honored Contributor

Re: Diagnosing performance issues

"Interestingly, syslog has got entries for my ssh connections etc so syslog was clearly working... I am starting to think the issue was only affecting new processes (but then if that was the case, why would ssh still be able to fork and give me a new session?)"

I have observed many system hung states, but I never had a problem collecting the mwa/PV statistics, which are very helpful in determining what was happening on the system. (Syslog was getting updated, so why not the mwa data, which also resides in /var? The mwa processes are definitely not new and would have been running for a long time.)

Do you see any /var/tombstones/ts99 created?
Bill Hassell
Honored Contributor

Re: Diagnosing performance issues

Do you have any NFS or SAMBA/CIFS filesystems mounted on this system? How about SAN disks? Large scale disk arrays have a very annoying habit of stopping all communication (and producing apparent hangs) while they run around doing internal stuff. That's why the recommended disk timeout is 180 seconds. 'Normal' disks (i.e., JBODs) complete their I/Os in a few milliseconds, maxing out at 1/2 to 10 seconds when retrying an error. If you have big SAN connections, you'll need some tools to watch for internal maintenance and delays. Some SAN vendors provide switch stats that can show long delays.


Bill Hassell, sysadmin
Richard I Curtis
Frequent Advisor

Re: Diagnosing performance issues

>>i had observed many system hung states. But i never had problem collecting the mwa/PV statistics whihc is very helpful to determine what was happening on the system.(The syslog was getting updated then why not the mwa data which also resides in /var;definitly mwa processes are not new and might have been running for a long time. )

The measureware stats are simply missing for the two windows. There is an entry at 20:40, then the next entry is at 21:07, continuing until 21:25, and then a gap until 21:40.

>>do you see any /var/tombstones/ts99 created?

Nothing.

>>Do you have a NFS or SAMBA/CIFS filesystems mounted to this system? How about SAN disks?

We have NFS mounts, although there were no issues on other systems accessing these same mounts, and nothing in any logs suggesting issues accessing these filesystems.
There are SAN disks (a *lot* of SAN disks) but I don't have access to the arrays/switches so will have to ask the Storage guys to look at this. Nothing was reported in syslog, although we have already identified that during the problem, some processes were not writing to logs (e.g. MWA), so this isn't conclusive. I will have to see what our Storage guys can tell me about the SAN at that time.
Bill Hassell
Honored Contributor

Re: Diagnosing performance issues

The symptoms (no syslog, no mwa stats) sound a *lot* like SAN disk delays. You generally cannot use the array information because such delays are considered *normal* (BCV splits, repaired drive reinstatement, etc) so you may have to rely on SAN switch stats. Correlating these with the array's internal logs may also help trace the problem.


Bill Hassell, sysadmin
whiteknight
Honored Contributor

Re: Diagnosing performance issues

Richard,

Here is a quick guide to performance troubleshooting. Hope this helps.

WK
don't forget to assign points
Problem never ends, you must know how to fix it
bixley
Advisor

Re: Diagnosing performance issues

Do these freezes happen at the same times within the hour?

What is your DNS configuration?
Contents of /etc/nsswitch.hp_defaults or /etc/nsswitch.conf

and /etc/resolv.conf

If using DNS, are the hosts doing the resolving under stress at certain times? (That would affect your host with freezes when names cannot be resolved quickly.)
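
One rough way to check for slow resolution is to time individual lookups and log the ones that take longer than expected. This is a minimal sketch assuming Python 3 is available; `time_lookup` is a made-up helper, and `socket.gethostbyname` simply goes through whatever resolution order the host is configured with.

```python
import socket
import time

def time_lookup(name, resolver=socket.gethostbyname, clock=time.time):
    """Return (seconds_taken, result) for a single lookup of `name`.

    On failure the result is an "error: ..." string, so slow failures
    (timeouts) are recorded as well as slow successes.
    """
    start = clock()
    try:
        result = resolver(name)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        result = "error: %s" % exc
    return clock() - start, result
```

Called periodically against a name the server resolves often (for example `time_lookup("localhost")`), lookups that take seconds rather than milliseconds around the time of a freeze would point towards name resolution; consistently fast lookups would make it a less likely cause.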
skt_skt
Honored Contributor

Re: Diagnosing performance issues

"The measureware stats are simply missing for the two windows..there is an entry at 20:40, then the next entry is 21:07 continuing util 21:25 and a gap until 21:40"

Does the CPU/MEM/swap utilization look normal in the time windows where you do have data available? Was it gradually increasing until data collection stopped at some point?

I hope this server is not part of any cluster. Otherwise the server could have done a TOC and you may have a full crash dump for further analysis.