topic Re: Need help troubleshooting performance issue in Operating System - HP-UX

Need help troubleshooting performance issue

Tony Williams — Mon, 02 Nov 2009 15:54:05 GMT

Hi,

We had a performance problem on Friday where we brought a server to its knees; logins took 10 minutes to process. The problem lasted 30 minutes until we stopped a few Oracle processes. The problem was 100% CPU utilization with a global priority queue of 120, a memory queue of 60, and a disk queue of 20. I'm trying to find out specifically what the cause was using HP OV Performance Manager. When I look at process data I can see lots of processes blocked on PRI and VM. I can see that my root/swap disks are hot.

I looked at swap: there were no pageouts and the swapout rate was 0. Global Disk VM IO is higher than normal but not by much (30,000 to 165,000), and Global Pageins are high, 16,000 to 20,000.

I'm having a bit of a problem pinpointing where the problem started or came from.

Any help would be appreciated.

Re: Need help troubleshooting performance issue

Steven E. Protter — Mon, 02 Nov 2009 15:59:21 GMT

Shalom,

Looks like a process or two was bound to CPU and not playing nice with other processes.

To see the issue in real time, you would want to run glance or gpm and see what processes are running at the time.

Some part of this data is lying to you.

You say root and swap disks were hot but you got no paging. This is not telling a consistent story.

http://www.hpux.ws/?p=6

I'd set up a collection run on the script above to see if you can spot anything.

There is a top snapshot that gets done that might help you identify the processes.

SEP

Re: Need help troubleshooting performance issue

Tony Williams — Mon, 02 Nov 2009 16:15:58 GMT

Thanks Steven. Sorry, I should have said there are no page-outs; there are a high number of page-ins, 200,000 to 800,000.

Re: Need help troubleshooting performance issue

Don Morris_1 — Mon, 02 Nov 2009 16:50:01 GMT

Before you stopped the Oracle processes, was there a large amount of swap actually consumed (not reserved)?

What you're describing sounds a lot like the Oracle processes produced sufficient memory pressure such that a large number of deactivations occurred (due to insufficient memory and lower priority than Oracle and affiliates). When Oracle went away, free memory rose -- and processes began to be reactivated. In your case, it sounds like reactivation in a "thundering herd" such that the scheduler and swap-in paths got swamped trying to handle all the new scheduling/paging requests of the herd coming back to life -- and logins suffered under the contention.

What OS version is this? What are your core kernel Process and Virtual Memory Management patch levels? (Deactivation/reactivation isn't a path that gets stressed that much on performant systems, but I remember some work touching on that space such that patches may be relevant). Was there any pattern with the scheduling priority of the reactivated processes relative to your login/shell priorities? (Reactivation should be a more gradual thing -- if for no other reason, to ensure the memory pressure doesn't come right back so the system doesn't just thrash, but if all the deactivations were for higher-than-shell, but lower-than-Oracle priority, I can imagine a herd forming...)

Re: Need help troubleshooting performance issue

Michael Steele_2 — Mon, 02 Nov 2009 17:26:17 GMT

Hi

So you have a classic race condition developed by your application (* very likely *) and you can id the responsible pid by collecting data over time with a 15 minute cron.

Refer to the 'ps' man page and the -o option, especially pcpu, vsz and comm, then collect the data in an outfile.

UNIX95=1 ps -ef -o pcpu,state,pid,ppid,comm | sort -rn | head -15

UNIX95=1 ps -ef -o vsz,state,pid,ppid,comm | sort -rn | head -15

vmstat 5 5

sar -d 5 5 (* disk bottlenecks *)

And any other command that you'd like to check.
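The commands above could be wrapped in a small collector for cron along these lines. This is only a sketch under assumptions: the script layout, `OUTDIR`, `INTERVAL`, `COUNT` and the helper names are made up for illustration, and it uses `ps -e -o` rather than `-ef -o` because some `ps` implementations reject `-f` combined with `-o`.

```shell
#!/usr/bin/sh
# Hypothetical collector script; OUTDIR, INTERVAL, COUNT and the file
# naming are assumptions for illustration, not from the thread.
OUTDIR=${OUTDIR:-/var/tmp/perfsnap}
INTERVAL=${INTERVAL:-5}
COUNT=${COUNT:-5}

# Run a command only if it exists on this host, so that a missing
# tool is noted in the output instead of aborting the snapshot.
run_if_present() {
    if type "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "(skipped: $1 not found)"
    fi
}

# Append one timestamped snapshot to a per-day file under OUTDIR.
snapshot() {
    mkdir -p "$OUTDIR" || return 1
    out="$OUTDIR/snap.$(date +%Y%m%d)"
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        echo "--- top 15 by pcpu ---"
        UNIX95=1 ps -e -o pcpu,state,pid,ppid,comm 2>&1 | sort -rn | head -15
        echo "--- top 15 by vsz ---"
        UNIX95=1 ps -e -o vsz,state,pid,ppid,comm 2>&1 | sort -rn | head -15
        echo "--- vmstat ---"
        run_if_present vmstat "$INTERVAL" "$COUNT"
        echo "--- sar -d ---"
        run_if_present sar -d "$INTERVAL" "$COUNT"
        echo "=== end ==="
    } >> "$out"
}

# Only collect when invoked as "<script> run", so sourcing the file
# to reuse the functions has no side effects.
if [ "${1:-}" = "run" ]; then
    snapshot
fi
```

For a 15 minute interval, a crontab line such as `0,15,30,45 * * * * /usr/local/bin/perfsnap.sh run` would do (the path is hypothetical); listing the minutes explicitly is the portable form.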

Re: Need help troubleshooting performance issue

Tony Williams — Mon, 02 Nov 2009 17:28:19 GMT

... Before you stopped the Oracle processes, was there a large amount of swap actually consumed (not reserved)?

No, we see about 1% on a system with 256GB of main memory.

What OS version is this?
11.23

Re: Need help troubleshooting performance issue

Steven E. Protter — Mon, 02 Nov 2009 17:35:23 GMT

Shalom,

swapinfo -tam

I'd like to see the ratio of main memory to swap. I've been running into some issues here where swap reservation requests are failing on a very large system.

Might be able to share some insights.

SEP

Re: Need help troubleshooting performance issue

Tony Williams — Mon, 02 Nov 2009 18:41:52 GMT

The 1% on the device lines was from our problem period; we are normally at 0.

             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE  PRI  NAME
dev       71680     710   70940    1%       0       -    1  /dev/vg00/lvol2
dev      131072     702  130348    1%       0       -    1  /dev/vg00/swap2
dev      131072     701  130349    1%       0       -    1  /dev/vg00/swap3
dev      131072     706  130344    1%       0       -    1  /dev/vg00/swap4
dev      131072     705  130345    1%       0       -    1  /dev/vg00/swap5
dev      131072     710  130340    1%       0       -    1  /dev/vg00/swap6
reserve       -  563273 -563273
memory   524023  124095  399928   24%
total   1251063  691602  559321   55%       -       0    -
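To put the listing above in perspective, the device lines can be totaled and set against the reserve line. The following is a minimal sketch; `swap_summary` is a hypothetical helper name, and it assumes the `swapinfo -tam` column order shown above (TYPE, AVAIL, USED, ... in MB).

```shell
# Sum AVAIL (field 2) and USED (field 3) over the "dev" lines and
# report the USED figure of the "reserve" line separately.
# Header lines are ignored because their first field matches neither.
swap_summary() {
    awk '
        $1 == "dev"     { avail += $2; used += $3 }
        $1 == "reserve" { reserved = $3 }
        END {
            if (avail > 0)
                printf "device swap used: %d MB of %d MB (%.1f%%)\n", used, avail, 100 * used / avail
            printf "reserved only:    %d MB\n", reserved
        }'
}

# Figures copied from the listing in this post.
swap_summary <<'EOF'
dev 71680 710 70940 1% 0 - 1 /dev/vg00/lvol2
dev 131072 702 130348 1% 0 - 1 /dev/vg00/swap2
dev 131072 701 130349 1% 0 - 1 /dev/vg00/swap3
dev 131072 706 130344 1% 0 - 1 /dev/vg00/swap4
dev 131072 705 130345 1% 0 - 1 /dev/vg00/swap5
dev 131072 710 130340 1% 0 - 1 /dev/vg00/swap6
reserve - 563273 -563273
EOF
```

On a live system the same function would be fed with `swapinfo -tam | swap_summary`. With the figures above, about 4 GB of device swap is actually in use, while roughly 550 GB is reserved without being written to disk.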

Re: Need help troubleshooting performance issue

Michael Steele_2 — Mon, 02 Nov 2009 22:03:12 GMT

Sure doesn't look like swap

Re: Need help troubleshooting performance issue

Patrick Wallek — Mon, 02 Nov 2009 22:14:16 GMT

>>there are a high number of page-ins, 200,000 to 800,000.

Do you see a larger than usual number of processes that were started during the time you were having issues?

Re: Need help troubleshooting performance issue

Tony Williams — Mon, 02 Nov 2009 22:31:01 GMT

Over most of the day and during this time the number of processes is steady at 3700 to 3900. I'm attaching an Excel 2007 spreadsheet with a lot of the Global metrics I'm looking at. Before the CPU and Global priority queue go up, the root disk becomes very busy (400% utilization), and page requests and free memory start to go down.

Re: Need help troubleshooting performance issue

Michael Steele_2 — Mon, 02 Nov 2009 22:47:38 GMT

Hi

Need to see a breakdown by process - Please include the reports requested above

Re: Need help troubleshooting performance issue

Tony Williams — Mon, 02 Nov 2009 22:53:06 GMT

I don't seem to be able to open the file, so I'm going to try an Excel 2003 format.

Re: Need help troubleshooting performance issue

Tony Williams — Mon, 02 Nov 2009 23:06:02 GMT

Thanks Michael,

I have included all but the sar -d; there are several thousand disks, so the output is very long.

Re: Need help troubleshooting performance issue

Michael Steele_2 — Mon, 02 Nov 2009 23:41:59 GMT

1049892 R 18018 1 java
966304 S 6595 10378 dw.sapTPQ_DVEBMGS82
716076 S 28075 8116 dw.sapEWD_DVEBMGS32
609112 S 24831 24816 dw.sapERQ_DVEBMGS29
565144 S 10060 24816 dw.sapERQ_DVEBMGS29

So keep an eye on these processes.
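A listing in this form (vsz, state, pid, ppid, comm) is easier to scan with the sizes converted and orphaned processes flagged. A rough sketch follows; `vsz_report` is a hypothetical helper name, and it relies on `ps` reporting `vsz` in 1024-byte units.

```shell
# Convert vsz from KB to MB and flag processes whose parent is init.
# Expects lines of the form: vsz state pid ppid comm
vsz_report() {
    awk 'NF >= 5 && $1 ~ /^[0-9]+$/ {
        note = ($4 == 1) ? "  <- parent is init (PID 1)" : ""
        printf "%7.0f MB  pid=%s ppid=%s %s%s\n", $1 / 1024, $3, $4, $5, note
    }'
}

# The five lines quoted in this post.
vsz_report <<'EOF'
1049892 R 18018 1 java
966304 S 6595 10378 dw.sapTPQ_DVEBMGS82
716076 S 28075 8116 dw.sapEWD_DVEBMGS32
609112 S 24831 24816 dw.sapERQ_DVEBMGS29
565144 S 10060 24816 dw.sapERQ_DVEBMGS29
EOF
```

The header line of real `ps` output is skipped because its first field is not numeric.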

Put the UNIX95 commands in a 15-minute cron and save the data for at least two days unless you see vsz process growth greater than the above.

Include sar -c -u and -v.

Attach the sar -d totals. Note any disks where avwait > avserv, run pvdisplay -v on those disks, and note the file systems.
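With several thousand disks, the avwait versus avserv comparison is easier to do with a filter than by eye. The sketch below assumes the usual `sar -d` column order (device, %busy, avque, r+w/s, blks/s, avwait, avserv) with an optional leading timestamp; `hot_disks` is a hypothetical helper name and the sample lines are invented for illustration, not taken from the affected server.

```shell
# Print devices whose average wait time exceeds their average service time.
# Fields are addressed from the end of the line so that rows with and
# without a leading timestamp are handled alike; the numeric check on the
# last two fields skips the header line.
hot_disks() {
    awk 'NF >= 7 && $NF ~ /^[0-9.]+$/ && $(NF-1) ~ /^[0-9.]+$/ && $(NF-1) + 0 > $NF + 0 {
        printf "%s avwait=%s avserv=%s\n", $(NF-6), $(NF-1), $NF
    }'
}

# Invented sample in the assumed layout.
hot_disks <<'EOF'
12:00:05   device   %busy   avque   r+w/s  blks/s  avwait  avserv
12:00:10   c0t6d0   98.00    4.50     210    3400   35.20    8.10
           c4t0d1   12.00    0.50      40     640    0.00    5.30
EOF
```

On a live system this would be run as `sar -d 5 5 | hot_disks`; a device can appear more than once because the per-interval and Average rows are both matched.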

Re: Need help troubleshooting performance issue

Tony Williams — Tue, 03 Nov 2009 02:15:37 GMT

Thanks Michael,

This problem happened this past Friday, and the first order of business is always to free up resources; it's only after everything is back to normal that we start looking to see what the problem was. Your suggestions will help if we have the problem again.

Re: Need help troubleshooting performance issue

Raj D. — Tue, 03 Nov 2009 03:01:10 GMT

Tony,

>>The problem was 100% CPU utilization with a global priority queue of 120, a memory queue of 60, and a disk queue of 20.

The 100% CPU utilization and priority queue of 120 say it all. Logins took 10 minutes because the priority queue was high with a CPU bottleneck.

Here the questions would be:
- Did you see any increased load at that time, i.e. more Oracle processes, more Java processes, more applications than in the usual scenario, or more batch jobs being executed?
- How many CPUs do you have? What is the model of the server?
- How many processes were running during that time, and how many processes run at the usual load?
- What was the load factor at that time? Obviously it would be more than 1 or 2.
- What does the MeasureWare 'extract' report show in the historical data for CPU/memory/IO/swap/network in/out, etc.?
From the above we can narrow down the cause.

Hth,
Raj.

Re: Need help troubleshooting performance issue

Michael Steele_2 — Tue, 03 Nov 2009 03:13:45 GMT

HI

What is this process?

1049892 R 18018 1 java : First in virtual memory and gone to init. Is that normal for it to go to init or should it have a parent pid?

What is this process?

90.82 R 18669 18375 jlaunch : 2nd in cpu activity only behind the kernel.

Java login?

Question to Others:

Is it normal for 'kernel' to be consuming the most CPU time?

Re: Need help troubleshooting performance issue

Michael Steele_2 — Tue, 03 Nov 2009 03:15:24 GMT

HI

What HP-UX version?

Is this a virtual server or what?

Re: Need help troubleshooting performance issue

Raj D. — Tue, 03 Nov 2009 03:36:17 GMT

Tony,

>>I have included all but the sar -d there are several thousand disks so the output is very long.

Well, to get a clear idea quickly of whether the disks are being hit heavily, you can use a small script (around sar -d) to find the disks and their corresponding VGs (check the attached one: find_high_io_wait_11iv2.sc). Then, if you see that avwait is high, you can try to locate the cause of the problem.

Hope this helps..,
Raj.