Re: Need help troubleshooting performance issue

Tony Williams · ‎11-02-2009

Hi,

We had a performance problem on Friday where we brought a server to its knees; logins took 10 minutes to process. The problem lasted 30 minutes, until we stopped a few Oracle processes. The symptoms were 100% CPU utilization with a global priority queue of 120, a memory queue of 60, and a disk queue of 20. I'm trying to find out specifically what the cause was using HP OV Performance Manager. When I look at process data I can see lots of processes blocked on PRI and VM. I can also see that my root/swap disks are hot.

I looked at swap: there were no page-outs and the swap-out rate was 0. Global Disk VM IO is higher than normal but not by much (30,000 to 165,000), and Global Page-ins are high, 16,000 to 20,000.

I'm having a bit of a problem pinpointing where the problem started or came from.

Any help would be appreciated.

Steven E. Protter · ‎11-02-2009

Shalom,

Looks like a process or two was bound to CPU and not playing nice with other processes.

To see the issue in real time, you would want to run glance or gpm and see what processes are running at the time.

Some part of this data is lying to you.

You say the root and swap disks were hot, but you got no paging. That is not a consistent story.

http://www.hpux.ws/?p=6

I'd set up a collection run on the script above to see if you can spot anything.

There is a top snapshot that gets done that might help you identify the processes.

SEP

Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com

Tony Williams · ‎11-02-2009

Thanks Steven. Sorry, I should have said there are no page-outs, but there is a high number of page-ins, 200,000 to 800,000.

Don Morris_1 · ‎11-02-2009

Before you stopped the Oracle processes, was there a large amount of swap actually consumed (not reserved)?

What you're describing sounds a lot like the Oracle processes produced sufficient memory pressure such that a large number of deactivations occurred (due to insufficient memory and lower priority than Oracle and affiliates). When Oracle went away, free memory rose -- and processes began to be reactivated. In your case, it sounds like reactivation in a "thundering herd" such that the scheduler and swap-in paths got swamped trying to handle all the new scheduling/paging requests of the herd coming back to life -- and logins suffered under the contention.

What OS version is this? What are your core kernel Process and Virtual Memory Management patch levels? (Deactivation/reactivation isn't a path that gets stressed that much on performant systems, but I remember some work touching on that space such that patches may be relevant). Was there any pattern with the scheduling priority of the reactivated processes relative to your login/shell priorities? (Reactivation should be a more gradual thing -- if for no other reason, to ensure the memory pressure doesn't come right back so the system doesn't just thrash, but if all the deactivations were for higher-than-shell, but lower-than-Oracle priority, I can imagine a herd forming...)

Michael Steele_2 · ‎11-02-2009

Hi

So you have a classic race condition developed by your application (* very likely *) and you can id the responsible pid by collecting data over time with a 15 minute cron.

Refer to the 'ps' man page and the -o option, especially pcpu, vsz and comm, then collect the data in an outfile.

UNIX95=1 ps -ef -o pcpu,state,pid,ppid,comm | sort -rn | head -15

UNIX95=1 ps -ef -o vsz,state,pid,ppid,comm | sort -rn | head -15

vmstat 5 5

sar -d 5 5 (* disk bottlenecks *)

And any other command that you'd like to check.
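
A minimal sketch of how the commands above could be wrapped for cron, assuming a POSIX shell. The function names top_n and snapshot and the log path are hypothetical, not from this thread. The sketch uses -e rather than -ef because some ps implementations reject -f together with -o; adjust to whatever your ps accepts.

```sh
#!/bin/sh
# top_n: read ps-style rows on stdin, sort by the first (numeric) column,
# highest first, and keep the top N rows (default 15).
top_n() {
    sort -rn | head -n "${1:-15}"
}

# snapshot: append one timestamped sample to the log file named in $1.
snapshot() {
    out=$1
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        echo "--- top CPU ---"
        UNIX95=1 ps -e -o pcpu,state,pid,ppid,comm | top_n 15
        echo "--- top VSZ ---"
        UNIX95=1 ps -e -o vsz,state,pid,ppid,comm | top_n 15
        echo "--- vmstat ---"
        vmstat 5 5
    } >> "$out" 2>&1
}

# Example call, left commented so that sourcing this file has no side effects.
# The path is only a placeholder:
# snapshot /var/tmp/perf_snapshot.log
```

With the last line uncommented, a crontab entry with the minute field set to 0,15,30,45 would give the 15-minute interval. sar -d is left out here on purpose, since on a box with thousands of disks each sample would be very long.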

Support Fatherhood - Stop Family Law

Tony Williams · ‎11-02-2009

... Before you stopped the Oracle processes, was there a large amount of swap actually consumed (not reserved)?

No, we see about 1% on a system with 256GB of main memory.

What OS version is this?
11.23

Steven E. Protter · ‎11-02-2009

Shalom,

swapinfo -tam

I'd like to see the ratio of main memory to swap. I've been running into some issues here where swap reservation requests are failing on a very large system.

Might be able to share some insights.

SEP

Tony Williams · ‎11-02-2009

The 1% on the device lines was from our problem period; we are normally at 0.

              Mb       Mb       Mb   PCT  START/       Mb
TYPE       AVAIL     USED     FREE  USED   LIMIT  RESERVE  PRI  NAME
dev        71680      710    70940    1%       0        -    1  /dev/vg00/lvol2
dev       131072      702   130348    1%       0        -    1  /dev/vg00/swap2
dev       131072      701   130349    1%       0        -    1  /dev/vg00/swap3
dev       131072      706   130344    1%       0        -    1  /dev/vg00/swap4
dev       131072      705   130345    1%       0        -    1  /dev/vg00/swap5
dev       131072      710   130340    1%       0        -    1  /dev/vg00/swap6
reserve        -   563273  -563273
memory    524023   124095   399928   24%
total    1251063   691602   559321   55%       -        0    -
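
For what it's worth, the device lines can be totalled with a short awk filter. This is only a sketch, assuming the swapinfo -tam column order shown here (TYPE, AVAIL, USED, ...); swap_dev_summary is a made-up name. On the figures posted it adds up to 4234 MB used out of 727040 MB of device swap, about 0.58%.

```sh
# swap_dev_summary: read swapinfo -tam style output on stdin and total the
# "dev" rows (column 2 = MB available, column 3 = MB used).
swap_dev_summary() {
    awk '
        $1 == "dev" { n++; avail += $2; used += $3 }
        END {
            pct = (avail > 0) ? 100 * used / avail : 0
            printf "devices=%d avail=%d used=%d pct=%.2f\n", n, avail, used, pct
        }
    '
}
```

On a live system this would be run as swapinfo -tam | swap_dev_summary.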

Michael Steele_2 · ‎11-02-2009

Sure doesn't look like swap

Patrick Wallek · ‎11-02-2009

>>there are a high number of page-ins, 200,000 to 800,000.

Do you see a larger than usual number of processes that were started during the time you were having issues?

Tony Williams · ‎11-02-2009

Over most of the day and during this time the number of processes is steady at 3700 to 3900. I'm attaching an Excel 2007 spreadsheet with a lot of the Global metrics I'm looking at. Before the CPU and Global priority queue go up, the root disk becomes very busy (400% utilization), and page requests and free memory start to go down.

Michael Steele_2 · ‎11-02-2009

Hi

Need to see a breakdown by process - Please include the reports requested above

Tony Williams · ‎11-02-2009

I don't seem to be able to open the file, so I'm going to try an Excel 2003 format.

Tony Williams · ‎11-02-2009

Thanks Michael,

I have included everything but the sar -d output; there are several thousand disks, so the output is very long.

Michael Steele_2 · ‎11-02-2009

1049892 R 18018 1 java
966304 S 6595 10378 dw.sapTPQ_DVEBMGS82
716076 S 28075 8116 dw.sapEWD_DVEBMGS32
609112 S 24831 24816 dw.sapERQ_DVEBMGS29
565144 S 10060 24816 dw.sapERQ_DVEBMGS29

So keep an eye on these processes.

Put the UNIX95 commands in a 15-minute cron job and save the data for at least two days, unless you see vsz process growth greater than the above.

Include sar -c -u and -v.

Attach the sar -d totals. Note any disks with avwait > avserv, run pvdisplay -v on those disks, and note the file systems.
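
A rough sketch of that avwait > avserv check, which may help when the full sar -d output is too long to read. It assumes each data row ends with the avwait and avserv columns and that the device name sits six fields before the last one (the layout I would expect from HP-UX sar -d, but check it against your own output). flag_waiting_disks is a hypothetical name.

```sh
# flag_waiting_disks: read sar -d style text on stdin and print only the
# device rows where avwait is greater than avserv.
flag_waiting_disks() {
    awk '
        # Data rows have at least 7 fields and end in two numbers:
        # avwait then avserv (assumed layout). Header rows are skipped
        # because their last two fields are not numeric.
        NF >= 7 && $(NF-1) ~ /^[0-9.]+$/ && $NF ~ /^[0-9.]+$/ {
            if ($(NF-1) + 0 > $NF + 0)
                printf "%s avwait=%s avserv=%s\n", $(NF-6), $(NF-1), $NF
        }
    '
}
```

Typical use would be sar -d 5 5 | flag_waiting_disks, followed by pvdisplay -v on whatever devices it prints.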

Tony Williams · ‎11-02-2009

Thanks Michael,

This problem happened this past Friday, and the 1st order of business is always to free up resources; it's only after everything is back to normal that we start looking to see what the problem was. Your suggestions will help if we have the problem again.

Raj D. · ‎11-02-2009

Tony,

>>The problem was 100% CPU utilization with a global priority queue of 120, a memory queue of 60, and a disk queue of 20.

The 100% CPU utilization and the priority queue of 120 say it all. Logins took 10 minutes because the priority queue was high with a CPU bottleneck.

Here the question would be:
- Did you see any increased load at that time, i.e. more Oracle processes, more Java processes, more applications than usual, or more batch jobs being executed?
- How many CPUs do you have? What is the model of the server?
- How many processes were running during that time, and how many processes run at the usual load?
- What was the load factor at that time? Obviously it would be more than 1, 2 ..
- What does the MeasureWare 'extract' report show for the historical data of cpu/mem/io/swap/network in/out etc.?
From the above we can narrow down the cause.

Hth,
Raj.

" If u think u can , If u think u cannot , - You are always Right . "

Michael Steele_2 · ‎11-02-2009

HI

What is this process?

1049892 R 18018 1 java : First in virtual memory and gone to init. Is that normal for it to go to init or should it have a parent pid?

What is this process?

90.82 R 18669 18375 jlaunch : 2nd in cpu activity only behind the kernel.

Java login?

Question to Others:

Is it normal for 'kernel' to be consuming the most CPU time?

Michael Steele_2 · ‎11-02-2009

HI

What HP-UX version?

Is this a virtual server or what?

Raj D. · ‎11-02-2009

Tony,

>>I have included all but the sar -d there are several thousand disks so the output is very long.

Well, to get a clear idea quickly of whether the disks are being hit heavily, you can use a small script (around sar -d) to find the busy disks and their corresponding VGs (check the attached one: find_high_io_wait_11iv2.sc). Then, if you see that avwait is high, you can try to locate the cause of the problem.

Hope this helps..,
Raj.

Raj D. · ‎11-02-2009

again, Tony,

From the output it is showing:
kernel ( pid=12326 ) --> using top cpu
java process (pid=18018) --> using top memory
swap utilization: --> normal.
disk i/o --> to be measured at the exact time of the issue, or measured historically while heavy jobs are running.

- Also, this data shows it was taken when CPU utilization was around ~55%, not during 100%.

You can prepare a script, or several, in advance and be ready to run them during the performance crunch to pinpoint the cause.

Hth,
Raj.

Tony Williams · ‎11-02-2009

Thanks Raj,

>> Here the question would be:
>> - Did you see any increased load at that time, i.e. more Oracle processes, more Java processes, more applications than usual, or more batch jobs being executed?

No increase; every process that was running during the problem was running earlier in the day.

>> - How many CPUs do you have? What is the model of the server?

16, Montecito-based Superdome.

>> - How many processes were running during that time, and how many processes run at the usual load?

A modest increase in active processes. For most of the day active processes were 1800 ~ 2000. During the 30-minute problem the active processes jumped up to 2400 ~ 2500, then went back down to 2000.

>> - What was the load factor at that time? Obviously it would be more than 1, 2 ..

A big increase in load, >6.

>> - What does the MeasureWare 'extract' report show for the historical data of cpu/mem/io/swap/network in/out etc.?
>> From the above we can narrow down the cause.

I have attached a text file of global metrics during the 30-minute period in which the problem happened.

Tony Williams · ‎11-02-2009

Hi Michael,

>> What is this process?
>> 1049892 R 18018 1 java : First in virtual memory and gone to init. Is that normal for it to go to init or should it have a parent pid?

This is a SAP NetWeaver process. I don't know if it's normal, but whenever I look at that process its PPID is always init.

>> What is this process?
>> 90.82 R 18669 18375 jlaunch : 2nd in cpu activity only behind the kernel.

It's a second NetWeaver process; both have VM profiles of > 6 GB.

>> Question to Others:
>> Is it normal for 'kernel' to be consuming the most CPU time?

kernel is a SAP application process. Yes, it's normal. SAP and Oracle consume a lot of this server. It's normal for most system resources to be high, ~80%. I'm pretty sure it's one of 5 processes that pushed the server over the edge: the 3 SAP processes, an Oracle Enterprise Manager process, or a backup process. The server goes back to normal when the OEM process is stopped.

So was it that one process, and if so what did it do to over-consume the server, or was it a bad combination of 5 processes that all decided at that moment to increase their load?

Michael Steele_2 · ‎11-02-2009

It's hard for me to say because of the formatting, but from 1630 to 1655 Disk I/O was 100%.

Would you attach the totals of the sar -d report?

Raj D. · ‎11-02-2009

Tony,

>> a modest increase in active processes, for most of the day active processes were 1800 ~ 2000. During the 30 minute problem the processes jumped up to 2400 ~ 2500, then back down to 2000.

- Well, an increase from 2000 to 2400 processes is a good-sized bump, and it will consume a large amount of resources. In this case the processes are CPU-intensive, since they are consuming more CPU.

>> A big increase in load >6,
- This is a huge load for an HP-UX system; I have seen a load factor of 3 to 4 make a server freeze.

- From 16:30 to 16:55 CPU utilization was 100%.
- At that time the only noticeable change is a slight increase in swap usage: 4%.
That means the increased number of processes is consuming more CPU.
- The next step would be to track down the process details and application details, and try to figure out whether it is normal for those extra processes to consume 70% of the CPU,
as it was bumped from 30% to 70%.
I have seen a 128-CPU Montecito SD perform poorly as load increases, so the team putting load on the server keeps asking us how high the load is and increases it accordingly.

- If you can get the difference between the usual processes and the increased set ( ps -ef ), notify the application team that these 400 processes caused CPU to go from 30% to 70%, and verify whether that is normal. If it is normal, then the system may need more 'horse power'.
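
One way to get that difference, sketched under some assumptions: two snapshot files saved earlier (for example with UNIX95=1 ps -e -o pid,ppid,comm redirected to a file), with the command name as the last field on each row and no spaces inside it. new_commands is a hypothetical name, and the sketch expects the first file to be non-empty.

```sh
# new_commands BEFORE AFTER: for each command name, print how many more
# processes it has in AFTER than in BEFORE, largest increase first.
new_commands() {
    awk '
        NR == FNR { before[$NF]++; next }   # first file: count per command name
        { after[$NF]++ }                    # second file: count per command name
        END {
            for (c in after)
                if (after[c] > before[c])
                    printf "%d %s\n", after[c] - before[c], c
        }
    ' "$1" "$2" | sort -rn
}
```

Called as new_commands ps_normal.txt ps_problem.txt (both file names are placeholders), it gives the application team a short list of which commands account for the extra processes.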

Hth,
Raj.
