11-02-2009 07:54 AM
Need help troubleshooting performance issue
We had a performance problem on Friday that brought a server to its knees; logins took 10 minutes to process. The problem lasted 30 minutes, until we stopped a few Oracle processes. The symptoms were 100% CPU utilization with a global priority queue of 120, a memory queue of 60, and a disk queue of 20. I'm trying to find out specifically what caused it using HP OV Performance Manager. When I look at the process data, I can see lots of processes blocked on PRI and VM. I can also see that my root/swap disks are hot.
I looked at swap: there were no pageouts and the swapout rate was 0. Global Disk VM IO was higher than normal, but not by much (30,000 to 165,000), and Global Pageins were high (16,000 to 20,000).
I'm having a bit of a problem pinpointing where the problem started or came from.
Any help would be appreciated.
11-02-2009 07:59 AM
Re: Need help troubleshooting performance issue
Looks like a process or two was CPU-bound and not playing nice with other processes.
To see the issue in real time, you would want to run glance or gpm and see which processes are running at the time.
Some part of this data is lying to you.
You say the root and swap disks were hot, but you got no paging. That does not tell a consistent story.
http://www.hpux.ws/?p=6
I'd set up a collection run with the script above to see if you can spot anything.
It also takes a top snapshot that might help you identify the processes.
SEP
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
11-02-2009 08:15 AM
Re: Need help troubleshooting performance issue
11-02-2009 08:50 AM
Re: Need help troubleshooting performance issue
What you're describing sounds a lot like the Oracle processes produced enough memory pressure that a large number of deactivations occurred (due to insufficient memory and lower priority than Oracle and affiliates). When Oracle went away, free memory rose and processes began to be reactivated. In your case, it sounds like reactivation happened as a "thundering herd", so the scheduler and swap-in paths got swamped trying to handle all the new scheduling/paging requests of the herd coming back to life, and logins suffered under the contention.
What OS version is this? What are your core kernel Process and Virtual Memory Management patch levels? (Deactivation/reactivation isn't a path that gets stressed much on performant systems, but I remember some work touching on that space, so patches may be relevant.) Was there any pattern in the scheduling priority of the reactivated processes relative to your login/shell priorities? (Reactivation should be a more gradual thing, if for no other reason than to ensure the memory pressure doesn't come right back and the system doesn't just thrash. But if all the deactivations were for processes with higher-than-shell but lower-than-Oracle priority, I can imagine a herd forming...)
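One way to test the reactivation "thundering herd" idea after the fact is to watch the swap-in column over time. This is only a sketch: it assumes HP-UX `vmstat -S`, which reports processes swapped in and out (si/so) in place of re/at, and the function name `swap_bursts` and default threshold of 10 are hypothetical. It locates the columns by header name rather than by position, so small layout differences should not matter.

```shell
#!/bin/sh
# swap_bursts (hypothetical name): flag vmstat samples showing a burst of
# processes swapped in. Assumes input from "vmstat -S <interval> <count>",
# whose header includes "si" and "so" columns. The default threshold of 10
# is a placeholder, not a recommended value.
swap_bursts() {
    awk -v limit="${1:-10}" '
        # Until the si column is found, scan each header line for it.
        !si {
            for (i = 1; i <= NF; i++) {
                if ($i == "si") si = i
                if ($i == "so") so = i
            }
            next
        }
        # Data lines start with a number (the run queue "r"); repeated
        # header lines do not, so they are skipped.
        $1 ~ /^[0-9]+$/ {
            n++
            if ($si + 0 >= limit)
                printf "sample %d: si=%s so=%s\n", n, $si, $so
        }'
}
```

For example, `vmstat -S 5 360 | swap_bursts 10` would cover half an hour in 5-second samples; a run of high si values right after the Oracle processes were stopped would fit the herd theory.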
11-02-2009 09:26 AM
Re: Need help troubleshooting performance issue
So you very likely have a classic race condition caused by your application, and you can identify the responsible PID by collecting data over time from a cron job that runs every 15 minutes.
Refer to the 'ps' man page and its -o option, especially pcpu, vsz, and comm, then collect the data in an output file.
UNIX95=1 ps -ef -o pcpu,state,pid,ppid,comm | sort -rn | head -15
UNIX95=1 ps -ef -o vsz,state,pid,ppid,comm | sort -rn | head -15
vmstat 5 5
sar -d 5 5 (* disk bottlenecks *)
And any other command that you'd like to check.
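The 15-minute collection described above could be wrapped in a small script along these lines. It is a sketch: the script name perfsnap.sh, the output directory /var/tmp/perfsnap, and the `run` argument are all hypothetical. The commands are the ones listed above, with sar -u in place of sar -d, since sar -d output can be very large on systems with many disks.

```shell
#!/bin/sh
# perfsnap.sh (hypothetical name): append one snapshot of the top CPU and
# memory consumers plus vmstat/sar samples to a per-day file.
# Example crontab entry (path is an assumption); minutes are listed
# explicitly in case your cron does not accept the */15 form:
#   0,15,30,45 * * * * /usr/local/bin/perfsnap.sh run /var/tmp/perfsnap

perfsnap() {
    outdir=${1:-/var/tmp/perfsnap}
    mkdir -p "$outdir" || return 1
    out="$outdir/perfsnap.$(date +%Y%m%d)"
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        echo "--- top 15 by %CPU"
        UNIX95=1 ps -ef -o pcpu,state,pid,ppid,comm | sort -rn | head -15
        echo "--- top 15 by VSZ"
        UNIX95=1 ps -ef -o vsz,state,pid,ppid,comm | sort -rn | head -15
        echo "--- vmstat"
        vmstat 5 5
        echo "--- sar -u"
        sar -u 5 5
    } >> "$out" 2>&1
}

# Only collect when invoked as "perfsnap.sh run [dir]".
if [ "${1:-}" = "run" ]; then
    perfsnap "${2:-}"
fi
```

Writing one file per day keeps the two days of data that was suggested easy to find and prune.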
11-02-2009 09:28 AM
Re: Need help troubleshooting performance issue
No, we see about 1% on a system with 256GB of main memory.
>>What OS version is this?
11.23
11-02-2009 09:35 AM
Re: Need help troubleshooting performance issue
swapinfo -tam
I'd like to see the ratio of main memory to swap. I've been running into some issues here where swap reservation requests are failing on a very large system.
I might be able to share some insights.
11-02-2009 10:41 AM
Re: Need help troubleshooting performance issue
              Mb       Mb       Mb   PCT  START/       Mb
TYPE       AVAIL     USED     FREE  USED   LIMIT  RESERVE  PRI  NAME
dev        71680      710    70940    1%       0        -    1  /dev/vg00/lvol2
dev       131072      702   130348    1%       0        -    1  /dev/vg00/swap2
dev       131072      701   130349    1%       0        -    1  /dev/vg00/swap3
dev       131072      706   130344    1%       0        -    1  /dev/vg00/swap4
dev       131072      705   130345    1%       0        -    1  /dev/vg00/swap5
dev       131072      710   130340    1%       0        -    1  /dev/vg00/swap6
reserve        -   563273  -563273
memory    524023   124095   399928   24%
total    1251063   691602   559321   55%       -        0    -
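SEP asked for the ratio of main memory to swap. As a rough sketch, this awk filter totals the device swap lines in swapinfo -tam output and compares them with the `memory` (pseudo-swap) line, which is used here only as a stand-in for physical RAM; substitute the real RAM size if you prefer. The function name `swap_ratio` is hypothetical, and the column positions are taken from the output above.

```shell
#!/bin/sh
# swap_ratio (hypothetical name): summarize "swapinfo -tam" output on stdin.
# Column positions follow the output posted in this thread; the "memory"
# line is pseudo-swap and only approximates physical RAM.
swap_ratio() {
    awk '
        $1 == "dev"     { dev += $2 }   # AVAIL Mb of each device swap area
        $1 == "memory"  { mem = $2 }    # AVAIL Mb of pseudo-swap
        $1 == "reserve" { rsv = $3 }    # USED Mb on the reserve line
        END {
            if (dev == 0 || mem == 0) { print "dev or memory line not found"; exit 1 }
            printf "device swap: %d MB\n", dev
            printf "pseudo-swap (memory): %d MB\n", mem
            printf "device swap / memory: %.2f\n", dev / mem
            printf "reserved: %d MB (%.0f%% of device swap)\n", rsv, 100 * rsv / dev
        }'
}
```

Run it as `swapinfo -tam | swap_ratio`. For the numbers posted above, device swap totals 727040 MB, about 1.39 times the memory line, with roughly 77% of device swap already reserved even though almost none of it is in use.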
11-02-2009 02:03 PM
Re: Need help troubleshooting performance issue
11-02-2009 02:14 PM
Re: Need help troubleshooting performance issue
Do you see a larger than usual number of processes that were started during the time you were having issues?
11-02-2009 02:31 PM
Re: Need help troubleshooting performance issue
11-02-2009 02:47 PM
Re: Need help troubleshooting performance issue
Need to see a breakdown by process. Please include the reports requested above.
11-02-2009 02:53 PM
Re: Need help troubleshooting performance issue
11-02-2009 03:06 PM
Re: Need help troubleshooting performance issue
I have included everything except the sar -d output; there are several thousand disks, so the output is very long.
11-02-2009 03:41 PM
Re: Need help troubleshooting performance issue
966304 S 6595 10378 dw.sapTPQ_DVEBMGS82
716076 S 28075 8116 dw.sapEWD_DVEBMGS32
609112 S 24831 24816 dw.sapERQ_DVEBMGS29
565144 S 10060 24816 dw.sapERQ_DVEBMGS29
So keep an eye on these processes.
Put the UNIX95 commands in a cron job that runs every 15 minutes and save the data for at least two days, unless you see vsz growth in a process beyond the figures above sooner.
Include sar -c, -u, and -v.
Attach the sar -d totals: note any disks where avwait > avserv, run pvdisplay -v on those disks, and note their file systems.
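With several thousand disks, filtering the sar -d totals is easier than reading them. A rough sketch of the avwait > avserv check, assuming the HP-UX sar -d column order (device, %busy, avque, r+w/s, blks/s, avwait, avserv) and that only the closing Average section matters; the function name `slow_disks` is hypothetical.

```shell
#!/bin/sh
# slow_disks (hypothetical name): list disks whose average wait time exceeds
# their average service time in the Average section of "sar -d" (stdin).
# Assumed columns: device %busy avque r+w/s blks/s avwait avserv
slow_disks() {
    awk '
        $1 == "Average" { avg = 1 }     # start of the summary section
        avg && NF >= 7 && $NF ~ /^[0-9.]+$/ {
            # The first Average line carries the "Average" label; later
            # lines do not, so count fields from the end of the line.
            dev = $(NF - 6); avwait = $(NF - 1); avserv = $NF
            if (avwait + 0 > avserv + 0)
                printf "%s avwait=%s avserv=%s\n", dev, avwait, avserv
        }'
}
```

For example, `sar -d 5 5 | slow_disks` prints only the suspect disks, which can then be checked with pvdisplay -v as suggested above.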
11-02-2009 06:15 PM
Re: Need help troubleshooting performance issue
This problem happened this past Friday, and the first order of business is always to free up resources; it's only after everything is back to normal that we start looking for the cause. Your suggestions will help if we have the problem again.
11-02-2009 07:01 PM
Re: Need help troubleshooting performance issue
>>The problem was 100% CPU utilization with a global priority queue of 120, a memory queue of 60, and a disk queue of 20.
The 100% CPU utilization and priority queue of 120 say it all. Logins took 10 minutes because the priority queue was high under a CPU bottleneck.
The questions here would be:
- Did you see any increased load at that time, i.e. more Oracle processes, more Java processes, more applications than usual, or more batch jobs?
- How many CPUs do you have? What is the model of the server?
- How many processes were running during that time, and how many run at normal load?
- What was the load factor at that time? Obviously it would be more than 1 or 2.
- What does the MeasureWare 'extract' report show for historical CPU/memory/IO/swap/network in/out data?
From the above we can narrow down the cause.
Hth,
Raj.
11-02-2009 07:13 PM
Re: Need help troubleshooting performance issue
What is this process?
1049892 R 18018 1 java : First in virtual memory, and its parent is init. Is it normal for its parent to be init, or should it have its own parent PID?
What is this process?
90.82 R 18669 18375 jlaunch : Second in CPU activity, behind only the kernel.
Java login?
Question to others:
Is it normal for 'kernel' to be consuming the most CPU time?
11-02-2009 07:15 PM
Re: Need help troubleshooting performance issue
What HP-UX version?
Is this a virtual server or what?
11-02-2009 07:36 PM
Re: Need help troubleshooting performance issue
>>I have included all but the sar -d there are several thousand disks so the output is very long.
Well, to get a clear picture quickly of whether the disks are being hit heavily, you can use a small script around sar -d that finds the busy disks and their corresponding VGs (see the attached find_high_io_wait_11iv2.sc). If avwait is high, you can then try to locate the cause of the problem.
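The attached script is not reproduced here, but its disk-to-VG lookup can be approximated with a sketch like this one. It assumes vgdisplay -v prints a `VG Name` line followed by `PV Name` lines for the disks in that group; the function name `disk_vg_map` is hypothetical. Alternate links are listed too, since they also appear as `PV Name` lines.

```shell
#!/bin/sh
# disk_vg_map (hypothetical name): turn "vgdisplay -v" output (stdin) into
# "<disk> <vg>" pairs, e.g. /dev/dsk/c1t2d0 in /dev/vg00 -> "c1t2d0 /dev/vg00".
disk_vg_map() {
    awk '
        $1 == "VG" && $2 == "Name" { vg = $3 }
        $1 == "PV" && $2 == "Name" {
            disk = $3
            sub(/.*\//, "", disk)       # strip the /dev/dsk/ prefix
            print disk, vg
        }'
}
```

Saving `vgdisplay -v | disk_vg_map` to a file once lets you look up each disk that sar -d flags with a simple grep on its name.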
Hope this helps,
Raj.
11-02-2009 07:43 PM
Re: Need help troubleshooting performance issue
The output shows:
kernel (pid=12326) --> using the most CPU
java process (pid=18018) --> using the most memory
swap utilization --> normal
disk I/O --> needs to be measured at the exact time of the issue, or historically while heavy jobs are running.
- This data also shows it was taken when CPU utilization was around 55%, not during the 100% period.
You can prepare one or more scripts in advance, ready to run during the performance crunch, to pinpoint the cause.
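One way to have a script ready for the crunch is a trigger that checks the 1-minute load average from uptime and only starts collecting when it crosses a threshold. A hedged sketch: the function names `load1` and `over_load` are hypothetical, and the threshold of 4 in the example is a placeholder to tune for your system.

```shell
#!/bin/sh
# load1 (hypothetical name): print the 1-minute load average from
# "uptime" output on stdin.
load1() {
    sed -n 's/.*load averages*: *\([0-9.]*\).*/\1/p'
}

# over_load THRESHOLD (hypothetical name): succeed (exit 0) if the 1-minute
# load average read from stdin is above THRESHOLD.
over_load() {
    l=$(load1)
    awk -v l="$l" -v t="$1" 'BEGIN { exit !(l + 0 > t + 0) }'
}

# Example, run from cron every few minutes (threshold is a placeholder):
#   if uptime | over_load 4; then
#       : # run the ps/vmstat/sar commands listed earlier in this thread
#   fi
```

Gating on load keeps the heavy collection commands from running all day while still catching the next event as it starts.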
Hth,
Raj.
11-02-2009 08:25 PM
Re: Need help troubleshooting performance issue
>>Here the questions would be:
>>- Did you see any increased load at that time, i.e. more Oracle processes, more Java processes, more applications than usual, or more batch jobs?
No increase; every process that was running during the problem was also running earlier in the day.
>>- How many CPUs do you have? What is the model of the server?
16, a Montecito-based Superdome.
>>- How many processes were running during that time, and how many run at normal load?
A modest increase in active processes: for most of the day active processes were 1800 ~ 2000. During the 30-minute problem they jumped to 2400 ~ 2500, then went back down to 2000.
>>- What was the load factor at that time? Obviously it would be more than 1 or 2.
A big increase in load, >6.
>>- What does the MeasureWare 'extract' report show for historical CPU/memory/IO/swap/network in/out data?
>>From the above we can narrow down the cause.
I have attached a text file of global metrics covering the 30-minute period when the problem happened.
11-02-2009 08:36 PM
Re: Need help troubleshooting performance issue
>>What is this process?
>>1049892 R 18018 1 java : First in virtual memory and gone to init. Is that normal for it to go to init or should it have a parent pid?
This is an SAP NetWeaver process. I don't know if it's normal, but whenever I look at that process its PPID is init.
>>What is this process?
>>90.82 R 18669 18375 jlaunch : 2nd in cpu activity only behind the kernel.
It's a second NetWeaver process; both have VM profiles of > 6 GB.
>>Question to Others:
>>Is it normal for 'kernel' to be consuming the most CPU time?
kernel is an SAP application process, so yes, it's normal. SAP and Oracle consume a lot of this server; it's normal for most system resources to be high (~80%). I'm pretty sure it's one of 5 processes that pushed the server over the edge: the 3 SAP processes, an Oracle Enterprise Manager (OEM) process, or a backup process. The server goes back to normal when the OEM process is stopped.
So was it that one process, and if so, what did it do to overconsume the server? Or was it a bad combination of 5 processes that all decided at that moment to increase their load?
11-02-2009 08:58 PM
Re: Need help troubleshooting performance issue
Would you attach the totals of the sar -d report?
11-02-2009 09:18 PM
Re: Need help troubleshooting performance issue
>> a modest increase in active processes, for most of the day active processes were 1800 ~ 2000. During the 30 minute problem the processes jumped up to 2400 ~ 2500, then back down to 2000.
- Well, going from 2000 to 2400 processes is a sizable bump, and it will consume a large amount of resources. In this case the processes are CPU intensive, since CPU consumption went up.
>> A big increase in load >6,
- This is a huge load for an HP-UX system; I have seen a load factor of 3 to 4 make a server freeze.
- From 16:30 to 16:55 CPU utilization was 100%.
- At that time the only noticeable change was a small increase in swap usage: 4%.
That means the additional processes are consuming more CPU.
- The next step would be to track down the process and application details and figure out whether it is normal for those extra processes to consume 70% of the CPU, since it jumped from 30% to 70%.
I have seen a 128-CPU Montecito Superdome perform poorly as the load increased. So the team that puts the load on the server keeps asking us how much load there is, and they increase the load accordingly.
- If you get the difference between the current processes and the added processes (ps -ef), tell the application team that these 400 processes caused CPU to go from 30% to 70%, and verify whether that is normal. If it is normal, the system may need more horsepower.
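To get that difference, one approach is to save a process list during normal load and another during the crunch, then compare command counts between the two. A sketch, assuming each snapshot holds `<pid> <command>` lines such as those from UNIX95=1 ps -e -o pid,comm (header lines are skipped); the function name `proc_growth` and the snapshot file names are hypothetical.

```shell
#!/bin/sh
# proc_growth BASELINE NOW (hypothetical name): for each command name,
# print how many more processes appear in NOW than in BASELINE,
# largest increase first. Lines whose first field is not a PID are ignored.
proc_growth() {
    awk '
        $1 !~ /^[0-9]+$/ { next }          # skip headers and blank lines
        NR == FNR { base[$2]++; next }     # first file: baseline counts
        { now[$2]++ }                      # second file: problem-time counts
        END {
            for (c in now)
                if (now[c] > base[c] + 0)
                    print now[c] - base[c], c
        }' "$1" "$2" | sort -rn
}
```

For example, `proc_growth ps.normal ps.crunch | head` would show which commands account for the extra 400 or so processes, which is the list to take to the application team.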
Hth,
Raj.