03-04-2012 07:22 AM
Server Hanging - 15 - 30 mins w/ Very High Load, then Resumes
RHEL 5.7, 24-way Intel Dunnington CPUs (4x Xeon 7400s, 6-core), 128GB RAM, 4x 4Gbit FC storage channels to a Tier-1 XP12K SAN. The server occasionally hangs: it becomes inaccessible via console, ssh or SQLNet (the server is a DB host) but remains pingable. The issue always seems to be characterised by very high system load (sometimes reaching over 500!) and %system CPU reaching 100%, with the kswapd and kjournald processes becoming hyperactive.
RHEL Support says a vmcore needs to be produced whilst the system is hanging, but we've not been able to do that, as every time we're called the server is already accessible again. We have sar and top stats running every minute. Below are the stats from "top" around the time it was hanging:
top - 03:19:15 up 14 days, 10:49, 1 user, load average: 9.23, 5.61, 4.40
Tasks: 2728 total, 7 running, 2713 sleeping, 0 stopped, 8 zombie
Cpu(s): 9.8%us, 13.6%sy, 0.0%ni, 42.6%id, 33.0%wa, 0.0%hi, 0.9%si, 0.0%st
Mem: 131576432k total, 130991428k used, 585004k free, 128996k buffers
Swap: 41943032k total, 10339928k used, 31603104k free, 19088740k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1613 root 13 -5 0 0 0 D 24.8 0.0 74:08.32 kswapd0
30371 oracle 16 0 752m 60m 15m S 24.1 0.0 0:09.15 oraagent.bin
27431 oracle 15 0 4247m 486m 481m S 20.2 0.4 3:14.73 oracle

top - 03:19:47 up 14 days, 10:49, 1 user, load average: 11.08, 6.38, 4.70
Tasks: 2772 total, 38 running, 2726 sleeping, 0 stopped, 8 zombie
Cpu(s): 16.6%us, 26.3%sy, 0.0%ni, 22.5%id, 33.6%wa, 0.0%hi, 0.9%si, 0.0%st
Mem: 131576432k total, 131015736k used, 560696k free, 7416k buffers
Swap: 41943032k total, 10375728k used, 31567304k free, 18956176k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1613 root 12 -5 0 0 0 R 36.2 0.0 74:19.23 kswapd0
31361 oracle 15 0 4248m 256m 251m S 21.4 0.2 0:07.34 oracle
27431 oracle 15 0 4247m 486m 481m S 20.5 0.4 3:20.90 oracle

top - 03:20:27 up 14 days, 10:50, 1 user, load average: 47.72, 16.35, 8.12
Tasks: 2785 total, 46 running, 2731 sleeping, 0 stopped, 8 zombie
Cpu(s): 9.9%us, 76.8%sy, 0.0%ni, 4.3%id, 7.1%wa, 0.0%hi, 2.0%si, 0.0%st
Mem: 131576432k total, 130953344k used, 623088k free, 10596k buffers
Swap: 41943032k total, 10418420k used, 31524612k free, 18922924k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
30371 oracle 15 0 752m 60m 15m S 131.6 0.0 0:54.58 oraagent.bin
1613 root 20 -5 0 0 0 R 65.0 0.0 74:39.92 kswapd0
11028 oracle 18 0 836m 122m 15m S 39.3 0.1 74:48.09 emagent

top - 03:21:00 up 14 days, 10:51, 1 user, load average: 70.53, 25.43, 11.46
Tasks: 2802 total, 67 running, 2726 sleeping, 0 stopped, 9 zombie
Cpu(s): 10.1%us, 88.8%sy, 0.0%ni, 0.3%id, 0.2%wa, 0.0%hi, 0.6%si, 0.0%st
Mem: 131576432k total, 130992144k used, 584288k free, 2160k buffers
Swap: 41943032k total, 10408800k used, 31534232k free, 18929308k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
30371 oracle 15 0 752m 60m 15m S 136.1 0.0 1:49.28 oraagent.bin
1613 root 20 -5 0 0 0 R 82.0 0.0 75:12.87 kswapd0
11028 oracle 18 0 991m 124m 15m S 63.7 0.1 75:13.68 emagent

top - 03:21:32 up 14 days, 10:51, 1 user, load average: 60.68, 27.65, 12.66
Tasks: 2797 total, 32 running, 2757 sleeping, 0 stopped, 8 zombie
Cpu(s): 15.1%us, 46.5%sy, 0.0%ni, 12.9%id, 25.0%wa, 0.0%hi, 0.4%si, 0.0%st
Mem: 131576432k total, 131016496k used, 559936k free, 19188k buffers
Swap: 41943032k total, 10368964k used, 31574068k free, 18969072k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
30371 oracle 15 0 752m 60m 15m S 43.5 0.0 2:03.61 oraagent.bin
1613 root 20 -5 0 0 0 R 38.7 0.0 75:25.60 kswapd0
31406 oracle 20 0 4255m 268m 258m R 30.6 0.2 0:17.57 oracle
28893 root 11 -5 0 0 0 R 27.3 0.0 45:44.49 kjournald

top - 03:22:12 up 14 days, 10:52, 1 user, load average: 76.49, 35.24, 15.81
Tasks: 2807 total, 62 running, 2734 sleeping, 0 stopped, 11 zombie
Cpu(s): 5.5%us, 91.2%sy, 0.0%ni, 2.2%id, 0.9%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 131576432k total, 130993112k used, 583320k free, 1704k buffers
Swap: 41943032k total, 10364776k used, 31578256k free, 18965324k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
30371 oracle 15 0 752m 60m 15m S 167.3 0.0 2:56.63 oraagent.bin
1613 root 20 -5 0 0 0 S 93.1 0.0 75:55.12 kswapd0
28893 root 20 -5 0 0 0 R 85.6 0.0 46:11.63 kjournald
11028 oracle 18 0 967m 124m 15m S 70.5 0.1 75:36.64 emagent

top - 03:39:53 up 14 days, 11:09, 1 user, load average: 596.27, 581.22, 392.10
Tasks: 2346 total, 36 running, 2294 sleeping, 0 stopped, 16 zombie
Cpu(s): 1.4%us, 98.5%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 131576432k total, 126810428k used, 4766004k free, 3352k buffers
Swap: 41943032k total, 10298216k used, 31644816k free, 19091144k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28893 root 20 -5 0 0 0 R 100.0 0.0 53:19.90 kjournald
28887 root 10 -5 0 0 0 S 100.0 0.0 16:44.68 kjournald
1613 root 20 -5 0 0 0 R 100.0 0.0 81:59.26 kswapd0
19176 root 3 -20 32948 11m 1604 D 300.0 0.0 1065:53 perfd
10916 root 25 0 522m 108m 7368 S 433.8 0.1 15:00.23 java

top - 03:40:23 up 14 days, 11:10, 1 user, load average: 381.08, 530.94, 381.42
Tasks: 2370 total, 3 running, 2359 sleeping, 0 stopped, 8 zombie
Cpu(s): 30.2%us, 23.4%sy, 0.0%ni, 16.0%id, 29.7%wa, 0.0%hi, 0.6%si, 0.0%st
Mem: 131576432k total, 130872156k used, 704276k free, 53176k buffers
Swap: 41943032k total, 8318688k used, 33624344k free, 21312200k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11421 oracle 15 0 8516m 3.7g 3.6g S 0.9 2.9 22:45.17 oracle
1613 root 20 -5 0 0 0 S 0.7 0.0 82:06.21 kswapd0
1558 oracle 15 0 4249m 158m 152m S 0.5 0.1 0:09.38 oracle

top - 03:40:53 up 14 days, 11:10, 1 user, load average: 263.65, 487.79, 371.79
Tasks: 2455 total, 67 running, 2380 sleeping, 0 stopped, 8 zombie
Cpu(s): 9.6%us, 63.2%sy, 0.0%ni, 20.8%id, 6.2%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 131576432k total, 131015320k used, 561112k free, 17856k buffers
Swap: 41943032k total, 9957804k used, 31985228k free, 19449320k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7559 oracle 15 0 744m 55m 15m S 106.8 0.0 0:34.80 oraagent.bin
1613 root 12 -5 0 0 0 R 48.8 0.0 82:20.91 kswapd0
28893 root 19 -5 0 0 0 R 37.6 0.0 53:31.32 kjournald
11421 oracle 15 0 8516m 3.7g 3.6g S 37.3 2.9 22:56.41 oracle

top - 03:42:55 up 14 days, 11:13, 1 user, load average: 50.13, 333.61, 329.47
Tasks: 2598 total, 4 running, 2586 sleeping, 0 stopped, 8 zombie
Cpu(s): 13.8%us, 4.2%sy, 0.0%ni, 57.1%id, 24.6%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 131576432k total, 130792168k used, 784264k free, 61412k buffers
Swap: 41943032k total, 11228692k used, 30714340k free, 18339848k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
30012 root 11 -10 36324 12m 2512 S 11.2 0.0 1808:45 glance
1017 oracle 15 0 4250m 353m 347m S 10.3 0.3 0:44.81 oracle
8238 oracle 16 0 4243m 467m 464m R 10.1 0.4 0:18.12 oracle
The huge gap where there were no top stats was when the system was totally unresponsive/hung.
Any clues as to what we may be facing here?
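For anyone reading a batch-mode top log like the one in this post, a small script can locate the gap automatically. The following is only a sketch: the function name `find_gaps` and the 300-second cutoff are assumptions chosen for illustration, and the sample text reuses four header lines from the output in this post.

```python
import re
from datetime import datetime

# Matches the header of each top snapshot, e.g. "top - 03:19:15 up 14 days, ..."
HEADER = re.compile(r"top - (\d{2}:\d{2}:\d{2}) up")

def find_gaps(top_log, max_gap_seconds=300):
    """Return (previous, next, seconds) for consecutive snapshots
    that are further apart than max_gap_seconds."""
    stamps = [datetime.strptime(t, "%H:%M:%S") for t in HEADER.findall(top_log)]
    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        delta = (cur - prev).total_seconds()
        if delta < 0:
            # The log crossed midnight; top headers carry no date.
            delta += 24 * 3600
        if delta > max_gap_seconds:
            gaps.append((prev.strftime("%H:%M:%S"),
                         cur.strftime("%H:%M:%S"),
                         int(delta)))
    return gaps

# Four header lines taken from the output in this post.
sample = """
top - 03:21:32 up 14 days, 10:51, 1 user, load average: 60.68, 27.65, 12.66
top - 03:22:12 up 14 days, 10:52, 1 user, load average: 76.49, 35.24, 15.81
top - 03:39:53 up 14 days, 11:09, 1 user, load average: 596.27, 581.22, 392.10
top - 03:40:23 up 14 days, 11:10, 1 user, load average: 381.08, 530.94, 381.42
"""

print(find_gaps(sample))  # [('03:22:12', '03:39:53', 1061)]
```

On this sample the only interval above the cutoff is the one between 03:22:12 and 03:39:53, roughly 17 minutes 41 seconds.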
03-04-2012 10:29 AM
Re: Server Hanging - 15 - 30 mins w/ Very High Load, then Resumes
>RHEL Support claims a vmcore needs to be produced whilst the system is hanging
Does this cause a system reboot? If so, how will your users feel about a temporary 15 minute hang vs. a much longer reboot and possible database recovery?
>We have SAR and TOP stats running every minute.
Anything useful in the sar stats?
As you said, your system is at about 98.5%sy, with a load average of almost 600!
>The huge gap where there were no TOP stats
(It would have helped if you marked the gap so we didn't have to look for it.)
Would it make sense to renice your stats processes to a negative value so they don't hang?
Do you have any output from glance?
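As a rough illustration of the renice idea, a collector written in Python could raise its own priority at startup. This is a sketch under assumptions: the helper name `raise_priority` and the value -10 are made up for the example, and lowering the nice value normally needs root. Note that in the top output above, glance (nice -10) and perfd (nice -20) already run at negative nice values, so a higher priority alone may not keep a collector alive through this kind of hang.

```python
import os

def raise_priority(niceness=-10, pid=0):
    """Try to set the nice value of `pid` (0 means the calling process).

    Returns the nice value in effect afterwards, whether or not the
    change was permitted.
    """
    try:
        os.setpriority(os.PRIO_PROCESS, pid, niceness)
    except PermissionError:
        # Lowering the nice value usually requires root privileges;
        # keep running at the current priority if it is refused.
        pass
    return os.getpriority(os.PRIO_PROCESS, pid)

if __name__ == "__main__":
    print("nice value now:", raise_priority(-10))
```

Called at the top of a stats script, this is equivalent in effect to running `renice` on the process after it starts.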
03-04-2012 08:30 PM
Re: Server Hanging - 15 - 30 mins w/ Very High Load, then Resumes
The telltale signs are very much what I provided in the per-minute top stats (batch mode) above.
Glance also hiccuped during the ~15 minutes the server was "hung". And NO, the server never reboots; it simply hangs for anywhere from 5 minutes to close to an hour on busy DB servers.
This issue seems to have started only after we moved to Oracle 11gR2 (11.2.0.2 + patches is our current version). Our HP-UX to Linux migration was completed under 10gR2 (10.2.0.5), and for close to a year our databases were remarkably stable.
RHEL support has so far been unable to help except to remind us of Oracle best practice on RHEL, which we've already applied (HugePages, vm.* kernel settings, etc.). They want a vmcore taken during the actual hang so they can check what's going on, and so far we've been unable to trigger one (SysRq+C), as every time we are made aware, the server has already resumed from the hung state.
The servers are Dunnington Xeon 7400 (6-core), which predate the Nehalems. They are certainly oodles slow, memory and I/O bus wise, as these were the last of the Xeons based on the old front-side bus architecture. One suspicion is that the architecture of these Xeon 7400s was never meant to scale (even if these were the first x86 servers to break the million TPC mark back in 2009).
We are now thinking of moving to Oracle Linux with the UEK kernel for hopefully better support and better troubleshooting. Oracle do develop their entire software stack, including the Database, on their own fork of Linux.
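Since the difficulty described above is that nobody is at the keyboard while the hang is in progress, one option is to automate the SysRq+C trigger. The sketch below is hypothetical and hedged in several ways: the function names, the load threshold of 200, and the sampling interval are assumptions; writing `c` to `/proc/sysrq-trigger` deliberately crashes the kernel, so kdump must already be configured and the server will go down; and a userspace watcher may itself be starved of CPU during the hang. It defaults to a dry run for that reason.

```python
import os
import time

def sustained_high_load(samples, threshold, consecutive):
    """True if the last `consecutive` load samples all exceed `threshold`."""
    recent = samples[-consecutive:]
    return len(recent) == consecutive and all(s > threshold for s in recent)

def trigger_crash(dry_run=True):
    """Write 'c' to /proc/sysrq-trigger, which panics the kernel.

    A vmcore is only captured if kdump is set up. Needs root.
    """
    if dry_run:
        print("dry run: would write 'c' to /proc/sysrq-trigger")
        return
    with open("/proc/sysrq-trigger", "w") as f:
        f.write("c")

def watch(threshold=200.0, consecutive=3, interval=10, dry_run=True):
    """Sample the 1-minute load average and fire once it stays high."""
    samples = []
    while True:
        samples.append(os.getloadavg()[0])
        samples = samples[-consecutive:]  # keep only what the check needs
        if sustained_high_load(samples, threshold, consecutive):
            trigger_crash(dry_run)
            return
        time.sleep(interval)
```

Calling `watch()` with the defaults only prints a message when the condition is met; passing `dry_run=False` is what would actually crash the box, so the threshold should sit well above any load the server sees in normal operation.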
03-04-2012 11:17 PM
Re: Server Hanging - 15 - 30 mins w/ Very High Load, then Resumes
>And NO the server never reboots
I meant that if you had to take a dump, you would have to reboot.