
Server Hanging - 15 - 30 mins w/ Very High Load, then Resumes

 
Alzhy
Honored Contributor


RHEL 5.7, 24-way Intel Dunnington CPUs (4x Xeon 7400, 6 cores each), 128GB RAM, 4x 4Gbit FC storage channels to a Tier-1 XP12K SAN. The server occasionally hangs: it becomes inaccessible via console, ssh or SQLNet (the server is a DB host) but remains pingable. The issue always seems to be characterised by very high system load (sometimes reaching over 500!) and %system CPU reaching 100%, with the kswapd and kjournald processes becoming hyperactive.

 

RHEL Support claims a vmcore needs to be produced whilst the system is hanging, but we have not been able to do that: every time we are called, the server is already accessible again. We have sar and top stats running every minute. Below are the stats from "top" around the time it was hanging:

 

top - 03:19:15 up 14 days, 10:49,  1 user,  load average: 9.23, 5.61, 4.40
Tasks: 2728 total,   7 running, 2713 sleeping,   0 stopped,   8 zombie
Cpu(s):  9.8%us, 13.6%sy,  0.0%ni, 42.6%id, 33.0%wa,  0.0%hi,  0.9%si,  0.0%st
Mem:  131576432k total, 130991428k used,   585004k free,   128996k buffers
Swap: 41943032k total, 10339928k used, 31603104k free, 19088740k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1613 root      13  -5     0    0    0 D 24.8  0.0  74:08.32 kswapd0
30371 oracle    16   0  752m  60m  15m S 24.1  0.0   0:09.15 oraagent.bin
27431 oracle    15   0 4247m 486m 481m S 20.2  0.4   3:14.73 oracle


top - 03:19:47 up 14 days, 10:49,  1 user,  load average: 11.08, 6.38, 4.70
Tasks: 2772 total,  38 running, 2726 sleeping,   0 stopped,   8 zombie
Cpu(s): 16.6%us, 26.3%sy,  0.0%ni, 22.5%id, 33.6%wa,  0.0%hi,  0.9%si,  0.0%st
Mem:  131576432k total, 131015736k used,   560696k free,     7416k buffers
Swap: 41943032k total, 10375728k used, 31567304k free, 18956176k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1613 root      12  -5     0    0    0 R 36.2  0.0  74:19.23 kswapd0
31361 oracle    15   0 4248m 256m 251m S 21.4  0.2   0:07.34 oracle
27431 oracle    15   0 4247m 486m 481m S 20.5  0.4   3:20.90 oracle

top - 03:20:27 up 14 days, 10:50,  1 user,  load average: 47.72, 16.35, 8.12
Tasks: 2785 total,  46 running, 2731 sleeping,   0 stopped,   8 zombie
Cpu(s):  9.9%us, 76.8%sy,  0.0%ni,  4.3%id,  7.1%wa,  0.0%hi,  2.0%si,  0.0%st
Mem:  131576432k total, 130953344k used,   623088k free,    10596k buffers
Swap: 41943032k total, 10418420k used, 31524612k free, 18922924k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
30371 oracle    15   0  752m  60m  15m S 131.6  0.0   0:54.58 oraagent.bin
 1613 root      20  -5     0    0    0 R 65.0  0.0  74:39.92 kswapd0
11028 oracle    18   0  836m 122m  15m S 39.3  0.1  74:48.09 emagent

top - 03:21:00 up 14 days, 10:51,  1 user,  load average: 70.53, 25.43, 11.46
Tasks: 2802 total,  67 running, 2726 sleeping,   0 stopped,   9 zombie
Cpu(s): 10.1%us, 88.8%sy,  0.0%ni,  0.3%id,  0.2%wa,  0.0%hi,  0.6%si,  0.0%st
Mem:  131576432k total, 130992144k used,   584288k free,     2160k buffers
Swap: 41943032k total, 10408800k used, 31534232k free, 18929308k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
30371 oracle    15   0  752m  60m  15m S 136.1  0.0   1:49.28 oraagent.bin
 1613 root      20  -5     0    0    0 R 82.0  0.0  75:12.87 kswapd0
11028 oracle    18   0  991m 124m  15m S 63.7  0.1  75:13.68 emagent

top - 03:21:32 up 14 days, 10:51,  1 user,  load average: 60.68, 27.65, 12.66
Tasks: 2797 total,  32 running, 2757 sleeping,   0 stopped,   8 zombie
Cpu(s): 15.1%us, 46.5%sy,  0.0%ni, 12.9%id, 25.0%wa,  0.0%hi,  0.4%si,  0.0%st
Mem:  131576432k total, 131016496k used,   559936k free,    19188k buffers
Swap: 41943032k total, 10368964k used, 31574068k free, 18969072k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
30371 oracle    15   0  752m  60m  15m S 43.5  0.0   2:03.61 oraagent.bin
 1613 root      20  -5     0    0    0 R 38.7  0.0  75:25.60 kswapd0
31406 oracle    20   0 4255m 268m 258m R 30.6  0.2   0:17.57 oracle
28893 root      11  -5     0    0    0 R 27.3  0.0  45:44.49 kjournald

top - 03:22:12 up 14 days, 10:52,  1 user,  load average: 76.49, 35.24, 15.81
Tasks: 2807 total,  62 running, 2734 sleeping,   0 stopped,  11 zombie
Cpu(s):  5.5%us, 91.2%sy,  0.0%ni,  2.2%id,  0.9%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:  131576432k total, 130993112k used,   583320k free,     1704k buffers
Swap: 41943032k total, 10364776k used, 31578256k free, 18965324k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
30371 oracle    15   0  752m  60m  15m S 167.3  0.0   2:56.63 oraagent.bin
 1613 root      20  -5     0    0    0 S 93.1  0.0  75:55.12 kswapd0
28893 root      20  -5     0    0    0 R 85.6  0.0  46:11.63 kjournald
11028 oracle    18   0  967m 124m  15m S 70.5  0.1  75:36.64 emagent

top - 03:39:53 up 14 days, 11:09,  1 user,  load average: 596.27, 581.22, 392.10
Tasks: 2346 total,  36 running, 2294 sleeping,   0 stopped,  16 zombie
Cpu(s):  1.4%us, 98.5%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  131576432k total, 126810428k used,  4766004k free,     3352k buffers
Swap: 41943032k total, 10298216k used, 31644816k free, 19091144k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28893 root      20  -5     0    0    0 R 100.0  0.0  53:19.90 kjournald
28887 root      10  -5     0    0    0 S 100.0  0.0  16:44.68 kjournald
 1613 root      20  -5     0    0    0 R 100.0  0.0  81:59.26 kswapd0
19176 root       3 -20 32948  11m 1604 D 300.0  0.0   1065:53 perfd
10916 root      25   0  522m 108m 7368 S 433.8  0.1  15:00.23 java

top - 03:40:23 up 14 days, 11:10,  1 user,  load average: 381.08, 530.94, 381.42
Tasks: 2370 total,   3 running, 2359 sleeping,   0 stopped,   8 zombie
Cpu(s): 30.2%us, 23.4%sy,  0.0%ni, 16.0%id, 29.7%wa,  0.0%hi,  0.6%si,  0.0%st
Mem:  131576432k total, 130872156k used,   704276k free,    53176k buffers
Swap: 41943032k total,  8318688k used, 33624344k free, 21312200k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11421 oracle    15   0 8516m 3.7g 3.6g S  0.9  2.9  22:45.17 oracle
 1613 root      20  -5     0    0    0 S  0.7  0.0  82:06.21 kswapd0
 1558 oracle    15   0 4249m 158m 152m S  0.5  0.1   0:09.38 oracle

top - 03:40:53 up 14 days, 11:10,  1 user,  load average: 263.65, 487.79, 371.79
Tasks: 2455 total,  67 running, 2380 sleeping,   0 stopped,   8 zombie
Cpu(s):  9.6%us, 63.2%sy,  0.0%ni, 20.8%id,  6.2%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:  131576432k total, 131015320k used,   561112k free,    17856k buffers
Swap: 41943032k total,  9957804k used, 31985228k free, 19449320k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7559 oracle    15   0  744m  55m  15m S 106.8  0.0   0:34.80 oraagent.bin
 1613 root      12  -5     0    0    0 R 48.8  0.0  82:20.91 kswapd0
28893 root      19  -5     0    0    0 R 37.6  0.0  53:31.32 kjournald
11421 oracle    15   0 8516m 3.7g 3.6g S 37.3  2.9  22:56.41 oracle

top - 03:42:55 up 14 days, 11:13,  1 user,  load average: 50.13, 333.61, 329.47
Tasks: 2598 total,   4 running, 2586 sleeping,   0 stopped,   8 zombie
Cpu(s): 13.8%us,  4.2%sy,  0.0%ni, 57.1%id, 24.6%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:  131576432k total, 130792168k used,   784264k free,    61412k buffers
Swap: 41943032k total, 11228692k used, 30714340k free, 18339848k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
30012 root      11 -10 36324  12m 2512 S 11.2  0.0   1808:45 glance
 1017 oracle    15   0 4250m 353m 347m S 10.3  0.3   0:44.81 oracle
 8238 oracle    16   0 4243m 467m 464m R 10.1  0.4   0:18.12 oracle

 

The huge gap where there were no top stats was when the system was totally unresponsive/hung.

 

Any clues as to what we may be facing here?

 

Hakuna Matata.
Dennis Handly
Acclaimed Contributor

Re: Server Hanging - 15 - 30 mins w/ Very High Load, then Resumes

>RHEL Support claims a vMcore needs to be produced whilst the system is hanging

 

Does this cause a system reboot?  If so, how will your users feel about a temporary 15 minute hang vs much longer reboot and possible database recovery?

 

>We have SAR and TOP stats running every minute.

 

Anything useful in the sar stats?

 

As you said, your system seems to be near 98.5%sy.  And a load average of almost 600!

 

>The huge gap where there was no TOP stats

 

(It would have helped if you marked the gap so we didn't have to look for it.)
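For anyone wanting to mark such gaps automatically, here is a minimal sketch in Python. It assumes the batch-mode top output is saved as text and that every sample starts with a header of the form "top - HH:MM:SS up ...", as in the original post. The function name find_gaps and the 300-second threshold are illustrative choices, not anything the poster used.

```python
import re
from datetime import datetime, timedelta

# Matches the first line of each top sample, e.g. "top - 03:22:12 up 14 days, ..."
TOP_HEADER = re.compile(r"^top - (\d{2}:\d{2}:\d{2}) up")


def find_gaps(lines, max_gap_seconds=300):
    """Return (previous, current, seconds) for consecutive top samples
    that are further apart than max_gap_seconds (an assumed threshold)."""
    gaps = []
    prev = None
    for line in lines:
        match = TOP_HEADER.match(line)
        if not match:
            continue
        cur = datetime.strptime(match.group(1), "%H:%M:%S")
        if prev is not None:
            delta = cur - prev
            if delta < timedelta(0):
                # The header carries no date, so assume the clock passed midnight.
                delta += timedelta(days=1)
            if delta.total_seconds() > max_gap_seconds:
                gaps.append((prev.strftime("%H:%M:%S"),
                             cur.strftime("%H:%M:%S"),
                             int(delta.total_seconds())))
        prev = cur
    return gaps


# Three header lines copied from the top output in the original post.
sample = [
    "top - 03:22:12 up 14 days, 10:52,  1 user,  load average: 76.49, 35.24, 15.81",
    "top - 03:39:53 up 14 days, 11:09,  1 user,  load average: 596.27, 581.22, 392.10",
    "top - 03:40:23 up 14 days, 11:10,  1 user,  load average: 381.08, 530.94, 381.42",
]

for start, end, seconds in find_gaps(sample):
    print("no samples between %s and %s (%d s)" % (start, end, seconds))
# prints: no samples between 03:22:12 and 03:39:53 (1061 s)
```

Run against the full capture, this would flag the window in which top itself could not get scheduled.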

Would it make sense to renice your stats processes to a negative value so they don't hang?
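A rough sketch of the renice idea in Python follows. It assumes a Python 3 interpreter is available on the host (RHEL 5 ships a much older Python by default) and that the caller is root, since lowering a nice value normally requires that. The helper name boost_priority and the default of -10 are hypothetical. Note that the top output above already shows glance at NI -10 and perfd at NI -20, so a negative nice value may not be enough on its own when the kernel is spending ~98% of its time in system mode.

```python
import os


def boost_priority(pid, niceness=-10):
    """Try to set the nice value of process `pid`.

    Returns True on success, False if the process does not exist or the
    caller lacks permission (negative values normally need root)."""
    try:
        os.setpriority(os.PRIO_PROCESS, pid, niceness)
        return True
    except PermissionError:
        return False
    except ProcessLookupError:
        return False
```

A collector script could call boost_priority(os.getpid()) once at startup before entering its sampling loop.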

 

Do you have any output from glance?

Alzhy
Honored Contributor

Re: Server Hanging - 15 - 30 mins w/ Very High Load, then Resumes

The telltale signs are pretty much what I provided in the per-minute top stats (batch mode) above.

 

Glance also hiccuped during the ~15 minutes the server was "hung". And NO, the server never reboots; it simply hangs for anywhere between 5 minutes and close to an hour on busy DB servers.

 

This issue seems to have started only after we moved to Oracle 11gR2 (11.2.0.2 + patches is our current version). Our HP-UX to Linux migration was completed under 10gR2 (10.2.0.5), and for close to a year our databases were remarkably stable.

 

RHEL support has so far been unable to help, except to remind us of Oracle best practices on RHEL, which we have already implemented (HugePages, vm.* kernel settings, etc.). They demand a vmcore taken during the actual hang so they can check what is going on, and so far we have been unable to trigger one (SysRq+C) because every time we are made aware, the server has already resumed from the hung state.
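One way to avoid depending on a human noticing the hang in time would be a small watchdog that reacts to sustained high load. The Python sketch below is only an illustration under several assumptions: kdump is already configured and tested so that a forced crash actually produces a vmcore, the script runs as root, and the threshold of 200 with three consecutive samples is a made-up starting point loosely based on the load averages in the top output. All function names are hypothetical. By default it takes no action at all; forcing a crash is opt-in because it reboots the server and will require database recovery.

```python
import time

LOADAVG_PATH = "/proc/loadavg"        # first field is the 1-minute load average
SYSRQ_TRIGGER = "/proc/sysrq-trigger"  # writing "c" forces a kernel crash


def read_load1(path=LOADAVG_PATH):
    """Return the 1-minute load average as a float."""
    with open(path) as f:
        return float(f.read().split()[0])


def should_trigger(samples, threshold, needed):
    """True when the last `needed` samples all exceed `threshold`."""
    recent = samples[-needed:]
    return len(recent) == needed and all(s > threshold for s in recent)


def force_crash_dump(path=SYSRQ_TRIGGER):
    """DANGER: crashes the kernel immediately (same effect as SysRq+C).
    Only useful if kdump is set up to capture the vmcore."""
    with open(path, "w") as f:
        f.write("c")


def watch(threshold=200.0, needed=3, interval=10, action=None,
          read=read_load1, sleep=time.sleep, max_iterations=None):
    """Poll the load average; call `action` once the load has stayed above
    `threshold` for `needed` consecutive samples. Returns True if it fired."""
    samples = []
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        samples.append(read())
        samples = samples[-needed:]
        if should_trigger(samples, threshold, needed):
            if action is not None:
                action()
            return True
        sleep(interval)
        iterations += 1
    return False
```

A cautious first step would be to run watch() with an action that only logs a timestamp, to see whether the watchdog itself still gets scheduled during a hang, and to pass action=force_crash_dump only once that is confirmed and the outage has been agreed with the DB owners.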

 

The servers are Dunnington Xeon 7400s (6-core), which predate the Nehalems. They are certainly slow, memory- and I/O-bus-wise, as these were the last of the Xeons based on the old front-side bus architecture. One suspicion is that the architecture of these Xeon 7400s was never meant to scale (even if these were the first x86 servers to break the million TPC mark back in 2009).

 

We are now thinking of moving to Oracle Linux with the UEK kernel for hopefully better support and better troubleshooting. Oracle does develop its entire software stack, including the database, on its own fork of Linux.

 

Hakuna Matata.
Dennis Handly
Acclaimed Contributor

Re: Server Hanging - 15 - 30 mins w/ Very High Load, then Resumes

>And NO the server never reboots

 

I meant that if you had to take a dump, you would have to reboot.