Server hang

dawn_jose85
Frequent Advisor

Server hang

The server hung.
It is an HP ProLiant server. It showed errors on the console.
The OS is:
ipl315root#uname -a
Linux ipl315 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64
ipl315root#
I'm attaching the logs
top - 12:21:14 up 40 min, 2 users, load average: 1.28, 1.20, 1.42
Tasks: 440 total, 1 running, 439 sleeping, 0 stopped, 0 zombie
Cpu(s): 3.5%us, 1.4%sy, 0.0%ni, 94.8%id, 0.0%wa, 0.1%hi, 0.2%si, 0.0%st
Mem: 16305944k total, 8631928k used, 7674016k free, 354044k buffers
Swap: 24002632k total, 0k used, 24002632k free, 5448132k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12326 oracle 15 0 428m 116m 110m S 3 0.7 0:16.86 oracle
11563 oracle 15 0 429m 83m 77m S 0 0.5 0:08.11 oracle
7779 linus 15 0 365m 80m 22m S 2 0.5 0:36.75 BNLCCSR_sep
7774 linus 16 0 365m 79m 20m S 2 0.5 0:37.14 BNLCCSR_sep
7776 linus 15 0 365m 78m 20m S 1 0.5 0:36.77 BNLCCSR_sep
7771 linus 15 0 364m 78m 20m S 2 0.5 0:36.66 BNLCCSR_sep
7785 linus 15 0 364m 78m 20m S 2 0.5 0:36.73 BNLCCSR_sep
7782 linus 15 0 364m 78m 20m S 3 0.5 0:36.87 BNLCCSR_sep
11572 oracle 15 0 429m 78m 72m S 0 0.5 0:05.72 oracle
7770 linus 15 0 364m 78m 20m S 2 0.5 0:36.28 BNLCCSR_sep
7783 linus 15 0 364m 78m 20m S 2 0.5 0:37.43 BNLCCSR_sep
7780 linus 15 0 364m 78m 20m S 1 0.5 0:36.27 BNLCCSR_sep
7788 linus 15 0 364m 78m 20m S 1 0.5 0:36.18 BNLCCSR_sep
7775 linus 15 0 364m 78m 20m S 1 0.5 0:36.85 BNLCCSR_sep
7773 linus 15 0 364m 77m 20m S 1 0.5 0:37.07 BNLCCSR_sep
7787 linus 15 0 364m 77m 20m S 1 0.5 0:36.25 BNLCCSR_sep
7777 linus 15 0 364m 77m 20m S 1 0.5 0:37.07 BNLCCSR_sep
7786 linus 15 0 364m 77m 20m S 2 0.5 0:36.68 BNLCCSR_sep
7784 linus 15 0 364m 77m 20m S 2 0.5 0:36.31 BNLCCSR_sep
7778 linus 15 0 364m 77m 20m S 1 0.5 0:37.14 BNLCCSR_sep
7781 linus 15 0 364m 77m 20m S 2 0.5 0:36.70 BNLCCSR_sep
11568 oracle 15 0 429m 77m 71m S 0 0.5 0:05.93 oracle
7551 oracle 16 0 429m 72m 67m S 0 0.5 0:00.61 oracle
7630 oracle 15 0 429m 67m 61m S 0 0.4 0:00.96 oracle
7805 linus 15 0 373m 65m 23m S 0 0.4 0:05.66 BNLVISR_sep
7544 oracle 15 0 429m 64m 59m S 0 0.4 0:00.53 oracle
7694 oracle 15 0 429m 64m 59m S 0 0.4 0:00.69 oracle
11569 oracle 15 0 428m 64m 58m S 0 0.4 0:02.19 oracle
7620 oracle 15 0 429m 64m 57m S 0 0.4 0:00.87 oracle
7629 oracle 15 0 429m 64m 58m S 0 0.4 0:00.87 oracle
7615 oracle 15 0 429m 64m 58m S 0 0.4 0:00.70 oracle
7486 oracle 15 0 429m 63m 57m S 0 0.4 0:00.17 oracle
7806 linus 16 0 374m 63m 21m S 0 0.4 0:05.98 BNLVISR_sep
7557 oracle 15 0 429m 63m 58m S 0 0.4 0:00.38 oracle
7538 oracle 15 0 429m 63m 58m S 0 0.4 0:00.50 oracle
7540 oracle 16 0 429m 63m 58m S 0 0.4 0:00.52 oracle
7807 linus 15 0 372m 62m 21m S 0 0.4 0:06.01 BNLVISR_sep

888888888888888
ipl315root#free -m
             total       used       free     shared    buffers     cached
Mem:         15923       8431       7492          0        345       5320
-/+ buffers/cache:       2764      13159
Swap:        23440          0      23440

ipl315root#cat /proc/meminfo
MemTotal: 16305944 kB
MemFree: 7664060 kB
Buffers: 354852 kB
Cached: 5450384 kB
SwapCached: 0 kB
Active: 4262628 kB
Inactive: 3853676 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 16305944 kB
LowFree: 7664060 kB
SwapTotal: 24002632 kB
SwapFree: 24002632 kB
Dirty: 2916 kB
Writeback: 0 kB
AnonPages: 2311056 kB
Mapped: 1009616 kB
Slab: 204472 kB
PageTables: 128076 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 32105428 kB
Committed_AS: 7620100 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 275584 kB
VmallocChunk: 34359461883 kB
HugePages_Total: 49
HugePages_Free: 49
HugePages_Rsvd: 0
Hugepagesize: 2048 kB

ipl315root#cat /proc/buddyinfo
Node 0, zone DMA 3 4 2 4 2 2 1 0 2 0 2
Node 0, zone DMA32 14 7 4 1 1 0 0 1 0 2 771
Node 0, zone Normal 1118 311 76 14 5 395 241 119 65 20 1024

ipl315root#iostat
Linux 2.6.18-92.el5 (ipl315) 02/05/2011

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.74    0.00    1.69    1.62    0.00   92.95

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
cciss/c0d0       41.08       684.18       501.66    3413860    2503127
cciss/c0d1       85.31      1734.63       689.47    8655254    3440216

I'm attaching the logs. Can anyone please analyse this issue?
10 REPLIES
Alzhy
Honored Contributor

Re: Server hang

It is likely your server is hitting a sudden memory shortage. Maybe your app is misconfigured and a process (or a gang of them) suddenly starts up and rapidly depletes memory.

How long has this been up and were there any changes? Do you have performance monitoring enabled? If not, and you do not have any of those fancy monitoring add-ons, I suggest you install SAR (the sysstat package) and set it up to gather stats every 5 minutes, every day. You know how to set up and use sar, right?

If not, it is easy:

Install the sysstat package (with yum or similar).
Set it up in cron for 5-minute gathering:

cd /etc/cron.d
vi sysstat

Once set up, it should gather and store historical performance data for up to a month.
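
For reference, here is a rough sketch of what the line in /etc/cron.d/sysstat could look like for 5-minute collection. The sa1 location used here (/usr/lib64/sa/sa1) is an assumption for a 64-bit RHEL 5 style install, and the helper name sysstat_cron_entry is made up for illustration, so check where your sysstat package actually put sa1 before copying the line.

```shell
# Print a cron.d line that runs the sysstat collector (sa1) every N minutes.
# Assumption: sa1 lives in /usr/lib64/sa on this system; pass another path if not.
sysstat_cron_entry() {
    interval="${1:-5}"
    sa1_path="${2:-/usr/lib64/sa/sa1}"
    # "1 1" asks sa1 for one sample at a 1-second interval on each run.
    printf '*/%s * * * * root %s 1 1\n' "$interval" "$sa1_path"
}

# Show the line for 5-minute collection.
sysstat_cron_entry 5
```

The printed line is what would go into /etc/cron.d/sysstat in place of the packaged 10-minute default, if your install ships one.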

To view the current memory historicals for the day:

sar -r

To view SAR memory stats for the 5th of the current month:

sar -f /var/log/sa/sa05 -r

To view memory stats in real time at 5-second intervals, for 10 intervals:

sar -r 5 10

So set SAR up and see whether free memory (RAM and swap) decays progressively. SAR keeps each day's stats on disk, so the current day's data should still be there after you reboot the server.
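
A quick way to eyeball that decay is to compare the first and last kbmemfree samples of the day. The sketch below assumes the 12-hour sar -r layout, where the time takes two fields (for example "11:10:01 AM") and kbmemfree is the third field; the function name sar_memfree_trend is just an illustrative choice.

```shell
# Read `sar -r` output on stdin and report how kbmemfree changed
# between the first and the last sample.
# Assumption: 12-hour time format, so kbmemfree is field 3.
sar_memfree_trend() {
    awk '$3 ~ /^[0-9]+$/ && $1 != "Average:" {
        if (!seen) { first = $3; seen = 1 }   # remember the first sample
        last = $3                             # keep overwriting with the latest
    }
    END {
        if (seen) printf "first=%d last=%d change=%d kB\n", first, last, last - first
    }'
}

# Typical use on a live system:
#   sar -r | sar_memfree_trend
```

A steadily negative change over hours of flat workload would be the kind of pattern worth chasing; a small drift on its own does not prove a leak.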

Hakuna Matata.
dawn_jose85
Frequent Advisor

Re: Server hang

Can we find the root cause of this issue from these logs?
How did it happen? Could it be because of application memory usage? Can we find out anything from this log itself?
Alzhy
Honored Contributor

Re: Server hang

Yes -- very evident.

I suspect settings in your App/DB that use way too much memory for the 16GB RAM the server has.

It may also be that processes keep spawning until the memory is exhausted...

Bottom line: If you do not have SAR stats enabled -- you can't know where to start or have proof. SAR stats are always a good START.


Also -- if this thing just started happening and it's been stable before -- then it is likely a recent change.

Hakuna Matata.
dawn_jose85
Frequent Advisor

Re: Server hang

Hi,
I'm also suspecting Oracle. How can I use the SAR facility? How can I check whether the utility is installed and working properly on my server?
What are the steps?
Alzhy
Honored Contributor

Re: Server hang

3 posts up ^^
Hakuna Matata.
dawn_jose85
Frequent Advisor

Re: Server hang

I have installed the sar package.
It shows that %memused is about 80%.
Here is the sar output:
11:00:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached kbswpfree kbswpused  %swpused  kbswpcad
11:10:01 AM   3085188  13220756     81.08    673440   6612340  24002632         0      0.00         0
11:20:01 AM   3095792  13210152     81.01    673556   6594972  24002632         0      0.00         0
11:30:01 AM   3072556  13233388     81.16    673640   6598508  24002632         0      0.00         0
11:40:01 AM   3076224  13229720     81.13    673764   6601408  24002632         0      0.00         0
11:50:01 AM   3068256  13237688     81.18    673936   6604576  24002632         0      0.00         0
12:00:01 PM   3067308  13238636     81.19    674056   6597612  24002632         0      0.00         0
12:10:01 PM   3059496  13246448     81.24    674076   6600760  24002632         0      0.00         0
12:20:01 PM   3050072  13255872     81.29    674156   6604268  24002632         0      0.00         0
12:30:01 PM   3042816  13263128     81.34    674348   6607128  24002632         0      0.00         0
12:40:01 PM   3029124  13276820     81.42    674416   6610244  24002632         0      0.00         0
12:50:01 PM   3028480  13277464     81.43    674556   6603168  24002632         0      0.00         0
01:00:01 PM   3018824  13287120     81.49    674640   6606392  24002632         0      0.00         0
01:10:01 PM   3010656  13295288     81.54    674736   6609584  24002632         0      0.00         0
01:20:01 PM   3001644  13304300     81.59    674852   6612848  24002632         0      0.00         0
01:30:01 PM   2992612  13313332     81.65    674972   6616052  24002632         0      0.00         0
01:40:01 PM   2993488  13312456     81.64    675072   6608836  24002632         0      0.00         0
01:50:01 PM   2984544  13321400     81.70    675228   6611964  24002632         0      0.00         0
Average:      3176800  13129144     80.52    670255   6603631  24002632         0      0.00         0
ipl315root#
Is this an issue? How much %memused is acceptable?
Alzhy
Honored Contributor

Re: Server hang

80% is a bit high, especially if it is during a period of low usage or with few or no connections.
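
One thing worth checking before reading too much into that figure: in the sar output posted above, kbmemfree plus kbmemused adds up to the full 16305944 kB of RAM, which suggests kbmemused counts buffers and page cache as "used", the same way the first line of free does. A rough sketch that subtracts them is below. It assumes the 12-hour sar -r layout shown above (time in two fields), and the function name sar_app_mem_pct is only illustrative.

```shell
# Read `sar -r` output on stdin; for each sample print the time and the
# percentage of RAM in use once buffers and page cache are left out.
# Assumption: fields are time, AM/PM, kbmemfree, kbmemused, %memused,
# kbbuffers, kbcached (the layout posted in this thread).
sar_app_mem_pct() {
    awk '$3 ~ /^[0-9]+$/ && $1 != "Average:" {
        free = $3; used = $4; buffers = $6; cached = $7
        total = free + used
        printf "%s %s %.2f\n", $1, $2, 100 * (used - buffers - cached) / total
    }'
}

# Typical use on a live system:
#   sar -r | sar_app_mem_pct
```

For the 11:10:01 AM sample above this gives 36.40, that is, (13220756 - 673440 - 6612340) / 16305944. Cache is normally handed back when applications need the memory, so this lower figure is usually the more telling one when looking for a shortfall.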

SAR will let you see what happened memory-wise during a hang, as it stores stats in /var/log/sa.

Once the server is up, you can just run sar -r and it will give you memory stats up until the time of the hang/reboot.

You may also set up a periodic cron job to gather ps -ef output, so you can correlate the sar stats with whatever causes the memory shortfall.
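
As a sketch of that ps idea, the helper below keeps only the biggest memory consumers from a ps listing, so the periodic snapshots stay small. It swaps ps -ef for a ps -eo listing with an RSS column so the output can be sorted by memory; the function name top_rss and the log path in the usage comment are hypothetical.

```shell
# Read lines of "pid user rss command" on stdin and keep the N entries
# with the largest resident set size (field 3, in kB).
top_rss() {
    n="${1:-10}"
    sort -k3,3nr | head -n "$n"
}

# Possible use from a cron job (log path is only an example):
#   { date; ps -eo pid,user,rss,comm --no-headers | top_rss 10; } >> /var/log/ps-mem.log
```

Lining up the timestamps in that log with the sar -r samples should show which processes were growing just before a hang.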

Hakuna Matata.
dawn_jose85
Frequent Advisor

Re: Server hang

Hi
Can you please tell me how to analyse the sa file?
I'm new to this file and can't understand the details inside.
dawn_jose85
Frequent Advisor

Re: Server hang

Hi,
Is sar only used to check the memory usage of different processes? Are there any other uses for the sa files and SAR stats?
Alzhy
Honored Contributor

Re: Server hang

You are aware of the "man" pages, right?

man sar

It can gather really important system stats that you can go back to, now that you've set it to gather stats (if you've followed my instructions and have it in cron).

Now AGAIN, what will this be useful for? To specifically ascertain whether your server indeed had an OOM (out of memory) situation. Why? Because once you force the hung server to reboot, you can refer to whatever it captured in the sar logs.

Again (as you will learn from the sar man pages):

sar -r (will display your current day's stats up to the time you've run the command)

sar -r -f /var/log/sa/sa17 (will display yesterday's memory stats if today is the 18th)
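
If typing the day number by hand gets tedious, a small helper can build the file name. This is only a sketch: it assumes the data files are kept as /var/log/sa/saDD, as mentioned earlier in the thread, and the name sa_file_for_day is made up.

```shell
# Build the path of the sysstat data file for a given day of the month.
# Assumption: data files are stored as /var/log/sa/saDD.
sa_file_for_day() {
    day="${1#0}"    # drop a leading zero so printf does not read "08" as octal
    printf '/var/log/sa/sa%02d\n' "$day"
}

# Memory stats for the 17th of this month:
#   sar -r -f "$(sa_file_for_day 17)"
# Yesterday's file (assumes GNU date, which understands "-d yesterday"):
#   sar -r -f "$(sa_file_for_day "$(date -d yesterday +%d)")"
```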

Hope this helps and how about some Luv if you think this has been helpful and you gained SAR skills.


Alzhy Le'Cruel
Hakuna Matata.