memory issues causing server hang

 
Sunny Jaisinghani
Trusted Contributor

Hello All,

My RHEL 4 box hung the day before yesterday. I had a chance to look at it before it stopped responding: some Java and Oracle processes were hammering the CPU and memory. Before I could free those resources, the server stopped responding, and the onsite engineer had to hard-reboot it.
He mentioned that he had to run fsck on the swap FS (lvol1) to bring the server up.
Below are some logs which indicate that there were memory issues.
I did not see any SCSI errors.

---- Do these logs indicate serious trouble ahead?
---- What else should I troubleshoot?

Jan 11 16:42:49 renault kernel: kswapd0: page allocation failure. order:0, mode:0x50
Jan 11 16:42:53 renault kernel: [] __alloc_pages+0x28b/0x29d
Jan 11 16:42:53 renault kernel: [] find_or_create_page+0x39/0x72
Jan 11 16:42:53 renault kernel: [] grow_dev_page+0x2a/0x1eb
Jan 11 16:42:53 renault kernel: [] __getblk_slow+0xd5/0xf9
Jan 11 16:42:53 renault kernel: [] __getblk+0x3f/0x49
Jan 11 16:42:53 renault kernel: [] __bread+0x9/0x1e
Jan 11 16:42:53 renault kernel: [] read_block_bitmap+0x29/0x4d [ext3]
Jan 11 16:42:53 renault kernel: [] ext3_new_block+0x189/0x581 [ext3]
Jan 11 16:42:53 renault kernel: [] ext3_alloc_block+0x9/0xb [ext3]
Jan 11 16:42:53 renault kernel: [] ext3_alloc_branch+0x4a/0x25e [ext3]
Jan 11 16:42:53 renault kernel: [] __map_bio+0x34/0xb4 [dm_mod]
Jan 11 16:42:53 renault kernel: [] ext3_get_block_handle+0x1b7/0x276 [ext3]
Jan 11 16:43:26 renault kernel: [] ext3_get_block+0x64/0x6c [ext3]
Jan 11 16:43:31 renault kernel: [] __block_write_full_page+0xd8/0x2ae
Jan 11 16:43:34 renault kernel: [] ext3_get_block+0x0/0x6c [ext3]
Jan 11 16:43:37 renault kernel: [] block_write_full_page+0xa4/0xad
Jan 11 16:43:42 renault kernel: [] ext3_get_block+0x0/0x6c [ext3]
Jan 11 16:43:43 renault kernel: [] ext3_ordered_writepage+0xce/0x13a [ext3]
Jan 11 16:43:45 renault kernel: [] bget_one+0x0/0x6 [ext3]
Jan 11 16:43:47 renault kernel: [] pageout+0x88/0xc5
Jan 11 16:43:49 renault kernel: [] shrink_list+0x209/0x4ea
Jan 11 16:43:51 renault kernel: [] __pagevec_release+0x15/0x1d
Jan 11 16:43:53 renault kernel: [] shrink_cache+0x1ff/0x454
Jan 11 16:43:55 renault kernel: [] shrink_zone+0x8f/0x9e
Jan 11 16:43:55 renault kernel: [] balance_pgdat+0x197/0x2cb
Jan 11 16:43:57 renault kernel: [] kswapd+0xb9/0xbb
Jan 11 16:43:57 renault kernel: [] autoremove_wake_function+0x0/0x2d
Jan 11 16:44:06 renault kernel: Mem-info:
Jan 11 16:44:06 renault kernel: DMA per-cpu:
Jan 11 16:44:07 renault kernel: cpu 0 hot: low 2, high 6, batch 1
Jan 11 16:44:10 renault kernel: cpu 0 cold: low 0, high 2, batch 1
Jan 11 16:44:11 renault kernel: Normal per-cpu:
Jan 11 16:44:12 renault kernel: cpu 0 hot: low 32, high 96, batch 16
Jan 11 16:44:14 renault kernel: cpu 0 cold: low 0, high 32, batch 16
Jan 11 16:44:15 renault kernel: HighMem per-cpu:
Jan 11 16:44:16 renault kernel: cpu 0 hot: low 32, high 96, batch 16
Jan 11 16:44:17 renault kernel: cpu 0 cold: low 0, high 32, batch 16
Jan 11 16:44:19 renault kernel:
Jan 11 16:44:20 renault kernel: Free pages: 704kB (704kB HighMem)
Jan 11 16:44:21 renault kernel: Active:570649 inactive:9253 dirty:16 writeback:6656 unstable:0 free:176 slab:44719 mapped:571245 pagetables:208368
Jan 11 16:44:22 renault kernel: DMA free:0kB min:16kB low:32kB high:48kB active:196kB inactive:4kB present:16384kB pages_scanned:410 all_unreclaimable? yes
Jan 11 16:44:22 renault kernel: protections[]: 0 0 0
Jan 11 16:44:23 renault kernel: Normal free:0kB min:936kB low:1872kB high:2808kB active:185860kB inactive:31000kB present:901120kB pages_scanned:264 all_unreclaimable? no
Jan 11 16:44:26 renault kernel: protections[]: 0 0 0
Jan 11 16:44:27 renault kernel: HighMem free:704kB min:512kB low:1024kB high:1536kB active:2096540kB inactive:6008kB present:3276800kB pages_scanned:0 all_unreclaimable? no
Jan 11 16:44:28 renault kernel: protections[]: 0 0 0
Jan 11 16:44:29 renault kernel: DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
Jan 11 16:44:31 renault kernel: Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
Jan 11 16:44:32 renault kernel: HighMem: 48*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 704kB
Jan 11 16:44:33 renault kernel: Swap cache: add 225863514, delete 225830289, find 30533854/57772199, race 626+1803
Jan 11 16:44:35 renault kernel: 0 bounce buffer pages
Jan 11 16:44:36 renault kernel: Free swap: 4227380kB
Jan 11 16:44:37 renault kernel: 1048576 pages of RAM
Jan 11 16:44:38 renault kernel: 622504 pages of HIGHMEM
Jan 11 16:44:39 renault kernel: 206102 reserved pages
Jan 11 16:44:40 renault kernel: 5185907 pages shared
Jan 11 16:44:43 renault kernel: 33225 pages swap cached
Jan 11 17:52:07 renault kernel: kswapd0: page allocation failure. order:0, mode:0x50
Jan 11 17:52:14 renault kernel: [] __alloc_pages+0x28b/0x29d
Jan 11 17:52:14 renault kernel: [] find_lock_page+0x1d4/0x1d9
Jan 11 17:52:14 renault kernel: [] find_or_create_page+0x39/0x72
Jan 11 17:52:14 renault kernel: [] grow_dev_page+0x2a/0x1eb
Jan 11 17:52:14 renault kernel: [] __getblk_slow+0xd5/0xf9
Jan 11 17:52:14 renault kernel: [] __getblk+0x3f/0x49
Jan 11 17:52:14 renault kernel: [] __bread+0x9/0x1e
Jan 11 17:52:14 renault kernel: [] read_block_bitmap+0x29/0x4d [ext3]





================After reboot================

Jan 12 09:10:14 renault kernel: BIOS-provided physical RAM map:
Jan 12 09:10:14 renault kernel: BIOS-e820: 0000000000000000 - 00000000000a0000 (usable)
Jan 12 09:10:14 renault kernel: BIOS-e820: 0000000000100000 - 00000000cffa8000 (usable)
Jan 12 09:10:14 renault kernel: BIOS-e820: 00000000cffa8000 - 00000000cffb7c00 (ACPI data)
Jan 12 09:10:14 renault kernel: BIOS-e820: 00000000cffb7c00 - 00000000d0000000 (reserved)
Jan 12 09:10:14 renault kernel: BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
Jan 12 09:10:14 renault kernel: BIOS-e820: 00000000fe000000 - 0000000100000000 (reserved)
Jan 12 09:10:14 renault kernel: BIOS-e820: 0000000100000000 - 0000000230000000 (usable)
Jan 12 09:10:14 renault kernel: Warning only 4GB will be used.
Jan 12 09:10:14 renault kernel: Use a PAE enabled kernel.
Jan 12 09:10:14 renault kernel: 3200MB HIGHMEM available.
Jan 12 09:10:14 renault syslog: klogd startup succeeded
Jan 12 09:10:14 renault kernel: 896MB LOWMEM available.
Jan 12 09:10:14 renault kernel: found SMP MP-table at 000fe710
Jan 12 09:10:14 renault kernel: Using x86 segment limits to approximate NX protection
Jan 12 09:10:14 renault kernel: zapping low mappings.
Jan 12 09:10:14 renault kernel: DMI 2.4 present.
Jan 12 09:10:14 renault kernel: ServerWorks chipset detected. Disabling timer routing over 8254.
Jan 12 09:10:14 renault irqbalance: irqbalance startup succeeded
Jan 12 09:10:14 renault kernel: ACPI: PM-Timer IO Port: 0x808
Jan 12 09:10:14 renault kernel: ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Jan 12 09:10:14 renault kernel: Processor #0 6:15 APIC version 20
Jan 12 09:10:14 renault kernel: ACPI: LAPIC (acpi_id[0x02] lapic_id[0x04] enabled)
Jan 12 09:10:14 renault kernel: Processor #4 6:15 APIC version 20
Jan 12 09:10:14 renault kernel: WARNING: NR_CPUS limit of 1 reached. Processor ignored.
Jan 12 09:10:14 renault kernel: ACPI: LAPIC (acpi_id[0x03] lapic_id[0x01] enabled)

Jan 12 09:10:14 renault kernel: highmem bounce pool size: 64 pages
Jan 12 09:10:14 renault kernel: Total HugeTLB memory allocated, 0


free
             total       used       free     shared    buffers     cached
Mem:       3369896    3325700      44196          0      24188    1656060
-/+ buffers/cache:    1645452    1724444
Swap:     12910584     230332   12680252


Thanks
Sunny
4 REPLIES
Matti_Kurkela
Honored Contributor
Solution

Re: memory issues causing server hang

> Jan 11 16:42:49 renault kernel: kswapd0: page allocation failure. order:0, mode:0x50

(a kernel stack trace follows)

Looks like your system was critically low on "normal" and/or "DMA-capable" memory. The kernel could not find even a single free page of memory (order:0 means a request for one page) while running some ext3 filesystem code. When that happens, the kernel starts looking for pages it can reclaim. Apparently it found some.

The stack trace may look scary, but it just allows the kernel developers to pin-point exactly what the kernel was doing when the error was detected. Sometimes it's useful, here it doesn't seem to be important.
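
To show how the Mem-info part of the log can be read, here is a minimal Python sketch that compares each zone's "free" value with its "min" watermark. The function name zones_below_min and the regular expression are assumptions of mine, based only on the three zone lines quoted in the original post (timestamps and trailing fields dropped); it is not a general kernel log parser.

```python
import re

# Zone summary lines copied from the log in the original post
# (syslog prefix and trailing fields removed for brevity).
LOG = """\
DMA free:0kB min:16kB low:32kB high:48kB active:196kB inactive:4kB present:16384kB
Normal free:0kB min:936kB low:1872kB high:2808kB active:185860kB inactive:31000kB present:901120kB
HighMem free:704kB min:512kB low:1024kB high:1536kB active:2096540kB inactive:6008kB present:3276800kB
"""

# Assumed line format: "<zone> free:<n>kB min:<n>kB ..."
ZONE_RE = re.compile(r"^(\w+) free:(\d+)kB min:(\d+)kB", re.MULTILINE)

def zones_below_min(text):
    """Return the names of zones whose free memory is under the min watermark."""
    return [zone for zone, free, minimum in ZONE_RE.findall(text)
            if int(free) < int(minimum)]

print(zones_below_min(LOG))
```

For the lines above it reports the DMA and Normal zones, which matches the "0kB" free values in the log; HighMem was still just above its minimum.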

After the reboot, everything looks normal, except for two things:

> Jan 12 09:10:14 renault kernel: Warning only 4GB will be used.
> Jan 12 09:10:14 renault kernel: Use a PAE enabled kernel.

Apparently your system is now running a kernel which can handle at most 4 GB of memory (the structural limit of 32-bit systems without PAE technology). Your system seems to have more than that, but with the current kernel, you're limited to 4 GB.
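
The "more than that" can be checked from the BIOS-e820 lines in the boot log. The following is a small Python sketch of that arithmetic: it sums the ranges marked "usable", once in full and once cut off at the 4 GB address boundary. The helper name usable_bytes is my own; the ranges are copied from the log.

```python
# (start, end, type) copied from the BIOS-e820 lines in the post-reboot log.
E820 = [
    (0x0000000000000000, 0x00000000000a0000, "usable"),
    (0x0000000000100000, 0x00000000cffa8000, "usable"),
    (0x00000000cffa8000, 0x00000000cffb7c00, "ACPI data"),
    (0x00000000cffb7c00, 0x00000000d0000000, "reserved"),
    (0x00000000e0000000, 0x00000000f0000000, "reserved"),
    (0x00000000fe000000, 0x0000000100000000, "reserved"),
    (0x0000000100000000, 0x0000000230000000, "usable"),
]

FOUR_GIB = 1 << 32  # highest address a non-PAE 32-bit kernel can reach

def usable_bytes(ranges, limit=None):
    """Sum the sizes of 'usable' ranges, optionally ignoring addresses >= limit."""
    total = 0
    for start, end, kind in ranges:
        if kind != "usable":
            continue
        if limit is not None:
            end = min(end, limit)
        if end > start:
            total += end - start
    return total

total = usable_bytes(E820)
below_4g = usable_bytes(E820, FOUR_GIB)
print(f"usable RAM reported by BIOS: {total / 2**30:.2f} GiB")
print(f"usable RAM below 4 GiB:      {below_4g / 2**30:.2f} GiB")
```

By this count the BIOS reports roughly 8 GiB of usable RAM, of which only about 3.25 GiB sits below the 4 GiB boundary. That is in line with the total of 3369896 kB shown by "free" in the original post.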

> Jan 12 09:10:14 renault kernel: WARNING: NR_CPUS limit of 1 reached. Processor ignored.

You are running a multi-processor or multi-core system with a single-processor kernel. Looks like your system has two processors/cores, but you're now using only one.

Solution: install the "kernel-smp" package from the RHEL 4 distribution if it isn't already installed. If you need it, install the matching "kernel-smp-devel" package too. It supports both multiple processors and PAE, so it will fix both of your problems.

Or perhaps your onsite guy simply chose the wrong kernel from the GRUB boot menu when rebooting the system?

Check /boot/grub/grub.conf to make sure the SMP kernel is set as default. Then reboot the system to make it use the SMP (=multi-processor) kernel.
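
If you prefer to check this with a script, here is a rough Python sketch that reads a GRUB (legacy) configuration and reports which "title" entry the "default=" line selects. The sample file contents are hypothetical: only the kernel version strings come from this thread, and the entry order, titles, and kernel arguments are made up for illustration. The sketch also assumes a numeric "default=N" line, not "default=saved".

```python
# Hypothetical grub.conf contents; only the version strings are from this thread.
SAMPLE_GRUB_CONF = """\
default=2
timeout=5
title Enterprise Linux (2.6.9-42.0.0.0.1.ELhugemem)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-42.0.0.0.1.ELhugemem ro root=LABEL=/
title Enterprise Linux (2.6.9-42.0.0.0.1.ELsmp)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-42.0.0.0.1.ELsmp ro root=LABEL=/
title Enterprise Linux (2.6.9-42.0.0.0.1.EL)
        root (hd0,0)
        kernel /vmlinuz-2.6.9-42.0.0.0.1.EL ro root=LABEL=/
"""

def default_title(conf_text):
    """Return the title of the entry selected by 'default=N' (entries count from 0)."""
    default = 0  # GRUB legacy uses entry 0 when no default line is present
    titles = []
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("default="):
            default = int(line.split("=", 1)[1])  # assumes a number, not "saved"
        elif line.startswith("title "):
            titles.append(line[len("title "):].strip())
    return titles[default]

print(default_title(SAMPLE_GRUB_CONF))
```

On the real system you would pass it the contents of /boot/grub/grub.conf. In this made-up sample the default points at the plain EL entry, so "default" would have to be changed to 1 to boot the SMP kernel.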

"fsck on swap FS" sounds strange, unless you're using a filesystem swap. On a swap partition/LV there is normally no filesystem, so there is nothing to fsck. If a swap partition has errors, the fix is to re-run "mkswap" on it (just like when starting to use it) before activating it with the "swapon" command.

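One way to tell whether a swap area is still intact is to look for the signature that mkswap writes into the last 10 bytes of the first page. The Python sketch below does that check on in-memory data only; the function name has_swap_signature is my own, and the 4096-byte page size is an assumption that holds for 32-bit x86.

```python
# Signatures written by mkswap: new-style and old-style swap areas.
SWAP_MAGICS = (b"SWAPSPACE2", b"SWAP-SPACE")

def has_swap_signature(first_page, page_size=4096):
    """True if the last 10 bytes of the first page hold a Linux swap signature."""
    if len(first_page) < page_size:
        return False  # not enough data to contain the signature
    return first_page[page_size - 10:page_size] in SWAP_MAGICS

# Demo on synthetic data instead of a real device.
blank = bytes(4096)                               # all zeros, no signature
formatted = bytes(4096 - 10) + b"SWAPSPACE2"      # signature in the right place
print(has_swap_signature(blank), has_swap_signature(formatted))
```

To use it on the real system you would read the first page of the swap device as root and pass those bytes in. If the signature is missing, re-running mkswap as described above is the fix.
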
MK
Sunny Jaisinghani
Trusted Contributor

Re: memory issues causing server hang

Hello Matti,

Thanks for the detailed description.

You were right; there are 3 kernels installed:
2.6.9-42.0.0.0.1.ELhugemem
2.6.9-42.0.0.0.1.ELsmp
2.6.9-42.0.0.0.1.EL

The onsite guy booted the 2.6.9-42.0.0.0.1.EL kernel.

I'll reboot with the SMP kernel, so the CPU limit and the memory limit won't be a problem then.
Sunny Jaisinghani
Trusted Contributor

Re: memory issues causing server hang

What is the difference between

2.6.9-42.0.0.0.1.ELhugemem and
2.6.9-42.0.0.0.1.ELsmp?

Which one is more suitable for my hardware?
Matti_Kurkela
Honored Contributor

Re: memory issues causing server hang

2.6.9-42.0.0.0.1.ELhugemem
2.6.9-42.0.0.0.1.ELsmp
2.6.9-42.0.0.0.1.EL

EL = the default, single-processor kernel. Can use up to 4 GB of memory, total. Optimized for small systems.

ELsmp = multi-CPU kernel. Supports up to 16 GB of memory (using the PAE technology) and multiple CPUs.

ELhugemem = can support multiple CPUs and up to 64 GB of memory (the maximum allowed by the PAE technology). Optimized for "huge" systems (at the time of the introduction of RHEL 4; they don't seem so huge today).

If you need more than 64 GB of memory, you must install the 64-bit version (x86_64) of the OS. Switching from the 32-bit version (what you have now) to 64-bit will require OS re-installation.

Although a 32-bit OS with PAE can handle up to 64 GB, it is less efficient than a real 64-bit OS and limits the maximum size of individual processes to 4 GB. So I would definitely recommend using a 64-bit OS instead of relying on PAE if you have more than, say, 16 GB of memory.
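
The rules of thumb above can be summed up in a few lines of Python. This is only a sketch of the limits described in this post (4 GB, 16 GB, 64 GB); the function name suggest_kernel and the returned strings are made up for illustration.

```python
def suggest_kernel(ram_gb, cpus):
    """Map RAM size and CPU count to a RHEL 4 kernel flavor, per the limits above."""
    if ram_gb > 64:
        # Beyond what PAE can address: only the 64-bit OS will do.
        return "x86_64 (requires OS re-installation)"
    if ram_gb > 16:
        # Works on 32-bit, but a 64-bit OS is the better choice here.
        return "ELhugemem (x86_64 recommended instead)"
    if cpus > 1 or ram_gb > 4:
        return "ELsmp"
    return "EL"

# A hypothetical box with 8 GB of RAM and 2 CPUs:
print(suggest_kernel(8, 2))
```

For the hypothetical 8 GB, 2-CPU box in the last line, the answer is ELsmp.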

By the way, the current RHEL kernel versions are 2.6.9-89.0.19.EL*. Your 2.6.9-42* versions are pretty old.

MK