
Strange lockup issues.

 
Ray Van Dolson
Advisor

We're using quite a few HP DL140's as PPTP Concentrators -- servers running Fedora Core 1 + the PoPToP package. Our DL140's are dual processor 2.4GHz machines.

We've noticed that randomly, but generally within 7 days of startup, these servers freeze up and have to be reset. There is always a kernel Oops that I have captured with Netdump / serial console:

ksymoops 2.4.9 on i686 2.4.21-20.ELcustom-mppe-20040928.1. Options used
-v /usr/src/linux-2.4/vmlinux (specified)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.21-20.ELcustom-mppe-20040928.1/ (default)
-m /usr/src/linux/System.map (default)

Unable to handle kernel NULL pointer dereference<7>divert: not allocating divert_blk for
non-ethernet device ppp453
00000000
*pde = 38853067
Oops: 0000
CPU: 0
EIP: 0060:[<00000000>] Tainted: P
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010282
eax: e8929000 ebx: e87a7000 ecx: e8965900 edx: c01a61b6
esi: 00000000 edi: e87a7000 ebp: efe72100 esp: c87bfed4
ds: 0068 es: 0068 ss: 0068
Process pptpctrl (pid: 23254, stackpage=c87bf000)
Stack: c01aa5d0 e8929000 00000000 c01a7c55 e87a7000 efa80380 00000005 e87a7000
efe72100 00000004 00000010 c01a45a5 e87a7000 efe72100 00000000 00000000
efe72100 c016db84 efe72100 00000000 c87be000 00000145 c87be000 00000004
Call Trace: [<c01aa5d0>] pty_chars_in_buffer [kernel] 0x32 (0xc87bfed4)
[<c01a7c55>] normal_poll [kernel] 0x105 (0xc87bfee0)
[<c01a45a5>] tty_poll [kernel] 0x83 (0xc87bff00)
[<c016db84>] do_select [kernel] 0x230 (0xc87bff18)
[<c016deee>] sys_select [kernel] 0x33c (0xc87bff5c)
Code: Bad EIP value.


>>EIP; 00000000 Before first symbol

>>eax; e8929000 <_end+2841f5e8/38303648>
>>ebx; e87a7000 <_end+2829d5e8/38303648>
>>ecx; e8965900 <_end+2845bee8/38303648>
>>edx; c01a61b6
>>edi; e87a7000 <_end+2829d5e8/38303648>
>>ebp; efe72100 <_end+2f9686e8/38303648>
>>esp; c87bfed4 <_end+82b64bc/38303648>

Trace; c01aa5d0
Trace; c01a7c55
Trace; c01a45a5
Trace; c016db84
Trace; c016deee
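
To compare oopses captured from several machines, the call trace can be pulled out of the saved text with a short script. This is only a minimal sketch in Python 3, assuming the ksymoops line layout shown in this post; `TRACE_RE`, `call_trace` and `SAMPLE` are hypothetical names made up for illustration, not part of ksymoops or Netdump. The bracketed address is treated as optional because some captures lose it.

```python
import re

# Matches ksymoops-style trace entries such as
#   "[] normal_poll [kernel] 0x105 (0xc87bfee0)"
# The "<address>" inside the brackets is optional (assumption: it may be
# missing from a saved capture).
TRACE_RE = re.compile(
    r"\[(?:<[0-9a-fA-F]+>)?\]\s+(\w+)\s+\[(\w+)\]\s+(0x[0-9a-fA-F]+)"
)


def call_trace(oops_text):
    """Return (function, module, offset) tuples in the order they appear."""
    return [
        (m.group(1), m.group(2), int(m.group(3), 16))
        for m in TRACE_RE.finditer(oops_text)
    ]


# Trace lines copied from the oops in this post.
SAMPLE = """\
Call Trace: [] pty_chars_in_buffer [kernel] 0x32 (0xc87bfed4)
[] normal_poll [kernel] 0x105 (0xc87bfee0)
[] tty_poll [kernel] 0x83 (0xc87bff00)
[] do_select [kernel] 0x230 (0xc87bff18)
[] sys_select [kernel] 0x33c (0xc87bff5c)
"""

if __name__ == "__main__":
    for func, module, offset in call_trace(SAMPLE):
        print(f"{func}+{offset:#x} [{module}]")
```

If every crash shows the same chain ending in pty_chars_in_buffer, that at least suggests the lockups share one code path rather than being random corruption.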

We pulled our hair out over this one for weeks... I tried using Red Hat Enterprise ES3, dropping back to Red Hat 7.3, using various versions of the MPPE module we load... nothing worked. I also began using the bcm5700 from HP instead of the built-in tg3 driver that comes with Red Hat. Still the freezes would occur.

Finally I fired up the server in nosmp noapic mode... things got more stable and in fact there weren't any more crashes! Through the grapevine, I heard from others using Broadcom NIC's that they had issues and that noapic mode sometimes solved the problem.

So I rebooted again with only noapic (SMP left enabled)... alas, the lockups still occurred.

So currently all our DL140's are running in noapic nosmp mode which basically wastes one entire processor. But at least we're stable now.
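
With a fleet of these boxes it is easy to lose track of which ones actually booted with both options. A rough sketch in Python 3 that checks a kernel command line for them, assuming it is read from /proc/cmdline; `boot_flags` and `read_cmdline` are hypothetical helper names, and the example command line at the bottom is made up for illustration.

```python
def boot_flags(cmdline, wanted=("nosmp", "noapic")):
    """Map each wanted flag to True if it appears as a whole word in cmdline."""
    words = set(cmdline.split())
    return {flag: flag in words for flag in wanted}


def read_cmdline(path="/proc/cmdline"):
    """Return the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read().strip()


if __name__ == "__main__":
    try:
        cmdline = read_cmdline()
    except OSError:
        # Not on Linux, or /proc is unavailable: fall back to a made-up example.
        cmdline = "ro root=LABEL=/ noapic nosmp"
    for flag, present in boot_flags(cmdline).items():
        print(f"{flag}: {'set' if present else 'NOT set'}")
```

Matching on whole words avoids false hits on longer options that merely contain one of the strings.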

Anyone have any insight into this? Anything I should try next? Would love to have SMP working again... but I cannot afford to have hundreds of customers getting disconnected at random hours throughout the day. :-)

Thanks...
7 REPLIES
Steven E. Protter
Exalted Contributor

Re: Strange lockup issues.

As a suggestion, test the 2.6 Kernel.

Has up2date been run on the boxes?

That's always a good idea.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Jerome Henry
Honored Contributor

Re: Strange lockup issues.

Have you checked Bugzilla for Fedora Core 1 trouble with SMP? There is a known hang on Xeon SMP systems. I don't know if that matches your hardware spec, but if so, you'd better upgrade, as it's a kernel issue:
http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=109497
hth

J
You can lean only on what resists you...
Ray Van Dolson
Advisor

Re: Strange lockup issues.

Interesting, hadn't read that Bugzilla bug before. They make it sound like the issue had been addressed in kernel .2188 (we run .2199 on our FC1 boxes) and as I said, we also had the problem on Red Hat Enterprise running 2.4.21-20.EL...

I guess what I need to do is build up a FC2-based concentrator and see if I can reproduce the problem.

The odd thing is that this happens on *every* DL140 we use as a PPTP Concentrator. We also use DL140's for other tasks and they seem to run just fine, the only difference being that those servers pass only a fraction of the traffic that the concentrators do.

Oh, and yes, these systems are all up to date (also against the new FC1 legacy tree).

Thanks for the responses.
Stuart Browne
Honored Contributor

Re: Strange lockup issues.

If the only difference between these DL140's is the 'mppe' kernel patches, then I'd poke the PoPToP guys and see if they've had SMP issues in the past.

Unfortunately, I've only ever run the mppe stuff in uniprocessor environments.
One long-haired git at your service...
Ray Van Dolson
Advisor

Re: Strange lockup issues.

Oh, I've poked them plenty. :-) One of them pointed out that he'd had issues with his Dell server w/ Broadcom NIC chipset that required nosmp/noapic mode.

Several of them run SMP setups with no issues and I have tried a couple different MPPE modules.

At this point I'm going to try a Fedora Core 2 setup.

That and invest in a lot more remote APC power units for resetting hard locked servers ;)
Stuart Browne
Honored Contributor

Re: Strange lockup issues.

Don't these little beasties reboot themselves? With the DL and ML series we have (running the HPASM stuff), they reboot themselves when they detect a kernel panic/oops.
One long-haired git at your service...
Ray Van Dolson
Advisor

Re: Strange lockup issues.

Don't think the DL140's support hpasm :-(