rhel as 4 update 1 or 2 random crashes

 
Boris Kulikov
Occasional Contributor

Hello!

A few x86_64 ProLiant DL380 G4 machines running RHEL AS 4 with Update 1 or Update 2 crash at random, with no messages in syslog and no crash dumps (a crash dump is not possible).

The latest ProLiant Support Pack is installed on both machines. I use the bcm5700 network driver instead of tg3.

See the crash log from the iLO console in the attachment.

While investigating this issue I found:

1) A strange entry in dmesg when hpasm starts:
(service hpasm start or service hpasm start hpasmd)

Losing some ticks... checking if CPU frequency changed.
warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip __do_softirq+0x4d/0xd0

It looks similar to the do_softirq entry in the saved iLO crash log.

These messages appear in dmesg after the hpasmd daemon starts, with or without the "notaint" option in /opt/compaq/cma.conf.
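
To check how often this warning shows up and which symbol it names, saved dmesg output can be scanned with a short script. This is only a sketch: the function name `find_lost_tick_reports` is made up for illustration, and it assumes the "rip" line appears within three lines after the warning, as in the excerpt quoted above.

```python
import re

# Warning text copied from the dmesg excerpt quoted in this post.
LOST_TICKS = "warning: many lost ticks."
# Matches lines such as "rip __do_softirq+0x4d/0xd0" and captures the symbol name.
RIP_LINE = re.compile(r"^rip\s+(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+")

def find_lost_tick_reports(dmesg_text):
    """Return the symbol from the 'rip' line that follows each lost-ticks warning."""
    symbols = []
    lines = dmesg_text.splitlines()
    for i, line in enumerate(lines):
        if LOST_TICKS in line:
            # Assumption: the rip line is one of the next three lines.
            for follow in lines[i + 1:i + 4]:
                match = RIP_LINE.match(follow.strip())
                if match:
                    symbols.append(match.group(1))
                    break
    return symbols

sample = """Losing some ticks... checking if CPU frequency changed.
warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip __do_softirq+0x4d/0xd0
"""
print(find_lost_tick_reports(sample))  # prints ['__do_softirq']
```

Feeding it the text of a saved dmesg capture gives one symbol per warning, which makes it easier to compare against the entry in the iLO crash log.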

2) It seems strange to me that a user-level daemon can do such a bad thing, but when I run "file" on it, it turns out to be very old and not a 64-bit binary.

file /opt/compaq/hpasmd/bin/hpasmd
/opt/compaq/hpasmd/bin/hpasmd: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, dynamically linked (uses shared libs), stripped
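
The same 32-bit/64-bit check can be done without "file" by reading the fifth byte of the ELF header (EI_CLASS), where 1 means 32-bit and 2 means 64-bit. A minimal sketch follows; the function names are illustrative only, and the demonstration uses synthetic header bytes rather than the real hpasmd binary.

```python
def elf_class(header: bytes) -> str:
    """Classify a file from its first five bytes (the start of e_ident)."""
    if len(header) < 5 or header[:4] != b"\x7fELF":
        return "not ELF"
    # e_ident[4] is EI_CLASS: 1 = ELFCLASS32, 2 = ELFCLASS64.
    return {1: "ELF 32-bit", 2: "ELF 64-bit"}.get(header[4], "unknown class")

def elf_class_of_file(path: str) -> str:
    """Read the first five bytes of the file at 'path' and classify them."""
    with open(path, "rb") as f:
        return elf_class(f.read(5))

# Synthetic headers for demonstration only.
print(elf_class(b"\x7fELF\x01"))  # prints ELF 32-bit
print(elf_class(b"\x7fELF\x02"))  # prints ELF 64-bit
```

On the affected machine, `elf_class_of_file("/opt/compaq/hpasmd/bin/hpasmd")` would be expected to agree with the "file" output above.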

This issue must be investigated and fixed.

Best regards,
Boris
Shannon_44
New Member

Re: rhel as 4 update 1 or 2 random crashes

Your number 1 is well understood and a report has been sent to Red Hat for them to address. It is not really a problem because the kernel catches itself up. There is nothing wrong with the HW; Red Hat is just making a bad guess.

Your number 2 is correct. Even on an x86-64 system, hpasmd and all the other agents in the hpasm package are 32-bit apps.

As for the random crashes, please provide a little more info. It might be helpful to provide the output of "hplog -v".
Boris Kulikov
Occasional Contributor

Re: rhel as 4 update 1 or 2 random crashes

The hplog -v output contains only this string:
ASR Detected by System ROM
dirk dierickx
Honored Contributor

Re: rhel as 4 update 1 or 2 random crashes

Have you tried booting the system without APIC support? Add that option to LILO or GRUB and reboot.

My guess is that APIC is screwing up your system.
Shannon_44
New Member

Re: rhel as 4 update 1 or 2 random crashes

ASR (Automatic Server Recovery) means that hpasmd for some reason cannot update the countdown timer. Either hpasmd is being killed or the system is hanging. This feature can be turned off in the ROM-Based Setup Utility (RBSU) or with "hplog -a DISABLE" at the OS level.

Try turning ASR off and see if you notice any system hangs or hpasmd dying. If you don't see either of those situations, then maybe try increasing the timeout to ten minutes. It might be possible that another process is taking up so much CPU time that hpasmd doesn't get a chance to update the counter.
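
The countdown behaviour described above can be pictured with a toy model. This is only a sketch of a generic watchdog timer, not the actual ASR implementation: the class, the one-second step and the refresh schedule are all assumptions made for illustration. Only the ten-minute timeout comes from the suggestion above.

```python
class CountdownWatchdog:
    """Toy countdown timer that expires unless it is refreshed in time."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.remaining = timeout_s
        self.expired = False

    def refresh(self):
        # A healthy daemon calls this to restart the countdown.
        if not self.expired:
            self.remaining = self.timeout_s

    def tick(self, elapsed_s):
        # Time passes; once the counter reaches zero the watchdog fires.
        if self.expired:
            return
        self.remaining -= elapsed_s
        if self.remaining <= 0:
            self.expired = True

def simulate(timeout_s, refresh_times, duration_s):
    """Step one second at a time; return the expiry time in seconds, or None."""
    watchdog = CountdownWatchdog(timeout_s)
    refreshes = set(refresh_times)
    for t in range(1, duration_s + 1):
        watchdog.tick(1)
        if watchdog.expired:
            return t
        if t in refreshes:
            watchdog.refresh()
    return None

# Daemon refreshes every 60 s for the whole run: the watchdog never fires.
print(simulate(600, range(60, 2001, 60), 2000))  # prints None
# Daemon is starved after t=300 s: the watchdog fires 600 s later.
print(simulate(600, range(60, 301, 60), 2000))   # prints 900
```

In this model a longer timeout only helps if the daemon eventually gets to run again; a daemon that is killed or a system that hangs still ends in an expiry.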
Boris Kulikov
Occasional Contributor

Re: rhel as 4 update 1 or 2 random crashes

No success.
Server crashed with or without ASR.
hpasmd works fine in all cases.
Shannon_44
New Member

Re: rhel as 4 update 1 or 2 random crashes

Do you get the panic when the hpasm package is not installed? What HW devices do you have in the system (e.g. HBAs, NICs, etc.)? What SW are you running? Is this a standard Red Hat kernel?

thanks
Shannon
Boris Kulikov
Occasional Contributor

Re: rhel as 4 update 1 or 2 random crashes

- Do you get the panic when the hpasm package is not installed?
Yes (when the hpasm packages are not running).

- What HW devices do you have in the system (e.g. HBAs, NICs, etc.)?
DL380G4 devices and an FCA2214 HBA. Two CPUs, 4 GB memory.

- What SW are you running?
RHEL 4 AS Update 1
HP Proliant Support Pack 7.40
Cyrus IMAP server
DHCP server
Apache WEB server
SQUID proxy server
OpenAFS client and server
- Is this a standard Red Hat kernel?
Yes.
Matti_Kurkela
Honored Contributor

Re: rhel as 4 update 1 or 2 random crashes

I have similar crashes on one of my DL385s.

Details:
Server: Proliant DL385 (G1)
CPU: 1x Opteron 275 dual-core
Memory: 16 GB
OS: RedHat ES4 Update 3 (64-bit x86_64 version)
Kernel version: 2.6.9-34.0.1.ELsmp

After a restart, the server generally runs just fine for a day, then it is usually found hung the next morning. (Actually not quite hung: it still responds to pings, but does not accept network connections. The console is frozen too.)

Our application development guys seem to be running performance tests at night-time, so server load might be a factor.

The 32-bit versions of RedHat ES4 don't suffer from this problem: for application support reasons, we are running some DL385s with 64-bit RHES4 and some with 32-bit RHES4.

I did some googling on this: based on comments on the Linux-kernel mailing list, it looks like the problem might be with the AMD 8111 chipset (inaccurate timer implementation?) and/or the fact that the Opteron dynamically changes the CPU frequency according to load. The hpasmd daemon might be only marginally related.

In the 32-bit Linux kernel, there are several options for OS real-time clock, because of the history of PC hardware evolution. There are fallback mechanisms in case the "best" available method is found unreliable.

In the x86_64 Linux kernel, some of these fallback mechanisms are different or not implemented... probably because it was assumed that a server with 64-bit CPU would not need to fall all the way back to original IBM PC/AT timer technology :-)

The code in question can be found in Linux kernel source:
/arch/i386/kernel/time.c
and
/arch/x86_64/kernel/time.c

Some of those timing methods need to be aware of CPU speed changes, some use a hardware timer in the Power Management subsystem or something similar.
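
To illustrate why a cycle-counter-based clock has to know about CPU speed changes, here is a small arithmetic sketch. It assumes a counter that ticks at the current core frequency and a kernel that keeps converting cycles with the frequency it measured earlier; the function name and the 2 GHz / 1 GHz figures are made up for illustration.

```python
def apparent_elapsed(real_seconds, actual_hz, assumed_hz):
    """Seconds the OS would compute if it divides cycles by a stale frequency."""
    cycles = real_seconds * actual_hz   # cycles really counted at the current speed
    return cycles / assumed_hz          # conversion using the old calibration

# Hypothetical numbers: calibrated at 2 GHz, then the CPU scales down to 1 GHz.
print(apparent_elapsed(10, 1_000_000_000, 2_000_000_000))  # prints 5.0
# No frequency change: the computed time matches real time.
print(apparent_elapsed(10, 2_000_000_000, 2_000_000_000))  # prints 10.0
```

In the first case the clock falls behind by half, which is the kind of mismatch a "lost ticks" check would be expected to notice.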

MK
Matti_Kurkela
Honored Contributor

Re: rhel as 4 update 1 or 2 random crashes

Update: I tried running my DL385 with kernel 2.6.16.20, which should fix the "warning: many lost ticks" problem... and sure enough, I did not see that message anymore.

However, my DL385 still keeps crashing, and the crashes seem to be getting more frequent, regardless of which kernel I'm using.

Now I'm beginning to suspect a faulty CPU.

One of our Proliant DL385s with a 32-bit RHES4 had a similar situation: the CPU seemed fine on low load, but running a performance test brought the machine down consistently within 10 minutes of starting the test. There was no message at all in the hardware log, nor on the console: the system just froze. Changing the CPU fixed the problem.

I'm beginning to think that at least in my case the "many lost ticks" message and the crashes are two separate problems, perhaps completely unrelated to each other.

Based on the similarity with the 32-bit RHES4 case, I've opened a hardware call for my troubled 64-bit RHES4 server. Tomorrow I should have the server up with a new CPU, and then we'll see whether it helps or not...
MK
Matti_Kurkela
Honored Contributor

Re: rhel as 4 update 1 or 2 random crashes

Another update: replacing the CPU did not help in this case, but replacing the motherboard seems to have fixed the problem.

The server has been working for almost a week now. Our developers ran some stress tests on it (while I was busy on other projects), including running a "cpuburn" utility over a weekend.
MK
Jun Yu
Frequent Advisor

Re: rhel as 4 update 1 or 2 random crashes

Hi Boris,

I've checked your attachment and here are some suggestions:

1. Work with HP service to double-check the server's hardware status, especially the processors, memory and System ROM settings.

2. You didn't mention whether your Linux is x86-64. If it is not, use it instead. RHEL4U2 (x86-64) is very stable on our HP servers, and it should be the same on yours.

3. Don't install anything except the Linux OS: no HP PSP, no additional drivers or applications.

4. Stress-test the above "clean" server with tools such as the LTP kit, or run stress tests with individual tools (processor, I/O, memory subsystem, disk system...).

I've handled lots of similar cases during the past two years and used the above steps to quickly isolate potential issues and troubleshoot the problem.

Finally, before you test the OS on those servers, verifying hardware health should always be the first priority. Most issues are hardware related.

just for fun
Thomas Vertetis
New Member

Re: rhel as 4 update 1 or 2 random crashes

Did anyone ever get a resolution to the random reboots?

0002 Critical 15:03 08/09/2006 15:03 08/09/2006 0001
LOG: ASR Detected by System ROM

Redhat ES4 / HP320.

Thanks.

Matti_Kurkela
Honored Contributor

Re: rhel as 4 update 1 or 2 random crashes

My case turned out to be faulty hardware. Replacing the CPU did not fix it, but after replacing the motherboard the machine worked reliably again.

Sorry I could not respond earlier: I had a very busy time and then my summer vacation immediately after that.
MK