Red Hat and Proliant lock up's

 
Andre ten Bohmer
Occasional Advisor

Red Hat and Proliant lock up's

Hello,

We have several Proliant servers (dl-360-g2/g3 and dl-380-g2/g3) running different Red Hat versions (ES 2.1, AS 2.1, ES 3, 9) which now and then just seem to freeze. The console is black and there is no network connectivity whatsoever. There seems to be a relation between the kernel version and the firmware, because some servers have been running fine since we installed the latest firmware (system, SCSI controller etc.) after lock-ups occurred with a new kernel. These lock-ups occur after a week, after a month, or sometimes twice a day, and there seems to be no relation to the system load or the installed software. There are no hints in the log files or on the console; the server is just dead.
Installing the latest HP management agents did not improve stability; the only change is that ASR now automatically reboots the server when it freezes.

Anyone have any suggestions?
TIA,
Andre
Don_89
Trusted Contributor

Re: Red Hat and Proliant lock up's

Are you using HP memory or aftermarket? What apps are running on these servers? Give us some more info on the hardware/software configurations.

We have a similar mix of hardware and OS versions (ES, AS 2.1 & 3.0) with no lockup problems. I've had problems with reboots on version 7.0 of the HP agents; I recommend using version 6.40.
Andre ten Bohmer
Occasional Advisor

Re: Red Hat and Proliant lock up's

Tnx!
It is standard HP hardware, no third-party memory. Some servers are running Oracle, others only Apache or Postfix. We had the same problems with hpasm 6.40. Digging deeper, the problems were/are on Intel P4 Xeon servers with hyper-threading enabled. To rule things out we could disable HT, but are there known issues with HT enabled on HP Linux servers? Another thing the servers share is that they are running a local netfilter-based firewall.
Mogens Kjaer
Frequent Advisor

Re: Red Hat and Proliant lock up's

No suggestions, but maybe we are having
the same problem?

We have three ML370's, running RH9 and
management software 6.40.

Two of the machines have locked up twice,
the third hasn't (yet) had any problems.

One or two days before the lockup, I can
see increased CPU load in Big Brother,
and I get the following messages in
/var/log/messages:

Mar 21 10:46:07 server1 kernel: raid5: multiple 0 requests for sector 4812176
Mar 21 13:09:39 server1 kernel: raid5: multiple 1 requests for sector 142849888

These are from the linux software RAID driver.

I don't think the problem is in the
software raid, but these messages are
only indications that something else
is wrong...
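
A minimal sketch of how one might watch for these
messages, assuming the syslog format shown above.
The function name count_raid5_warnings is
hypothetical, and the log path is passed as an
argument rather than assumed.

```shell
# Count software-RAID "multiple N requests for sector" kernel messages
# in a syslog file. The function name is illustrative only.
count_raid5_warnings() {
    # grep -c prints the number of matching lines (0 if none).
    grep -c 'kernel: raid5: multiple [0-9][0-9]* requests for sector' "$1"
}

# Usage (path as in the post above):
#   count_raid5_warnings /var/log/messages
```

A rising count in the days before a hang would match
the pattern described above, although it says nothing
about the underlying cause.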

The three machines are not completely identical:

server1: 4.5G RAM, 8x146G Disks
server2: 4.5G RAM, 8x146G Disks
mail: 1.5G RAM, 3x146G Disks

server2 is a mirror of server1, and not
used in daily production.

I've seen this lockup occurring on
the two production machines, server1
and mail.

Mogens
Jared Middleton
Frequent Advisor

Re: Red Hat and Proliant lock up's

We had two DL580 G2 systems that froze at random intervals (hours, days, weeks) when we first went live with RH Linux 7.3 Professional a couple of years ago. I spent parts of 3 months researching, surfing forums, and talking to HP tech support, only to find that there are any number of different causes for system lock-ups; no "silver bullet" exists.

However, we did solve OUR particular problem, which was sort of a version disconnect between hardware BIOS/drivers from HP and the kernel version from RH... so you might just check it anyway.

Go into your system BIOS (obviously need to reboot), check the setting for "MPS Table Mode", set the value to "Full Table APIC" if it's not already, reboot again.

Jar
Don_89
Trusted Contributor

Re: Red Hat and Proliant lock up's

I've looked into this a bit more. Some people have had luck with passing acpi=off to the kernel, which disables ACPI power management. I would also stop the apmd daemon from running.
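
To confirm that the option actually reached the running kernel after a reboot, the boot command line can be inspected. This is a minimal sketch, assuming a Linux system that exposes /proc/cmdline; the helper name has_kernel_param is made up for illustration.

```shell
# Return 0 if the given parameter appears as a whole word on the
# kernel command line. The second argument (file to read) defaults to
# /proc/cmdline and exists mainly so the helper can be tried on a copy.
has_kernel_param() {
    tr -s ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# Usage:
#   has_kernel_param acpi=off && echo "ACPI is disabled for this boot"
```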

The other leading cause of random lockups was SCSI termination issues. I would double-check the termination on the HDs and controller to make sure everything is in order.

You mentioned that you're running netfilter on each of the boxes. I would also check that you are not blocking the loopback interface, which internal processes use to communicate. Make sure you have something like this in your script:

iptables -A INPUT -i lo -j ACCEPT
iptables -A OUTPUT -o lo -j ACCEPT


Good luck!!!
Andre ten Bohmer
Occasional Advisor

Re: Red Hat and Proliant lock up's

Ok, apmd is/was not running but the acpi=off switch is an idea.

SCSI termination issues are not very likely but indeed to rule things out...

Netfilter is configured 'loose' for the loopback interface because we had some problems before with a more strict approach.

The suggestion Jared makes about MPS table mode is also something worth checking, tnx.

Tnx for your time!
Olivier Drouin
Trusted Contributor

Re: Red Hat and Proliant lock up's

I run Linux (AS 2.1, ES 3, 7.2) on a lot of the mentioned servers and didn't have these problems.

If you bought all your servers at the same time maybe you got a bad hardware batch.

It happened to us with the power supplies of dl360 g1.
Vitaly Karasik_1
Honored Contributor

Re: Red Hat and Proliant lock up's

I suggest you upgrade your RHEL 2.1 with the Q3 "update pack"; there was a problem with the Broadcom NIC driver.
Danny_78
New Member

Re: Red Hat and Proliant lock up's

We are having the same issue.

Andre, what did you end up doing to fix this issue for yourself?
Ross Minkov
Esteemed Contributor

Re: Red Hat and Proliant lock up's

Andre,

What runlevel do you run these servers at?
If these are real servers make sure you have
id:3:initdefault:
in /etc/inittab. This way you eliminate X altogether.
Also make sure that you have the latest firmware and PSP.
Can you login through the iLO/RILO card when these lock-ups happen? Is there anything to indicate problems in the IML?
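
The configured default runlevel can be read straight from the file. This is a minimal sketch, assuming a SysV-style /etc/inittab as on the Red Hat releases discussed here; the function name default_runlevel is hypothetical.

```shell
# Print the default runlevel configured in a SysV-style inittab.
# The file argument defaults to /etc/inittab.
default_runlevel() {
    # Fields are id:runlevels:action:process; skip comment lines.
    awk -F: '!/^#/ && $3 == "initdefault" { print $2; exit }' "${1:-/etc/inittab}"
}

# Usage:
#   default_runlevel
```

The level the system is currently running at can be compared with the output of the runlevel command.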

Regards,
Ross

Vitaly Karasik_1
Honored Contributor

Re: Red Hat and Proliant lock up's

Danny, start by upgrading your RHEL to the latest update level.
Andre ten Bohmer
Occasional Advisor

Re: Red Hat and Proliant lock up's

Danny, the problem is not solved yet, sorry. It has been an official call at HP Europe since December 2004. All firmware is at the latest level and we're using the HP Broadcom driver and the latest PSP, but to no avail. Netdump and SysRq are not functioning when the server hangs, so there is still no crash dump for the support people to work on. One server has been running at runlevel 3 (all others indeed at 5) since it hung (again) last week. iLO is not configured. The memory of one server was swapped by HP (also last week) to make sure no imitation memory is in use.
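
One thing that can be ruled out beforehand is whether the magic SysRq key is enabled at all, since it may be switched off through the kernel.sysrq setting. This is a minimal sketch, assuming the kernel exposes the setting at /proc/sys/kernel/sysrq; the helper name sysrq_enabled is hypothetical.

```shell
# Return 0 if the SysRq setting is non-zero (enabled in some form),
# 1 if it is 0 (disabled), 2 if the file cannot be read.
# The file argument defaults to /proc/sys/kernel/sysrq.
sysrq_enabled() {
    value=$(cat "${1:-/proc/sys/kernel/sysrq}") || return 2
    [ "$value" != "0" ]
}

# Usage:
#   sysrq_enabled || echo "SysRq is disabled; check kernel.sysrq in /etc/sysctl.conf"
```

Even with SysRq enabled, a hang that leaves the kernel unable to service keyboard interrupts would still give no response, so this only excludes the simplest explanation.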
Cheers,
Andre
danny_76
New Member

Re: Red Hat and Proliant lock up's

The servers are running the latest firmware and the latest PSP. Red Hat has been updated to latest software and kernel.

They are being run at run level 3.
Not sure how to connect using the iLO/RILO card. No connection through network or kvm, and all logs stop at time hard lock appears to happen.

When talking to HP they had us install the insight manager agents. This changed the behaviour from lockups to rebooting. We were able to get some dumps, but they have now come back and said it is a software issue. No problem with the hardware.

We are now talking with Red Hat and have sent them a vmcore dump, and they have requested another dump to do some comparisons but we have not got another successful dump yet.

Do your servers run with the SMP kernel? Have you tried running them with the non-SMP kernel? Have you tried disabling hyper-threading?
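
Both questions can be answered from a shell on the box. This is a minimal sketch, assuming Red Hat's convention of an "smp" suffix on the kernel release string and a standard /proc/cpuinfo; the function names are made up for illustration.

```shell
# Return 0 if a kernel release string (as printed by `uname -r`)
# carries Red Hat's "smp" suffix.
is_smp_kernel() {
    case "$1" in
        *smp) return 0 ;;
        *)    return 1 ;;
    esac
}

# Count logical processors listed in a cpuinfo file (defaults to
# /proc/cpuinfo). On single-core Xeons with hyper-threading enabled,
# each physical CPU appears as two entries.
count_logical_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Usage:
#   is_smp_kernel "$(uname -r)" && echo "SMP kernel"
#   count_logical_cpus
```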

Thanks.
Dan

Andre ten Bohmer
Occasional Advisor

Re: Red Hat and Proliant lock up's

All servers are indeed running the SMP kernel. We once tried without HT, but to no avail. Running without SMP is not an option because these are all production servers which depend on enough processing power. But we have dozens of servers with the same configuration running more than stable. Last week an HP technician witnessed a server "hang up" and he was stunned. He modified some BIOS settings: he disabled USB and set the interrupt for both NICs to the same value, so let's see what happens now. HP is now in contact with Red Hat regarding this problem, so let's be patient for a few days. This problem is a show stopper for moving Oracle from OpenVMS to Linux on HP, so the pressure is on.
Andre
danny_76
New Member

Re: Red Hat and Proliant lock up's

Were the problem servers bought near the same time? Similarly, we have the same servers bought in Nov 2004, without any issues, but the servers we bought in ~ March 2004 are having this issue.
Andre ten Bohmer
Occasional Advisor

Re: Red Hat and Proliant lock up's

No, sorry, there is no connection to a particular batch of servers. We experienced hang-ups on dl-380-g2, dl-380-g3, dl-360-g3 and ml-530 servers. Some servers became stable after a firmware upgrade (like the ml-530 and some dl-380-g2's), others still go down.
Rob Leadbeater
Honored Contributor

Re: Red Hat and Proliant lock up's

Hi Andre,

Are all the machines that are locking up running Oracle ? What version ? App Server or Database ?

We've seen similar issues on a number of DL380 G3's all running RHEL 3, and Oracle Application Server 10g (9.0.4.0.0)

Every now and then the machines will just lock up - they'll normally drop off the network, but occasionally they'll stay on the network but they can't be logged into, either on the console or via SSH.

If I use iLO to look at the X console, I see the time at which the lock up happened, but there's no response from the keyboard or mouse.

It's very frustrating to say the least!

Cheers,

Rob
Andre ten Bohmer
Occasional Advisor

Re: Red Hat and Proliant lock up's

Hi Rob,
Some of them are running Oracle Database Server (8.1.7) but others only run Apache or Amavis/SpamAssassin; some are connected to an MSA1000 SAN, others just to DAS, so for us there is no clear lead. When a hang-up occurs, the console is totally black and there is no network connectivity (no ssh, sometimes a ping is possible).
Thanks and cheers,
Andre
Don_89
Trusted Contributor

Re: Red Hat and Proliant lock up's

We've had the exact same problems. The server locks up and doesn't respond to anything, except that you can still ping it. All of the servers are running Oracle.

The problem turned out to be the HP Insight Manager agents causing this. I disabled the agents and haven't had a lockup since.

I have an open case with HP for the past 3 months but the technician basically gave up trying to figure out the problem..

Prakash Velayutham
New Member

Re: Red Hat and Proliant lock up's

Hi Andre,

Could you please let me know if these issues have been resolved for you? I am having the same issues with a HP ProLiant DL380 G4 and a G3 server.

G4 server had freezing issues about 6 months back. Then it disappeared for a while. Now beginning last week I have already had 3 freezing incidents. No response from keybd / mouse, ping works but not SSH, etc. Thru' RiLO I can get a console, but can't use my keybd / mouse. I am running SuSE Pro 9.3. No non-HP parts in the system.

G3 system used to have the freezing issue quite regularly till about 2 - 3 months back, but has not happened since. Don't know when it will start again.

The strangest thing is I have another G4 server that has been rock-solid (touch wood when I say this) for the past about 8 months. All the servers are running the same OS version.

TIA,
Prakash
Andre ten Bohmer
Occasional Advisor

Re: Red Hat and Proliant lock up's

Hi Prakash,

Problems seem to be solved, but it never became clear what was causing these lock-ups. There is undoubtedly a relation between kernel version and HP firmware. On a dl-360-g3 with Red Hat EL AS 2.1, the lock-ups disappeared when we installed kernel 2.4.9-e.59smp (the firmware was already up to date). Since March 11th 2005, all systems have been running smoothly. The last lock-up was on a dl-380-g3 running Red Hat EL AS 3, with the 2.4.21-32-0.1.ELsmp kernel at that time. The SA 5i controller firmware was not up to date, and after upgrading it from 2.38 to 2.58 the lock-ups also disappeared from this server. But with every new kernel released by Red Hat, like now with U7, I cross my fingers, because neither HP nor Red Hat explained or could explain to us what was causing the mayhem.

Cheers,
André
Prakash Velayutham
New Member

Re: Red Hat and Proliant lock up's

I have one other question. Did you notice anything with your hard disks during the system freezing scenario? My servers are configured this way.

SCSI ID 0 and 1 have been configured as a RAID 1 mirrored volume and host the following partitions:
/
/boot
swap
/usr/local

SCSI IDs 2, 3, 4 and 5 have been configured as a RAID 0 volume (no loss of space as we have an enterprise backup that takes care of data backup in case of disk going bad) and this hosts /home partition.

What I see during freezing is that disks 0 and 1 are totally busy (green LED going crazy on these 2 disks) for some reason, and my guess is that this is why the system does not respond to any other requests.

Do you see any of these?

Thanks,
Prakash
Andre ten Bohmer
Occasional Advisor

Re: Red Hat and Proliant lock up's

For the system partitions (like /, /boot etc.) we use RAID 1/mirroring, and during lock-ups there were no disk lights going crazy. One thing was consistent on every lock-up: all disk activity LEDs were burning green without blinking, just burning green. If I recall correctly, this was the case for all disks on all logical drives (RAID1, RAID10 or RAID5).
Cheers,
André