Re: RHEL 5.5 hung on ProLiant DL380 G6

Jdamian · ‎03-30-2011

Hi

My ProLiant DL380 G6 server (running RHEL 5.5) hung last night.

No network connection was available.

This morning I tried to log in via the remote console: I typed "root", pressed Enter and waited for a password prompt, but it never appeared. I had to restart the system.

The message shown on the console (before I tried to log in) looked like a kernel trace. It included the following strings:

wait_for_completion
default_wake_function
:cciss:start_io
:cciss:cciss_ioctl
do_lookup
dput
kobject_get
exact_lock
:cciss:do_ioctl
:cciss:cciss_compat_ioctl
blkdev_open
blkdev_open
compat_blkdev_ioctl
compat_sys_ioctl
sysenter_do_call
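
The module-qualified entries in the list above (the ones written as :module:function) can be picked out mechanically. This is only a minimal sketch: the symbol list is copied from the post, while the helper name `modules_in_trace` and the regular expression are illustrative assumptions, not part of any Red Hat or HP tool.

```python
import re
from collections import Counter

# Symbols copied from the console trace quoted in the post.
TRACE = """\
wait_for_completion
default_wake_function
:cciss:start_io
:cciss:cciss_ioctl
do_lookup
dput
kobject_get
exact_lock
:cciss:do_ioctl
:cciss:cciss_compat_ioctl
blkdev_open
blkdev_open
compat_blkdev_ioctl
compat_sys_ioctl
sysenter_do_call
""".split()

# In this trace, symbols that live in a loadable module appear as ":module:function".
MODULE_SYMBOL = re.compile(r"^:(?P<module>[^:]+):(?P<function>.+)$")


def modules_in_trace(symbols):
    """Count how many trace entries belong to each kernel module (hypothetical helper)."""
    counts = Counter()
    for symbol in symbols:
        match = MODULE_SYMBOL.match(symbol)
        if match:
            counts[match.group("module")] += 1
    return counts


if __name__ == "__main__":
    print(modules_in_trace(TRACE))
```

Every module-qualified entry in this trace belongs to cciss, the Smart Array block driver, which is why the controller is the first suspect.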

After booting up I found no error message reported in /var/log/messages.

The BIOS version is

Version: P62
Release Date: 03/01/2010
Firmware Revision: 1.81

Is there any way to get info about this issue?
Is it possible to get a crash dump in these servers?

Thanx in advance

Tim Nelson · ‎03-30-2011

Any messages in the iLO logs?

Alzhy · ‎03-30-2011

The only way to establish what happened is to enable a system watch ("HangWatch", I think it is called, from Red Hat Support) and then force a crash dump.

What it does is automate SysRq+T (system task traces).

There are knowledge base articles on Red Hat support covering how to set up your kernel dump (/etc/kdump.conf), how to capture system traces during hang situations (SysRq+T), and finally how to force a crash/kernel dump (SysRq+C).
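
When the keyboard is unavailable but a root shell still works, the same SysRq commands can be sent by writing a single character to /proc/sysrq-trigger. The following is a rough sketch, assuming root privileges and a kernel built with magic SysRq support; the function name `send_sysrq` and the `trigger_path` parameter are illustrative, and the crash command only yields a usable dump if kdump has already been configured.

```python
SYSRQ_TRIGGER = "/proc/sysrq-trigger"

# Magic SysRq command keys used in this thread.
SHOW_TASKS = "t"  # dump task states and stack traces to the kernel log
CRASH = "c"       # force a kernel crash (kdump captures a vmcore if it is set up)


def send_sysrq(key, trigger_path=SYSRQ_TRIGGER):
    """Write one SysRq command character to the trigger file (hypothetical helper).

    trigger_path is a parameter only so the function can be tried against
    an ordinary file instead of the real /proc entry.
    """
    if len(key) != 1:
        raise ValueError("SysRq command must be a single character")
    with open(trigger_path, "w") as trigger:
        trigger.write(key)
```

Calling `send_sysrq(SHOW_TASKS)` a few times during a hang, then `send_sysrq(CRASH)`, mirrors the sequence described above. Be careful with the second call: it takes the machine down immediately.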

Hakuna Matata.

Jdamian · ‎03-30-2011

Thanx guys

I didn't know about kdump.

The only message in the iLO log was recorded at reboot:

POST Error: 1719 - A controller failure event occurred prior to this power-up

In my original post I didn't explain that the server stopped logging any system activity (/var/log/messages, sar, ...) at 21:18, BUT the running Oracle daemons (pmon, dbwr, ...) then started to log weird traces, i.e., Oracle logs were updated after 21:18 while system logs weren't. As the Oracle log files reside on external Fibre Channel disks, I guess this problem was caused by the internal SCSI disk controller (the string cciss in the trace is another clue).

The HP Document "HP ProLiant Servers Troubleshooting Guide" (Part Number: 375445-401 January 2011 Edition: 10) lists the POST codes and their meanings:

1719-Slot X Drive Array - A controller failure event occurred prior to this power-up
(previous lock-up code = 0x####)
Audible Beeps: None
Possible Cause: A controller failure event occurred before the server powered up.
Action: Install the latest version of controller firmware. If the condition persists, then replace the controller.

I'm going to use the HP CLI commands to try to get the events and any info about the internal RAID and SCSI controller.

Thomas Callahan · ‎03-31-2011

I've seen this before with the CCISS RAID controllers. You need to update the firmware on your Smart Array controller to 1.84 or higher, and you won't have any more issues.

See below:

Version: 1.84 (26 Aug 2009)
Fixes
- Fixed an issue where a controller failure error 1783 is reported at POST, if a power cycle is initiated during a drive rebuild on a RAID 5 volume.
- Fixed a potential unresponsive condition related to controller lockup error code 0x83.
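
Checking an installed firmware revision against the 1.84 threshold is a plain numeric comparison of the dotted parts. A minimal sketch follows; the helper names are hypothetical. The 1.81 in the example is the "Firmware Revision" from the first post, which was listed next to the BIOS details, so treating it as the controller firmware is an assumption.

```python
def parse_version(text):
    """Turn a dotted revision string such as "1.84" into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# First release whose notes mention the lockup fix, per the post above.
MINIMUM_FIXED = parse_version("1.84")


def needs_update(installed, minimum=MINIMUM_FIXED):
    """Return True when the installed revision is older than the fixed one."""
    return parse_version(installed) < minimum


if __name__ == "__main__":
    # Assumption: 1.81 is the controller firmware reported in the first post.
    print(needs_update("1.81"))
```

Comparing tuples of integers avoids the pitfalls of comparing the raw strings, where "1.9" would sort after "1.84".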
