HPE 9000 and HPE e3000 Servers

rp5470 random halts

 
SOLVED
Timo J
Frequent Advisor

Can anyone interpret the tombstone information for the following problem?

One of our rp5470 servers halts randomly every 1-3 weeks. There is nothing in syslog and nothing in /var/adm/crash (maybe because RAM is 6 GB and /var has only 2.5 GB of free space). The only clue that might help is the tombstone file. The latest tombstone is attached.
16 REPLIES
Torsten.
Acclaimed Contributor

Re: rp5470 random halts

There is nothing attached.

Hope this helps!
Regards
Torsten.

Timo J
Frequent Advisor

Re: rp5470 random halts


Uhm sorry. Now with better luck.
Matti_Kurkela
Honored Contributor
Solution

Re: rp5470 random halts

Have you checked the GSP logs?

If you have a LAN cable connected to the GSP, log on to the GSP over the network, type the Ctrl-E c f key sequence, and then press Ctrl-B to reach the GSP prompt.

If you're using the local serial console, just press Ctrl-B.

There are various GSP commands that might be useful in this case:

- PS (Power Status) can be used to verify that all your system's PSUs and fans are OK.

- SL displays hardware-level error logs, one message at a time. Type SL, then E for Error log, press Enter for no filtering and then view the messages one at a time.

- CL displays the console log, i.e. what has been sent to the console terminal in recent past. It's most useful when viewed immediately after the crash/reboot: if the kernel displayed a panic message, you can review it. Note that the console log is a ring buffer: when new data is added by any actions on the system console, the oldest data is thrown away.
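
As a rough illustration of the ring-buffer behaviour described for CL, here is a minimal Python sketch. The buffer size and the console lines are made up for the example; the real size of the GSP console log is not stated here.

```python
from collections import deque

# Hypothetical capacity, chosen only to keep the example short.
CONSOLE_LOG_LINES = 4

# A deque with maxlen behaves like a ring buffer: appending to a
# full deque silently discards the oldest entry.
console_log = deque(maxlen=CONSOLE_LOG_LINES)


def console_write(line):
    """Record one line of console output."""
    console_log.append(line)


# Made-up console traffic: a panic message followed by later activity.
for line in ["panic: example message", "login:", "ls", "df -k", "uptime"]:
    console_write(line)

# The panic message has already been pushed out by newer lines.
print(list(console_log))
```

This is why the console log is worth viewing before doing anything else on the console after a crash.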

MK
F Verschuren
Esteemed Contributor

Re: rp5470 random halts

I am not a hardware engineer, but what I do know is that if your ts99 looks like this, you will need a hardware engineer...

Going through the ts?? file, I see 4 processors, and three of them have no valid timestamp. Do you normally use all 4 CPUs?

mits
Respected Contributor

Re: rp5470 random halts

Your TS file has the timestamp Mon Sep 17 03:33:03 GMT 2007. It is too old, so I assume your system did not halt with an HPMC. As Matti Kurkela suggested, you need to capture the hardware logs to understand your problem. Can you also describe how the system halted? Did the system just halt? Or was its DC power shut down?
Bill Hassell
Honored Contributor

Re: rp5470 random halts

There won't be anything in syslog for system crashes or power failures because no OS is running at that moment. Look at /etc/shutdownlog for some clues. When you say "halts", do you mean that the machine stops running? Check the lights and the GSP/MP port to see if it has really halted. Or do you mean that it stopped and rebooted automatically? In that case, a system crash may have taken place. Check /etc/rc.config.d/savecrash to make sure SAVECRASH=1 is uncommented. It does not matter if /var is too small: with SAVECRASH=1, a partial dump will be created in /var/adm/crash.
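
The SAVECRASH check can be sketched in Python roughly as follows. The function name and the sample file contents are hypothetical; the sketch only assumes that the file uses plain shell-style NAME=value lines with # comments.

```python
import re


def savecrash_enabled(config_text):
    """Return True if an uncommented SAVECRASH=1 assignment is present."""
    enabled = False
    for raw in config_text.splitlines():
        line = raw.strip()
        # Skip blank lines and lines that are commented out.
        if not line or line.startswith("#"):
            continue
        match = re.match(r"SAVECRASH=(\S+)", line)
        if match:
            # The last assignment wins, as it would in a sourced shell file.
            enabled = match.group(1).strip('"') == "1"
    return enabled


# Made-up file contents for illustration.
commented_out = "# SAVECRASH=1\n"
switched_on = "# Enable saving of crash dumps\nSAVECRASH=1\n"

print(savecrash_enabled(commented_out))
print(savecrash_enabled(switched_on))
```

On the server itself, the text would come from reading /etc/rc.config.d/savecrash instead of the sample strings.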


Bill Hassell, sysadmin
Timo J
Frequent Advisor

Re: rp5470 random halts

Yep, ts99 was a little bit old. The timestamp of the file itself got updated when I powered up the system.

The last entries in the GSP error log are attached. It looks like there were some problems with DC power, though the system was up and running for at least a couple of hours after 16:17:29, when that DC error was logged.

In fact, the system didn't halt (cleanly) as I said in my initial post. It just went down so fast that it couldn't write anything to syslog etc. As Mits thought, it might have been just a DC power shutdown. So the next job is to identify the faulty part...

Steve Post
Trusted Contributor

Re: rp5470 random halts

I've had this happen. The UPS is running fine. It is set up to periodically test the batteries. But the batteries are dead. So when it tests those, the system crashes. To add insult to injury it does not complain about the dead batteries, even though that was the purpose of the test. (an old K570)

Here's another one, the UPS is running fine. The cable between the UPS and the server doesn't run so hot. When the serial connection drops, so does the power. (this one was from the early 1990's).

Third one. The UPS and server are fine. I plug in a laptop into the UPS with a serial cord. The UPS doesn't like the serial cord you use. It powers off and takes the server with it. (this happened to me 3 months ago....ow).
Michael Steele_2
Honored Contributor

Re: rp5470 random halts

Dear Mikko:

There are a lot of places to look when you have a panic. Tombstones are only one of them; they are there for recording HPMCs, which are CPU-related problems. You don't have any HPMCs.

/etc/shutdownlog will give you an idea of whether the panic was O/S or HW related. So paste the last lines of this file for evaluation.

GSP error logs can be the most valuable. Another responder has already gone over this. This is usually where most admins start. You're looking for ALERT levels, i.e., ALERT LEVEL 10, 12, etc. Get this data.

There are also the /etc/opt/resmon logs. You should check these as well: persistence, register, etc. There are half a dozen that you should cross-reference against the timestamp of the panic.
Support Fatherhood - Stop Family Law
Matti_Kurkela
Honored Contributor

Re: rp5470 random halts

Note that the timestamps in the GSP log are in UTC timezone.

So the loss of power would have happened at 16:17:29 UTC = 18:17:29 Finnish local time.
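
The same conversion as a small Python sketch. It assumes a fixed UTC+2 offset, which is what the conversion above uses; during daylight saving time the Finnish offset would be UTC+3 instead. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC+2 offset, as in the conversion above.
FINNISH_OFFSET = timezone(timedelta(hours=2))


def gsp_to_local(hhmmss):
    """Convert a GSP log time of day (UTC) to local time of day."""
    utc_time = datetime.strptime(hhmmss, "%H:%M:%S").replace(tzinfo=timezone.utc)
    return utc_time.astimezone(FINNISH_OFFSET).strftime("%H:%M:%S")


print(gsp_to_local("16:17:29"))  # prints 18:17:29
```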

MK
Timo J
Frequent Advisor

Re: rp5470 random halts

Summary:

- Matti: OK with the timestamps; the last entry in syslog was at 18:00:43.

- /etc/shutdownlog: the last entry was an intentional reboot a few months ago.

- /var/tombstones: nothing relevant

- syslog: nothing relevant

- /etc/opt/resmon/log: nothing relevant

- GSP errors about DC voltage

And to clarify, the system didn't halt the way it does with 'shutdown -h'; instead, it just died as fast as if I'd pulled the power cords out.

This system is not high priority, so it's not under an HP support contract. That's why I'm trying to solve the problem here. But now it's starting to look like I'll have to call HP to do some HW diagnostics.
Matti_Kurkela
Honored Contributor

Re: rp5470 random halts

You kind of said it yourself - "it just died as fast as if I'd pulled the power cords off".

Have you already excluded the possibility of a power black-out? Maybe a circuit breaker was tripped, then reset? Or if you had an electrician working on-site, maybe there was a little "oops"...?

The GSP processor is not particularly fast - it does not need to be. It gets information from diagnostic buses, which may have a very low data rate. The GSP has NVRAM and some capacitors (or a coin-cell battery) that allow it to store a message about loss of power. But if the entire server around the GSP suddenly loses power, the GSP's internal power is not going to be enough to query what's happening on the AC side of the PSUs.

The GSP error message indicates that the DC power inside the server was not at the proper level to keep the machine running. As a rp5470 has multiple PSUs, one would not expect all of them to fail simultaneously unless there is something wrong with the incoming AC power.

When just one of the power cords of a rp54xx series server is disconnected, the resulting GSP log message looks like this:

ALERT LEVEL: 6 = Boot possible, pending failure - action required

SOURCE: 4 = power
SOURCE DETAIL: 4 = high voltage DC power SOURCE ID: 0
PROBLEM DETAIL: A = failed or disconnected

CALLER ACTIVITY: 4 = monitor STATUS: F
CALLER SUBACTIVITY: 04 = low voltage power supply
REPORTING ENTITY TYPE: 2 = power monitor REPORTING ENTITY ID: 00
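
If the GSP log has been captured to a text file, the alert levels can be pulled out with a few lines of Python. This is only a sketch: the function name is hypothetical, the sample text is copied from the entry above, and it assumes the level is printed as a decimal number.

```python
import re

# Matches lines such as "ALERT LEVEL: 6 = Boot possible, ...".
ALERT_RE = re.compile(r"ALERT LEVEL:\s*(\d+)\s*=\s*(.+)")


def alert_levels(log_text):
    """Return (level, description) tuples for each ALERT LEVEL line."""
    return [
        (int(match.group(1)), match.group(2).strip())
        for match in ALERT_RE.finditer(log_text)
    ]


sample = """ALERT LEVEL: 6 = Boot possible, pending failure - action required

SOURCE: 4 = power
PROBLEM DETAIL: A = failed or disconnected
"""

entries = alert_levels(sample)
print(entries)
```

Filtering the result for levels at or above a chosen threshold is then a one-line list comprehension.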

MK
Michael Steele_2
Honored Contributor

Re: rp5470 random halts

A) Regarding no entry in /etc/shutdownlog: 90% of the time this indicates a HW problem and not an O/S problem.

B) To check your power module status, use 'PS' from the GSP; for the system status, use 'SS', also from the GSP.

PS : Power Status- display the status of the Power Management Module
This command displays on the console the status of the power management module.
The firmware revision listed is the power management module firmware.

SS : System Status of proc.

Support Fatherhood - Stop Family Law
Timo J
Frequent Advisor

Re: rp5470 random halts


There's another rp5470 connected to the same power source as the problem host. It has been running OK for at least two months and has never had the same kind of problems. So I think a power black-out is out of the question in this case.

Also, the SS and PS reports on the GSP are OK.
Michael Steele_2
Honored Contributor

Re: rp5470 random halts

Well, this is odd. You're missing something. Next time it happens, though, take a crash dump, run it through q4 and send it to HP for analysis.

savecrash -rf /dir

Since you're not getting any automatic crash dumps, run the above command from the command line to save whatever has not been overwritten. Also verify that you're set up for dumping properly:

lvlnboot -d

crashconf

Don't know what else to suggest. Sorry.
Support Fatherhood - Stop Family Law
Steve Post
Trusted Contributor

Re: rp5470 random halts

I'll say it again.... U. P. S. Actually the batteries in the UPS.