
Rp2470 Keeps rebooting

 
TYP3R
Frequent Advisor

Hi there

We have an rp2470 server that keeps rebooting. There are no messages in syslog.log, rc.log or dmesg that indicate what causes the reboots, but I have the GSP output, which may help. To me it looks like the processor is at fault; can you gurus confirm whether that is the case? The output is:

Log Entry # 1 :
SYSTEM NAME: dvdb01-web
DATE: 06/24/2008 TIME: 10:01:17
ALERT LEVEL: 2 = Non-Urgent operator attention required

SOURCE: 0 = unknown, no source stated
SOURCE DETAIL: 0 = unknown, no source stated SOURCE ID: FF
PROBLEM DETAIL: 0 = no problem detail

CALLER ACTIVITY: 6 = machine check STATUS: 2
CALLER SUBACTIVITY: 51 = implementation dependent
REPORTING ENTITY TYPE: 0 = system firmware REPORTING ENTITY ID: 00

0x0000002000FF6512 00000000 00000000 type 0 = Data Field Unused
0x5800082000FF6512 00006C05 180A0111 type 11 = Timestamp 06/24/2008 10:01:17
Type CR for next entry, - CR for previous entry, Q CR to quit.




Log Entry # 2 :
SYSTEM NAME: dvdb01-web
DATE: 06/24/2008 TIME: 08:14:50
ALERT LEVEL: 13 = System hang detected via timer popping

SOURCE: 1 = processor
SOURCE DETAIL: 1 = processor general SOURCE ID: 0
PROBLEM DETAIL: 4 = timeout

CALLER ACTIVITY: F = display_activity() update STATUS: 0
CALLER SUBACTIVITY: 00 = implementation dependent
REPORTING ENTITY TYPE: E = HP-UX REPORTING ENTITY ID: 00

0x78E000D41100F000 00000003 0000000A type 15 = Activity Level/Timeout
0x58E008D41100F000 00006C05 18080E32 type 11 = Timestamp 06/24/2008 08:14:50





Log Entry # 7 :
SYSTEM NAME: dvdb01-web
DATE: 06/23/2008 TIME: 11:21:51
ALERT LEVEL: 12 = Software failure

SOURCE: 1 = processor
SOURCE DETAIL: 1 = processor general SOURCE ID: 0
PROBLEM DETAIL: 0 = no problem detail

CALLER ACTIVITY: B = system panic STATUS: 0
CALLER SUBACTIVITY: 00 = implementation dependent
REPORTING ENTITY TYPE: E = HP-UX REPORTING ENTITY ID: 01

0xA0E010C01100B000 00000000 000005E9 type 20 = major change in system state
0x58E018C01100B000 00006C05 170B1533 type 11 = Timestamp 06/23/2008 11:21:51
Type CR for next entry, - CR for previous entry, Q CR to qui
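
The two hex words on each "type 11 = Timestamp" line appear to encode the date and time printed next to them. The sketch below decodes them under an assumed byte layout that is inferred only from the three entries above, not from HP documentation: the first word is 00 00 YY MM (years since 1900, months counted from 0) and the second is DD hh mm ss. The function name decode_gsp_timestamp is made up for this illustration.

```python
from datetime import datetime


def decode_gsp_timestamp(word1: str, word2: str) -> datetime:
    """Decode the two 32-bit hex words of a 'type 11 = Timestamp' line.

    Assumed layout (inferred from the entries in this thread):
      word1 = 00 00 YY MM  (YY = years since 1900, MM = month, 0-based)
      word2 = DD hh mm ss
    """
    w1 = bytes.fromhex(word1)
    w2 = bytes.fromhex(word2)
    year = 1900 + w1[2]
    month = w1[3] + 1
    day, hour, minute, second = w2
    return datetime(year, month, day, hour, minute, second)


# Data words copied from Log Entries 1, 2 and 7 above.
samples = [
    ("00006C05", "180A0111"),
    ("00006C05", "18080E32"),
    ("00006C05", "170B1533"),
]

for w1, w2 in samples:
    stamp = decode_gsp_timestamp(w1, w2)
    print(w1, w2, "->", stamp.strftime("%m/%d/%Y %H:%M:%S"))
```

Under that assumption the three decoded values match the DATE and TIME fields of the corresponding entries, which is a quick way to check that a pasted log line has not been garbled.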

Thanks

William
Stefan Stechemesser
Honored Contributor

Re: Rp2470 Keeps rebooting

Hi,

Log Entry #7 means "Software failure"; a dump should have been written to /var/adm/crash. If not, I would strongly suggest examining the console log (on the GSP, use the "cl" command) and looking for the panic string, which helps to determine the cause.

Log Entry #2 is a "hang", which is logged if the OS does not send a heartbeat to the GSP for some time. It usually indicates that the system hung or crashed.

Log Entry #1 means that a TOC (transfer of control) has happened. Either someone pressed the TOC button or "tc" was issued on the MP (some software, such as Serviceguard, also issues a TOC in a hang situation).

=> All of this looks more like a software system panic or hang (which can of course be caused by a hardware problem such as a bad root disk).

Check /var/adm/crash, /etc/shutdownlog and the console log with the GSP "cl" command to find out what happened here.
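
When the GSP log holds many entries, it can help to reduce each one to the few fields discussed above before reading them in order. The following is a minimal sketch that assumes the log has been captured as plain text in the layout pasted in the first post; the function name summarize_gsp_log and the choice of fields are illustrative only.

```python
import re

# One pattern per field of interest, written against the pasted layout.
FIELDS = {
    "entry": re.compile(r"Log Entry #\s*(\d+)"),
    "date": re.compile(r"DATE:\s*(\S+)"),
    "time": re.compile(r"TIME:\s*(\S+)"),
    "alert": re.compile(r"ALERT LEVEL:\s*(.+)"),
    "activity": re.compile(r"CALLER ACTIVITY:\s*(.+?)\s+STATUS:"),
}


def summarize_gsp_log(text):
    """Return one dict per 'Log Entry #' block found in the captured text."""
    rows = []
    # Split in front of every entry header so each chunk holds one entry.
    for chunk in re.split(r"(?=Log Entry #)", text):
        if not chunk.startswith("Log Entry #"):
            continue
        row = {}
        for name, pattern in FIELDS.items():
            match = pattern.search(chunk)
            row[name] = match.group(1).strip() if match else None
        rows.append(row)
    return rows


# Lines copied from the three entries in the first post.
SAMPLE = """\
Log Entry # 1 :
DATE: 06/24/2008 TIME: 10:01:17
ALERT LEVEL: 2 = Non-Urgent operator attention required
CALLER ACTIVITY: 6 = machine check STATUS: 2
Log Entry # 2 :
DATE: 06/24/2008 TIME: 08:14:50
ALERT LEVEL: 13 = System hang detected via timer popping
CALLER ACTIVITY: F = display_activity() update STATUS: 0
Log Entry # 7 :
DATE: 06/23/2008 TIME: 11:21:51
ALERT LEVEL: 12 = Software failure
CALLER ACTIVITY: B = system panic STATUS: 0
"""

for row in summarize_gsp_log(SAMPLE):
    print(row["entry"], row["date"], row["time"], "|", row["alert"], "|", row["activity"])
```

Sorting the resulting rows by date and time gives the sequence described above: the panic on 06/23, then the hang and the TOC on 06/24.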
TYP3R
Frequent Advisor

Re: Rp2470 Keeps rebooting

OK. The server doesn't keep rebooting as I mentioned earlier. What it does is hang when it runs applications. So we ran STM on memory/CPU, and then it crashed; all we could do was TOC it. I've attached the rc.log, shutdownlog and stm_info_log. If you can look through them and tell me which hardware or software is the cause, that would be great.

William
TYP3R
Frequent Advisor

Re: Rp2470 Keeps rebooting

The rc, shutdown and STM logs are all in one file. In the STM log there is no valid timestamp on either CPU. Does this mean that both CPUs are at fault, since neither has a valid timestamp?

Thanks

William
Andrew Rutter
Honored Contributor

Re: Rp2470 Keeps rebooting

hi,

It's also worth checking the installed GSP version, as there were issues with timer popping and systems rebooting with the earlier versions.

The updates seemed to fix this, and there is a reference to it in the .txt file for the patches.

Post your version and whether it is an A, B or C revision GSP.

Log in to the GSP and type he; the version is listed at the top of the page.

Andy
TYP3R
Frequent Advisor

Re: Rp2470 Keeps rebooting

Here's the revision of the GSP:

Hardware Revision A0 Firmware Revision C.02.14

William
Stefan Stechemesser
Honored Contributor
Solution

Re: Rp2470 Keeps rebooting

Hi,

"no valid timestamp" in the STM information on a CPU means that no HPMC (high priority machine check) has happened on that CPU since it was installed. An HPMC could be a direct hint of a hardware problem, but not necessarily on that CPU. The logs would have to be analyzed by HP support.

The /etc/shutdownlog shows that you had several system panics ("Software Failure").
A "panic" means that the hardware (firmware) did not see any error, but the operating system (HP-UX) found something unexpected and uncorrectable. The panic is normally followed by a memory dump ("man savecrash") and finally a reset.

10:00 Tue May 27 2008. Reboot after panic: Data page fault
15:05 Tue May 27 2008. Reboot after panic: Break instruction trap
16:08 Fri Jun 20 2008. Reboot after panic: Illegal instruction trap
12:34 Mon Jun 23 2008. Reboot after panic: Break instruction trap

The panic string is not very specific and someone would have to analyze the memory dumps under /var/adm/crash to find the cause of the System Panic.
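
To see at a glance how often each panic string occurs, the "Reboot after panic" lines can be tallied. This is a rough sketch that assumes the line format shown in the four entries above and an English locale for the day and month names; the function name parse_panics is made up for the example, and a real /etc/shutdownlog also contains other kinds of lines, which the pattern simply skips.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
#   10:00 Tue May 27 2008. Reboot after panic: Data page fault
LINE_RE = re.compile(
    r"(\d{1,2}:\d{2} \w{3} \w{3} +\d{1,2} \d{4})\.\s+Reboot after panic:\s*(.+)"
)


def parse_panics(lines):
    """Return (datetime, panic string) pairs for every panic line found."""
    panics = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a panic line
        stamp = " ".join(match.group(1).split())  # collapse padded spaces
        when = datetime.strptime(stamp, "%H:%M %a %b %d %Y")
        panics.append((when, match.group(2).strip()))
    return panics


# The four lines quoted from /etc/shutdownlog in this reply.
SHUTDOWNLOG = [
    "10:00 Tue May 27 2008. Reboot after panic: Data page fault",
    "15:05 Tue May 27 2008. Reboot after panic: Break instruction trap",
    "16:08 Fri Jun 20 2008. Reboot after panic: Illegal instruction trap",
    "12:34 Mon Jun 23 2008. Reboot after panic: Break instruction trap",
]

panics = parse_panics(SHUTDOWNLOG)
for reason, count in Counter(reason for _, reason in panics).most_common():
    print(count, reason)
```

Three different panic strings across four panics fit the point made above: the strings alone do not single out one cause, so the dumps still need to be analyzed.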

If you are really able to reproduce this problem with the STM exerciser, then it is possible that a miscalculating CPU is responsible for the panics (this is only one of many possible root causes).
To verify this, you could disable one of the CPUs (reboot the server and disable it in the BCH configuration menu; the CPU will be disabled after a second reset). If the exerciser then runs fine, you know that this CPU was causing the problem. But this is only possible if
a) you have two CPUs
b) the problem is fully reproducible with STM
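
The elimination test above can be written down as a small decision rule. This is only an illustration of the reasoning, assuming the exerciser is run once with each CPU disabled in turn and the outcome is noted by hand; the function name and the result wording are hypothetical.

```python
def interpret_isolation(ran_clean_with_cpu_disabled):
    """Interpret exerciser runs made with one CPU disabled at a time.

    ran_clean_with_cpu_disabled maps a CPU id to True if the STM exerciser
    ran without a crash while that CPU was disabled.
    """
    clean = [cpu for cpu, ok in ran_clean_with_cpu_disabled.items() if ok]
    if len(clean) == 1:
        return f"suspect CPU {clean[0]}: the exerciser only runs clean with it disabled"
    if not clean:
        return "not isolated: the failure persists whichever CPU is disabled, so look beyond the CPUs"
    return "inconclusive: the exerciser ran clean in more than one configuration, so the problem may not be fully reproducible"


# Example: the crash disappears only when CPU 1 is disabled.
print(interpret_isolation({0: False, 1: True}))
```

The second and third branches correspond to the caveats above: the test says something about a CPU only when the failure is fully reproducible and goes away with exactly one CPU removed.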

I would strongly suggest to open a case with HP support for a dump analysis.

best regards

Stefan
TYP3R
Frequent Advisor

Re: Rp2470 Keeps rebooting

Hi Gurus

OK. I have a tombstone from the server mentioned above, and it seems that there's no timestamp for the CPU. I was told that a missing timestamp on a CPU would indicate that the CPU is at fault. Can anyone confirm this?

Thanks

William
Stefan Stechemesser
Honored Contributor

Re: Rp2470 Keeps rebooting

Hi William,

"no valid timestamp" is the normal status you see on a CPU that has never had an HPMC.

I think whoever told you this was strange meant that it hints at a defective CPU if, after an HPMC, ONLY ONE CPU has "no valid timestamp" while all the others have current timestamps.
This is the only case where "no valid timestamp" could be a hint that the CPU is faulty (for example, because it was halted before the HPMC happened, causing a timeout or CPU bus error in a transaction).
But even in this case, the hint has to be verified (for example: was this CPU replaced after the HPMC, which would cause the same output? Was it really a timeout or CPU bus error?).

In your example files, both CPUs have no valid timestamp, which proves that no HPMC has happened.

=> The tombstone file can be ignored. Concentrate on the memory dump.

Unfortunately, the missing HPMC does not mean that the CPUs are OK. Some errors (register errors, for example) are not detected by firmware (except in self-tests or CPU diagnostic tools) and may cause system panics like the ones you have, but this happens only in rare cases. Without a dump analysis we cannot say what really happened.
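
The rule for reading "no valid timestamp" can be summarized as a short sketch. It assumes the per-CPU HPMC timestamps have been copied by hand from the tombstone or STM output, with None standing for "no valid timestamp"; the function name and the returned wording are illustrative, not output of any HP tool.

```python
from typing import Dict, Optional


def interpret_hpmc_timestamps(timestamps: Dict[int, Optional[str]]) -> str:
    """Apply the rule described above to per-CPU HPMC timestamps.

    None means the CPU shows 'no valid timestamp'.
    """
    missing = [cpu for cpu, stamp in timestamps.items() if stamp is None]
    if len(missing) == len(timestamps):
        return "no HPMC has happened: ignore the tombstone, analyze the memory dump"
    if not missing:
        return "an HPMC was logged on every CPU: have the HPMC logs analyzed"
    if len(missing) == 1:
        return f"hint only: CPU {missing[0]} logged nothing during the HPMC, verify before blaming it"
    return "an HPMC was logged on some CPUs: have the HPMC logs analyzed"


# The situation in this thread: neither CPU has a valid timestamp.
print(interpret_hpmc_timestamps({0: None, 1: None}))
```

Note that the first result says only that no HPMC was recorded; as explained above, it does not prove that the CPUs are healthy.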

best regards

Stefan
TYP3R
Frequent Advisor

Re: Rp2470 Keeps rebooting

One of the memory modules was faulty; that's why the server kept crashing when the exerciser was run.