HPE 9000 and HPE e3000 Servers

Sudden Halt on RP4440

 
cam9269
Regular Advisor


Hi Guys,

I need help working out what happened to one of our servers: it suddenly halted, according to the VFP. The server was not accessible via telnet/ssh. Luckily the MP port was still running, so we were able to start the server up from there and view some logs via SL, but all we got was the following:

----------------------------------------------
143 BMC 2 0x2047C49DA00209A0 0783A37000120300 Type-02 127003 1208323
26 Feb 2008 23:15:44
142 OS 3 *3 0x76800C6803E00980 00000000000005E9 PAT_DATA_FIELD_WARNING
26 Feb 2008 08:46:00
141 OS 3 *3 0x78800C6203E00960 A0E038C01100B000 PAT_ENCODED_FIELD_WARNING
26 Feb 2008 08:46:00
===============================================

When the system booted up I checked /var/tombstones/ts99, which contained the following error messages:

------------ I/O Module Error Log Information ------------

IO Subsystem Log Entries

Found 1 PCI Comp error
Found 1 PCI Bus error
------------------------------------------

Detail display of IO subsystem log entries
------------------------------------------

PCI Component Error information

PCI Component Error 1
--- Section Header ---
GUID
data1 0xe429faf6
data2 0x3cb7
data3 0x11d4
data4 0xbc a7 0 80 c7 3c 88 81
REVISION 0x0200
ERROR_RECOVERY_INFO 0x80
SECTION_LENGTH 0x00000188
VALIDATION_BITS 0x0000000000000023
PCI_COMP_ERROR_STATUS 0x0000000000592000
PCI_COMP_INFO 0x0000000000000000 0x164514e400001500
Vendor Id/Device Id: 0x1645/14e4
Base Class/Sub Class/Program Interface: 0x15/0/0
Segment/Bus/Device/Function: 0x0/a0/2/0
PCI_COMP_MEM_NUM 0
PCI_COMP_IO_NUM 0
PCI_COMP_REGS_DATA_PAIR
Address Data
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
PCI_COMP_OEM_DATA_STRUCT
--- Section Header ---
GUID
data1 0x4f7d86a
data2 0x598b
data3 0x4a0a
data4 0xaa 62 ff 70 73 46 67 4d
LENGTH 232
PHYSICAL_LOCATION 0x000000ffff06ff85
REGISTRATION_NUMBER 0x0000000000000009
CONFIG_REGISTERS_DATA

Offset Size Data
0 8 0x12b00146164514e4
8 8 0x0000c02002000015
16 8 0x00000000d0000004
24 8 0x0000000000000000
32 8 0x0000000000000000
40 8 0x128a103c00000000
48 8 0x0000004000000000
56 8 0x004001eb00000000
64 8 0x0003fff900004807
0 0 0x0000000000000000
0 0 0x0000000000000000
0 0 0x0000000000000000

End of PCI Component Error Information for Error 1

End of PCI Component Error Information
PCI Bus Error information

PCI Bus Error 1
--- Section Header ---
GUID
data1 0xe429faf4
data2 0x3cb7
data3 0x11d4
data4 0xbc a7 0 80 c7 3c 88 81
REVISION 0x0200
ERROR_RECOVERY_INFO 0x84
SECTION_LENGTH 0x00000108
VALIDATION_BITS 0x00000000000007cf
PCI_BUS_ERROR_STATUS 0x0000000000b76000
PCI_BUS_ERROR_TYPE 0x0000000000000006
PCI_BUS_ID 0x00000000000000a0
PCI_BUS_ADDRESS 0x0051800039f47300
PCI_BUS_DATA 0x0000000000000000
PCI_BUS_CMD 0x0000000000000000
PCI_BUS_REQUESTOR_ID 0x0000000000a01000
PCI_BUS_COMPLETER_ID 0x00000000fed2a000
PCI_BUS_TARGET_ID 0x0051800039f47300
PCI_BUS_OEM_ID 0x0000000000b4b458
Bus OEM Data
CEC Header:
--- OEM Data Header ---

GUID
data1 0x9fe64482
data2 0xa02d
data3 0x4ef7
data4 0xad e6 c6 63 59 62 53 99

--- OEM Data Body ---

CELL_NUMBER 0
SBA_NUMBER 0
ROPE_NUMBER 5
--- Mercury Info ---
ERROR_STATUS 0x000002010000023b
ERROR_MASTER_ID_LOG 0x0000000000000008
INBOUND_ERR_ADDRESS 0x0051800039f47300
INBOUND_ERR_ATTRIBUTE 0x4000000000000000
COMPLETION_MESSAGE_LOG 0x0000000000000000
OUTBOUND_ERR_ADDRESS 0x0000000000000000
ERROR_CONFIG 0x0000000000000030
STATUS_INFO_CONTROL 0x0000000000000000
FUNC_ID 0xcab00146122e103c
CAPABILITIES_LIST 0x0f00023700200002
AGP_COMMAND 0x0000000000000000
PCIX_CAPABILITIES 0x0013ff0000010007
OLR_CONTROL 0x00003ff600022400

CLOCK_CONTROL 0x0000000000000038
BUS_MODE 0x91b974ae36d500e4

End of PCI Bus Error Information for Error 1

End of PCI Bus Error Information

FRU INFORMATION

Module Revision
------ --------
PA 8900 CPU Module 3.2
PA 8900 CPU Module 3.2
PA 8900 CPU Module 3.2
PA 8900 CPU Module 3.2

Board Info!
Format Version : 0x1 Language Code : 0x0
Mfg Date : Mfg Name : JABIL
Product Name : augustus baseboard
Serial Number : 52JAPE4448304817
Part Number : A6961-60201
Fru File Tp/Len : 0x1 Fru File : ^P
Revision : A Eng Date Code : 4442
Artwork Rev : A5 Fru Info :
===============================================

We've got the machine back up, but we need to know what caused this. If it is hardware, we want to replace it immediately, as this is a critical system.


Thanks!
12 REPLIES
Sameer_Nirmal
Honored Contributor

Re: Sudden Halt on RP4440

An HPMC may have occurred, leading to the halt, so a h/w problem could be suspected, maybe with a PCI card or bus. /var/tombstones/ts99 would need to be analyzed. Log a h/w call with HP support, since this is a critical system.
Stefan Stechemesser
Honored Contributor

Re: Sudden Halt on RP4440

Hi,

the chassis code

141 OS 3 *3 0x78800C6203E00960 A0E038C01100B000 PAT_ENCODED_FIELD_WARNING

was logged by HP-UX. The last four hex digits, B000, tell us that a system panic has happened.
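As a rough illustration of the point above, the data word of a chassis log entry can be split so that its trailing four hex digits are easy to compare. This is only a minimal sketch: the helper names `trailing_digits` and `looks_like_panic` are made up for illustration, and the only interpretation it relies on is the one stated in this reply (a trailing B000 indicating a panic). The rest of the encoding is not decoded here.

```python
# Data words copied from SL log entries 141 and 142 in the original post.
ENTRIES = {
    141: "A0E038C01100B000",
    142: "00000000000005E9",
}

# Suffix taken from the reply above, not from HP documentation.
PANIC_SUFFIX = "B000"


def trailing_digits(data_word, n=4):
    """Return the last n hex digits of a chassis-code data word, upper-cased."""
    digits = data_word.strip()
    if digits.lower().startswith("0x"):
        digits = digits[2:]
    return digits[-n:].upper()


def looks_like_panic(data_word):
    """True if the word ends in the suffix this thread associates with a panic."""
    return trailing_digits(data_word) == PANIC_SUFFIX


for number, word in ENTRIES.items():
    print(number, trailing_digits(word), looks_like_panic(word))
```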

I would strongly recommend that you check whether a dump was written to /var/adm/crash and have it analyzed by HP support. The "INDEX" file (and also /etc/shutdownlog) contains the panic string, which could help in understanding what happened.

Regarding the HPMC ts99: you only posted part of it (the I/O error log). You should check the timestamp to see whether it is really related to the system panic or happened a long time ago (at every reboot a new ts99 is written from the contents of NVRAM, which is never cleared).

The PCI bus error simply means that a parity error has happened. The physical location is PCI slot 6, hw path 0/5... .
Slots 5 and 6 are shared slots, so I would reseat those cards if the HPMC timestamp is current.
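For anyone who wants to look at the card's own registers in the posted ts99, the first CONFIG_REGISTERS_DATA word can be split by hand. The sketch below assumes that the tombstone stores configuration offsets 0x00-0x07 as one little-endian 64-bit value laid out like the standard PCI configuration header (vendor ID, device ID, Command, Status). That layout is an assumption about the dump format, not something confirmed in this thread, and the function names are invented for the example.

```python
# First CONFIG_REGISTERS_DATA word (offset 0) from the ts99 excerpt in the original post.
WORD0 = 0x12B00146164514E4

# Error-related bits of the standard PCI Status register.
STATUS_BITS = {
    8: "Master Data Parity Error",
    11: "Signaled Target Abort",
    12: "Received Target Abort",
    13: "Received Master Abort",
    14: "Signaled System Error",
    15: "Detected Parity Error",
}


def split_word0(word):
    """Split a 64-bit read of config offsets 0x00-0x07 (assumed little-endian)."""
    return {
        "vendor_id": word & 0xFFFF,
        "device_id": (word >> 16) & 0xFFFF,
        "command": (word >> 32) & 0xFFFF,
        "status": (word >> 48) & 0xFFFF,
    }


def status_flags(status):
    """Return the names of the error bits that are set in a Status value."""
    return [name for bit, name in sorted(STATUS_BITS.items()) if status & (1 << bit)]


fields = split_word0(WORD0)
for name, value in fields.items():
    print(f"{name}: 0x{value:04x}")
print("status error bits:", status_flags(fields["status"]))
```

Whatever bits this reports describe only the card's own Status register at the time of the snapshot. The bus-side error is recorded separately, in the ERROR_STATUS field under "Mercury Info".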

HP support could do a deeper analysis ...
cam9269
Regular Advisor

Re: Sudden Halt on RP4440

Hi Guys, thanks for taking time to check my post. Your inputs are appreciated.

I was able to run crashinfo on the dumps generated when the server panicked for the 2nd time, and from there it looks like CPU0 was panicking, or at least that is how I understood the crashinfo output. I just can't take this to HP at the moment because of support contract issues.

I am attaching the crashinfo output. What I did was disable this CPU and start the system with 1 for now, until the replacement parts arrive.

The thing is, I ran some tests on the CPU using 'mstm' and they all passed without errors (Exercise and Information tasks). I just wasn't able to execute 'Diagnose'; it does not run when I choose 'RUN' from its options.

Additional inputs would be very much welcome.

Thanks guys!
KVK Vijay
Occasional Visitor

Re: Sudden Halt on RP4440

Better to replace the faulty CPU 0, because of this:

========================
= Processor Clock Info =
========================

hardclock_late = 0
itick_per_tick = 9998289
lbolt = 6418887 (0x61f1c7)

     event  mpi                 rpb                 delta       clk  eiem  eirr  PSW
cpu  type   timeinval           interval timer      secs:ticks  od   0,4   0,4   I
---  -----  ------------------  ------------------  ----------  ---  ----  ----  ---
0    PANIC  0x0                 0x3a7f1fa140d5      -64328:-67  0    0 0   1 0   0    ---> check the secs:ticks field
1    TOC    0x3a7f22bea390      0x3a7f22b09845      0:0         0    1 1   0 0   1
2    TOC    0x3a7f429af924      0x3a7f42915708      0:0         0    1 1   0 0   1
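The comparison being made here (CPU 0 has a non-zero delta secs:ticks value while the others show 0:0) can be done mechanically. The sketch below simply parses the three rows quoted above and lists the CPUs whose delta is not 0:0. The function names are hypothetical, and whether a non-zero delta really points at a faulty CPU is this reply's reading, not something the code can establish.

```python
# Rows copied from the "Processor Clock Info" excerpt above (annotation removed).
ROWS = """\
0 PANIC 0x0 0x3a7f1fa140d5 -64328:-67 0 0 0 1 0 0
1 TOC 0x3a7f22bea390 0x3a7f22b09845 0:0 0 1 1 0 0 1
2 TOC 0x3a7f429af924 0x3a7f42915708 0:0 0 1 1 0 0 1
"""


def parse_rows(text):
    """Extract cpu number, event type and the delta secs:ticks pair from each row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        secs, ticks = (int(part) for part in fields[4].split(":"))
        rows.append({"cpu": int(fields[0]), "event": fields[1],
                     "secs": secs, "ticks": ticks})
    return rows


def nonzero_deltas(rows):
    """Return the CPU numbers whose delta is anything other than 0:0."""
    return [row["cpu"] for row in rows if row["secs"] != 0 or row["ticks"] != 0]


rows = parse_rows(ROWS)
for row in rows:
    print(row)
print("CPUs with non-zero delta:", nonzero_deltas(rows))
```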
cam9269
Regular Advisor

Re: Sudden Halt on RP4440

Hi Guys,

I already replaced CPU0 on Mar 4, but the same thing happened again today, and this is now causing alarm. Is there anything else I need to look at?

TIA!
cam9269
Regular Advisor

Re: Sudden Halt on RP4440

What's really alarming is that I had already disabled the panicking CPU (CPU0), but the system still crashed. This time it did not create entries in /etc/shutdownlog, nor did it create crash dumps in /var/adm/crash; we only saw the message "Halted" on the VFP. Any other thoughts, guys?


Thanks heaps!
Mridul Shrivastava
Honored Contributor

Re: Sudden Halt on RP4440

I don't think CPU 0 is the culprit here; I think the crashinfo o/p is not showing the correct information. If you take a closer look at it, it shows that the PANIC happened on CPU # -1 and TOC on the other CPUs 0, 1, 2.

Looking at the CPU hpa's, there is also a discrepancy:

cpu hpa spu_state
--- --- ---------
0 0xfffffffffe780000 SPU_ENABLED
1 0xfffffffffe781000 SPU_ENABLED
2 0xfffffffffe788000 SPU_ENABLED
3 0x0 SPU_ENABLED

See, the hpa for CPU # 3 is not shown here.
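The same check can be scripted if there are several crashinfo outputs to compare. The following is a small sketch that scans the cpu/hpa/spu_state rows quoted above and reports any CPU whose hpa is zero. The function name is made up for the example, and it only assumes the three-column layout shown in this post.

```python
# Rows copied from the crashinfo excerpt above.
TABLE = """\
0 0xfffffffffe780000 SPU_ENABLED
1 0xfffffffffe781000 SPU_ENABLED
2 0xfffffffffe788000 SPU_ENABLED
3 0x0 SPU_ENABLED
"""


def cpus_without_hpa(text):
    """Return CPU numbers whose hpa column is zero."""
    missing = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # not a cpu/hpa/spu_state row
        cpu, hpa, _state = fields
        if int(hpa, 16) == 0:
            missing.append(int(cpu))
    return missing


print(cpus_without_hpa(TABLE))
```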

I am not very sure; can you post the crashinfo -continue o/p?

Do you have another crash dump? If so, could you post the o/p of that?
Time has a wonderful way of weeding out the trivial
cam9269
Regular Advisor

Re: Sudden Halt on RP4440

Thanks for replying Mridul,

Here's a new crashinfo output from the server.
Mridul Shrivastava
Honored Contributor

Re: Sudden Halt on RP4440

Hi,

The first crash has Dump time Thu Feb 28 04:42:29 2008. After that, you mentioned that you disabled one CPU (though we still don't know which one is the culprit). But the second crash has Dump time Sat Mar 8 03:56:25 2008, and this time too I could see that all four CPUs were active:

Number of CPU's : 4
Disabled CPU's : 0

So actually none of them were disabled. I am just curious to know how you disabled them?

Still, this o/p is more or less the same as the previous one:

cpu hpa spu_state
--- --- ---------
0 0xfffffffffe780000 SPU_ENABLED
1 0xfffffffffe781000 SPU_ENABLED
2 0xfffffffffe788000 SPU_ENABLED
3 0x0 SPU_ENABLED



I could find the details for the other three CPUs, i.e. 0, 1, 2, so we are missing information for CPU# 3. I suspect that could be the culprit. Could you please run cstm on it? Who knows, we may find some trace there.

Best of luck.
Time has a wonderful way of weeding out the trivial
cam9269
Regular Advisor

Re: Sudden Halt on RP4440

Hi Mridul,

I think I owe you a better explanation regarding the series of events which took place.

(I think I made a mistake in saying this happened in January; it all actually started in February.)

Feb 27 - The server crashed with a HALTED status in the VFP interface. No crash dumps were created and no /etc/shutdownlog entries were logged; only a ts99 file was produced. We did an RS at the MP prompt to start the machine up again. At the same time, I configured 'savecrash' to get dumps working.

Feb 28 - The server rebooted, producing crash dumps in /var/adm/crash, logging "Reboot after Panic" in /etc/shutdownlog, and producing an /etc/ts99 file. From the crash dumps I produced a "crashinfo" output, from which we got the idea that CPU0 was causing the panic. Since we had to wait several days for the parts to arrive, I disabled CPU0.

Mar 4 - CPU0 was replaced and enabled again.

Mar 8, 4:21 AM - The server rebooted again with the same crashinfo entries (CPU0 panicking), so I disabled it again to mitigate any abrupt system crashes.

Mar 8, 17:21 or so - To our surprise, the server HALTED and was inaccessible again (CPU0 was already disabled at this point). This time no crash dumps were produced and no other log entries were created, except for a new ts99 file whose contents are similar to the previous ones.

Mar 9, 17:15 - The server crashed again, with the VFP showing HALTED as the system status (CPU0 was still disabled at this point). We're at a loss now as to where to look and how to resolve this.
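To make the spacing of the last three events easier to see, the elapsed time between them can be worked out directly. This is only a sketch: the year 2008 is taken from the log timestamps elsewhere in the thread, "17:21 or so" is used at face value, and the labels and function name are invented for the example. It draws no conclusion about the cause.

```python
from datetime import datetime

# Times as reported in the timeline above; year 2008 assumed from the log timestamps.
EVENTS = [
    ("reboot with crash dump", datetime(2008, 3, 8, 4, 21)),
    ("halt, no dump", datetime(2008, 3, 8, 17, 21)),
    ("halt, no dump", datetime(2008, 3, 9, 17, 15)),
]


def gaps(events):
    """Return the elapsed time between each consecutive pair of events."""
    return [later[1] - earlier[1] for earlier, later in zip(events, events[1:])]


for (label, when), gap in zip(EVENTS[1:], gaps(EVENTS)):
    print(f"{when:%b %d %H:%M}  {label}: {gap} after the previous event")
```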

Here's a cstm output; is this enough data to go on?

=============================================
Running Command File (/usr/sbin/stm/ui/config/.stmrc).

-- Information --
Support Tools Manager


Version A.53.05

Product Number B4708AA

(C) Copyright Hewlett Packard Co. 1995-2006
All Rights Reserved

Use of this program is subject to the licensing restrictions described
in "Help-->On Version". HP shall not be liable for any damages resulting
from misuse or unauthorized use of this program.

cstm>sel dev 1
cstm>infolog
-- Converting a (1524) byte raw log file to text. --
Preparing the Information Tool Log for system on path system File ...

.... MASDW01 : 10.56.128.56 ....

-- Information Tool Log for system on path system --

Log creation time: Mon Mar 10 10:52:46 2008

Hardware path: system


System Information for (MASDW01)
HPUX Model Number......: rp4440
HPUX Model String......: 9000/800/rp4440
Original Product Number: A7134BR
Current Product Number.: A7134BR
System Serial Number...: DEH4604BB2
Hversion...............: 0x8940
Sversion...............: 0x491
Software Capabilities..: 0x100000f0

=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=


CPU Information:
Number of CPUs in the system = 4

CPU Map
cpu -----------------------------------------
slot | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 |
state|d |d |caM |ca | | | | |
-----------------------------------------
| 08 | 09 | 10 | 11 | 12 | 13 | 14 | 15 |
state| | | | | | | | |
-----------------------------------------
c - Configured (CPU powered on)
d - De-configured (CPU powered off)
a - Active (configured and processes running)
i - Inactive (configured and idle)
M - Monarch CPU (always Active)
C - Marked for re-configuration (Configured after next boot)
D - Marked for de-configuration (De-configured after next boot)


=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=-+-=

Field Replaceable Unit Identification (FRUID):

=============================================

Thanks!
Mridul Shrivastava
Honored Contributor

Re: Sudden Halt on RP4440

Thanks a lot for the detailed explanation. From the o/p given I can see that you have disabled two CPUs; is that so?

One more point: I'm still not sure, but as you mentioned, you have tried disabling and replacing the CPU with no difference. Have you given any thought to the system board?

There is a possibility that the system board itself is faulty.
Time has a wonderful way of weeding out the trivial
cam9269
Regular Advisor

Re: Sudden Halt on RP4440

Actually we have 2 x dual-core CPUs; that's why you were seeing 2 disabled CPUs in cstm. It could be that the board has a problem, but I can't tell from the logs I am seeing. That's why I was asking for a deeper analysis, so that we don't go replacing parts blindly.
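For reference, the CPU Map from the cstm output posted earlier can be read against its legend with a few lines of code. This is a minimal sketch that assumes the two rows are pasted exactly as they appear in that output. The names `cells` and `cpu_states` are hypothetical and not part of STM.

```python
# Rows copied from the "CPU Map" section of the cstm infolog posted earlier.
SLOT_ROW = "slot | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 |"
STATE_ROW = "state|d |d |caM |ca | | | | |"

# Legend as printed below the map in the same output.
LEGEND = {
    "c": "configured",
    "d": "de-configured",
    "a": "active",
    "i": "inactive",
    "M": "monarch",
    "C": "marked for re-configuration",
    "D": "marked for de-configuration",
}


def cells(row):
    """Return the cell contents between the '|' separators, without the row label."""
    return [cell.strip() for cell in row.split("|")[1:-1]]


def cpu_states(slot_row, state_row):
    """Map each populated slot number to the list of states its flags stand for."""
    result = {}
    for slot, flags in zip(cells(slot_row), cells(state_row)):
        if flags:
            result[int(slot)] = [LEGEND[flag] for flag in flags]
    return result


for slot, states in cpu_states(SLOT_ROW, STATE_ROW).items():
    print(f"slot {slot:02d}: {', '.join(states)}")
```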