HPE 9000 and HPE e3000 Servers

cam9269
Regular Advisor

Sudden Halt on RP4440

Hi Guys,

I need help working out what happened to one of our servers: it suddenly halted, according to the VFP. The server was not reachable via telnet or ssh. Luckily the MP port was still up, so we were able to restart the server from there and look at some logs via SL, but we only got the following:

----------------------------------------------
143 BMC 2 0x2047C49DA00209A0 0783A37000120300 Type-02 127003 1208323
26 Feb 2008 23:15:44
142 OS 3 *3 0x76800C6803E00980 00000000000005E9 PAT_DATA_FIELD_WARNING
26 Feb 2008 08:46:00
141 OS 3 *3 0x78800C6203E00960 A0E038C01100B000 PAT_ENCODED_FIELD_WARNING
26 Feb 2008 08:46:00
===============================================

When the system booted up, I checked /var/tombstones/ts99, which contained the following error messages:

------------ I/O Module Error Log Information ------------

IO Subsystem Log Entries

Found 1 PCI Comp error
Found 1 PCI Bus error
------------------------------------------

Detail display of IO subsystem log entries
------------------------------------------

PCI Component Error information

PCI Component Error 1
--- Section Header ---
GUID
data1 0xe429faf6
data2 0x3cb7
data3 0x11d4
data4 0xbc a7 0 80 c7 3c 88 81
REVISION 0x0200
ERROR_RECOVERY_INFO 0x80
SECTION_LENGTH 0x00000188
VALIDATION_BITS 0x0000000000000023
PCI_COMP_ERROR_STATUS 0x0000000000592000
PCI_COMP_INFO 0x0000000000000000 0x164514e400001500
Vendor Id/Device Id: 0x1645/14e4
Base Class/Sub Class/Program Interface: 0x15/0/0
Segment/Bus/Device/Function: 0x0/a0/2/0
PCI_COMP_MEM_NUM 0
PCI_COMP_IO_NUM 0
PCI_COMP_REGS_DATA_PAIR
Address Data
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
PCI_COMP_OEM_DATA_STRUCT
--- Section Header ---
GUID
data1 0x4f7d86a
data2 0x598b
data3 0x4a0a
data4 0xaa 62 ff 70 73 46 67 4d
LENGTH 232
PHYSICAL_LOCATION 0x000000ffff06ff85
REGISTRATION_NUMBER 0x0000000000000009
CONFIG_REGISTERS_DATA

Offset Size Data
0 8 0x12b00146164514e4
8 8 0x0000c02002000015
16 8 0x00000000d0000004
24 8 0x0000000000000000
32 8 0x0000000000000000
40 8 0x128a103c00000000
48 8 0x0000004000000000
56 8 0x004001eb00000000
64 8 0x0003fff900004807
0 0 0x0000000000000000
0 0 0x0000000000000000
0 0 0x0000000000000000

End of PCI Component Error Information for Error 1

End of PCI Component Error Information
PCI Bus Error information

PCI Bus Error 1
--- Section Header ---
GUID
data1 0xe429faf4
data2 0x3cb7
data3 0x11d4
data4 0xbc a7 0 80 c7 3c 88 81
REVISION 0x0200
ERROR_RECOVERY_INFO 0x84
SECTION_LENGTH 0x00000108
VALIDATION_BITS 0x00000000000007cf
PCI_BUS_ERROR_STATUS 0x0000000000b76000
PCI_BUS_ERROR_TYPE 0x0000000000000006
PCI_BUS_ID 0x00000000000000a0
PCI_BUS_ADDRESS 0x0051800039f47300
PCI_BUS_DATA 0x0000000000000000
PCI_BUS_CMD 0x0000000000000000
PCI_BUS_REQUESTOR_ID 0x0000000000a01000
PCI_BUS_COMPLETER_ID 0x00000000fed2a000
PCI_BUS_TARGET_ID 0x0051800039f47300
PCI_BUS_OEM_ID 0x0000000000b4b458
Bus OEM Data
CEC Header:
--- OEM Data Header ---

GUID
data1 0x9fe64482
data2 0xa02d
data3 0x4ef7
data4 0xad e6 c6 63 59 62 53 99

--- OEM Data Body ---

CELL_NUMBER 0
SBA_NUMBER 0
ROPE_NUMBER 5
--- Mercury Info ---
ERROR_STATUS 0x000002010000023b
ERROR_MASTER_ID_LOG 0x0000000000000008
INBOUND_ERR_ADDRESS 0x0051800039f47300
INBOUND_ERR_ATTRIBUTE 0x4000000000000000
COMPLETION_MESSAGE_LOG 0x0000000000000000
OUTBOUND_ERR_ADDRESS 0x0000000000000000
ERROR_CONFIG 0x0000000000000030
STATUS_INFO_CONTROL 0x0000000000000000
FUNC_ID 0xcab00146122e103c
CAPABILITIES_LIST 0x0f00023700200002
AGP_COMMAND 0x0000000000000000
PCIX_CAPABILITIES 0x0013ff0000010007
OLR_CONTROL 0x00003ff600022400

CLOCK_CONTROL 0x0000000000000038
BUS_MODE 0x91b974ae36d500e4

End of PCI Bus Error Information for Error 1

End of PCI Bus Error Information

FRU INFORMATION

Module Revision
------ --------
PA 8900 CPU Module 3.2
PA 8900 CPU Module 3.2
PA 8900 CPU Module 3.2
PA 8900 CPU Module 3.2

Board Info!
Format Version : 0x1 Language Code : 0x0
Mfg Date : Mfg Name : JABIL
Product Name : augustus baseboard
Serial Number : 52JAPE4448304817
Part Number : A6961-60201
Fru File Tp/Len : 0x1 Fru File : ^P
Revision : A Eng Date Code : 4442
Artwork Rev : A5 Fru Info :
===============================================

We've got the machine back up, but we need to know what caused the halt. If it is a hardware fault, we want to replace the part immediately, as this is a critical system.


Thanks!
Sameer_Nirmal
Honored Contributor

Re: Sudden Halt on RP4440

An HPMC may have occurred, leading to the halt, so a hardware problem is suspected, possibly with a PCI card or bus. /var/tombstones/ts99 would need to be analyzed. Log a hardware call with HP support, since this is a critical system.
Stefan Stechemesser
Honored Contributor

Re: Sudden Halt on RP4440

Hi,

the chassis code

141 OS 3 *3 0x78800C6203E00960 A0E038C01100B000 PAT_ENCODED_FIELD_WARNING

was logged by HP-UX. The last four hex digits, B000, tell us that a system panic occurred.
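
As a rough illustration of the check described above, the following Python sketch pulls the data field out of a chassis log line and looks at its last four hex digits. The field position is an assumption based only on the 'OS' entries quoted in this thread, and the only suffix meaning included is the one stated here (B000 = system panic). The names `data_suffix` and `KNOWN_SUFFIXES` are made up for the example.

```python
# Sketch only: the field layout is assumed from the chassis log lines quoted
# in this thread, not taken from HP documentation.

# The only suffix meaning stated in this thread; others are unknown here.
KNOWN_SUFFIXES = {"B000": "system panic"}


def data_suffix(chassis_line: str) -> str:
    """Return the last four hex digits of the data field.

    Assumes the data field is the second-to-last whitespace-separated
    token, as in the quoted 'OS' entries.
    """
    tokens = chassis_line.split()
    if len(tokens) < 2:
        raise ValueError("not enough fields in chassis log line")
    data_field = tokens[-2]
    int(data_field, 16)  # raises ValueError if the token is not hex
    return data_field[-4:].upper()


line = "141 OS 3 *3 0x78800C6203E00960 A0E038C01100B000 PAT_ENCODED_FIELD_WARNING"
suffix = data_suffix(line)
print(suffix, "->", KNOWN_SUFFIXES.get(suffix, "meaning not known from this thread"))
```

Entry 142 from the same log has a different suffix, so it would fall into the "not known" branch of this sketch.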

I would strongly recommend that you check whether a dump was written to /var/adm/crash and have it analyzed by HP support. The "INDEX" file (and also /etc/shutdownlog) contains the panic string, which could help in understanding what happened.

Regarding the HPMC ts99: you posted only part of it (the I/O error log). You should check the timestamp to see whether it is really related to the system panic or happened a long time ago (at every reboot, a new ts99 is written with the contents of NVRAM, which is never cleared).

The PCI bus error simply means that a parity error occurred. The physical location is PCI slot 6, hw path 0/5... .
Slots 5 and 6 are shared slots, so I would reseat those cards if the HPMC timestamp is current.
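
For anyone who wants to identify the card from the dump itself, here is a minimal Python sketch that decodes a few rows of the CONFIG_REGISTERS_DATA table from the original post. It assumes each 8-byte value is a little-endian view of the standard PCI configuration header (vendor ID at offset 0x00, device ID at 0x02, class code at 0x09-0x0B, subsystem IDs at 0x2C-0x2F); the tombstone format is not documented in this thread, so treat the result as a hint only. The helper names are made up for the example.

```python
# Sketch only: assumes each 8-byte value in CONFIG_REGISTERS_DATA is a
# little-endian view of the standard PCI configuration header, so the low
# bits of the value at dump offset 0 hold the register at config offset 0x00.

def field(qword: int, byte_offset: int, width_bytes: int) -> int:
    """Extract width_bytes bytes starting byte_offset bytes into the qword."""
    mask = (1 << (8 * width_bytes)) - 1
    return (qword >> (8 * byte_offset)) & mask


def decode_header(regs: dict) -> dict:
    """regs maps a dump offset (0, 8, 40) to its 64-bit value."""
    return {
        "vendor_id": field(regs[0], 0, 2),          # config offset 0x00
        "device_id": field(regs[0], 2, 2),          # 0x02
        "command": field(regs[0], 4, 2),            # 0x04
        "status": field(regs[0], 6, 2),             # 0x06
        "revision_id": field(regs[8], 0, 1),        # 0x08
        "class_code": field(regs[8], 1, 3),         # 0x09-0x0B
        "subsys_vendor_id": field(regs[40], 4, 2),  # 0x2C
        "subsys_id": field(regs[40], 6, 2),         # 0x2E
    }


# Values copied from the CONFIG_REGISTERS_DATA rows in the original post.
regs = {0: 0x12b00146164514e4, 8: 0x0000c02002000015, 40: 0x128a103c00000000}
header = decode_header(regs)
for name, value in header.items():
    print(f"{name:18s} 0x{value:x}")
```

Under that assumption the dump decodes to vendor 0x14e4 (the PCI vendor ID registered to Broadcom), device 0x1645, subsystem vendor 0x103c (Hewlett-Packard) and class code 0x020000 (network controller, Ethernet). That may help when deciding which card to reseat.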

HP support could do a deeper analysis ...
cam9269
Regular Advisor

Re: Sudden Halt on RP4440

Hi guys, thanks for taking the time to check my post. Your input is appreciated.

I was able to run crashinfo on the dumps generated when the server panicked for the second time. From that output it looks like CPU0 was panicking, or at least that is how I understood it. I can't get this to HP at the moment because of support contract issues.

I am attaching the crashinfo output. What I did was disable this CPU and start the system with 1 for now, until the replacement parts arrive.

The thing is, I ran some tests on the CPU using 'mstm' and they all passed without errors (Exercise and Information tasks). I just wasn't able to run 'Diagnose': it does not start when I choose 'RUN' from its options.

Any additional input would be very welcome.

Thanks guys!
KVK Vijay
New Member

Re: Sudden Halt on RP4440

It would be better to replace the faulty CPU 0, because of this:

========================
= Processor Clock Info =
========================

hardclock_late = 0
itick_per_tick = 9998289
lbolt = 6418887 (0x61f1c7)

    event mpi                rpb                delta      clk eiem eirr PSW
cpu type  timeinval          interval timer     secs:ticks od  0,4  0,4  I
--- ----- ------------------ ------------------ ---------- --- ---- ---- ---
  0 PANIC 0x0                0x3a7f1fa140d5     -64328:-67 0   0 0  1 0  0   <--- check the ticks field here
  1 TOC   0x3a7f22bea390     0x3a7f22b09845     0:0        0   1 1  0 0  1
  2 TOC   0x3a7f429af924     0x3a7f42915708     0:0        0   1 1  0 0  1
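
To make the comparison above easier to repeat on other dumps, here is a small Python sketch that scans the clock table and reports rows whose "delta secs:ticks" value is not 0:0. It assumes the delta is the fifth whitespace-separated column, as in the rows quoted here; the function name is made up, and whether a nonzero delta really proves a bad CPU is not something this sketch can decide.

```python
# Sketch only: column positions are assumed from the rows quoted in this
# thread (cpu, event type, mpi, rpb, delta secs:ticks, ...).

def nonzero_deltas(table_text: str) -> list:
    """Return (cpu, event, secs, ticks) for rows whose delta is not 0:0."""
    flagged = []
    for raw in table_text.splitlines():
        tokens = raw.split()
        # Skip headers, separators and anything that is not a data row.
        if len(tokens) < 5 or not tokens[0].isdigit() or ":" not in tokens[4]:
            continue
        secs_text, ticks_text = tokens[4].split(":", 1)
        secs, ticks = int(secs_text), int(ticks_text)
        if secs != 0 or ticks != 0:
            flagged.append((int(tokens[0]), tokens[1], secs, ticks))
    return flagged


sample = """\
cpu type timeinval interval timer secs:ticks od 0,4 0,4 I
--- ----- ------------------ ------------------ ---------- --- ---- ---- ---
0 PANIC 0x0 0x3a7f1fa140d5 -64328:-67 0 0 0 1 0 0
1 TOC 0x3a7f22bea390 0x3a7f22b09845 0:0 0 1 1 0 0 1
2 TOC 0x3a7f429af924 0x3a7f42915708 0:0 0 1 1 0 0 1
"""
flagged = nonzero_deltas(sample)
print(flagged)
```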
cam9269
Regular Advisor

Re: Sudden Halt on RP4440

Hi Guys,

I already replaced CPU0 on Mar 4, but the same thing happened today, and this is now causing alarm. Is there anything else I should look at?

TIA!
cam9269
Regular Advisor

Re: Sudden Halt on RP4440

What's really alarming is that I had already disabled the panicking CPU (CPU0), but the system still crashed. This time it did not create entries in /etc/shutdownlog, nor did it write crash dumps to /var/adm/crash; we only saw the message "Halted" on the VFP. Any other thoughts, guys?


Thanks heaps!
Mridul Shrivastava
Honored Contributor

Re: Sudden Halt on RP4440

I don't think that CPU 0 is the culprit here. I think the crashinfo output is not showing the correct information. If you look at it closely, it shows that the PANIC happened on CPU # -1 and TOCs on the other CPUs 0, 1 and 2.

Looking at the CPU hpa values, there is also a discrepancy:

cpu hpa spu_state
--- --- ---------
0 0xfffffffffe780000 SPU_ENABLED
1 0xfffffffffe781000 SPU_ENABLED
2 0xfffffffffe788000 SPU_ENABLED
3 0x0 SPU_ENABLED

See, the hpa for CPU # 3 is not shown here.

I am not very sure. Can you post the crashinfo -continue output?

Do you have another crash dump? If so, could you post the output for that as well?
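
The same check can be scripted. This is a minimal Python sketch that reads a cpu/hpa/spu_state table like the one above and lists CPUs whose hpa is 0x0. It assumes exactly three whitespace-separated columns per data row, as in the output quoted here; the function name is made up for the example.

```python
# Sketch only: assumes three columns per row (cpu, hpa, spu_state), as in
# the crashinfo excerpt quoted in this thread.

def cpus_without_hpa(table_text: str) -> list:
    """Return the CPU numbers whose hpa column is zero."""
    missing = []
    for raw in table_text.splitlines():
        tokens = raw.split()
        # Skip the header and separator lines.
        if len(tokens) != 3 or not tokens[0].isdigit():
            continue
        cpu = int(tokens[0])
        hpa = int(tokens[1], 16)  # accepts the 0x prefix
        if hpa == 0:
            missing.append(cpu)
    return missing


sample = """\
cpu hpa spu_state
--- --- ---------
0 0xfffffffffe780000 SPU_ENABLED
1 0xfffffffffe781000 SPU_ENABLED
2 0xfffffffffe788000 SPU_ENABLED
3 0x0 SPU_ENABLED
"""
missing = cpus_without_hpa(sample)
print("CPUs with no hpa:", missing)
```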
Time has a wonderful way of weeding out the trivial
cam9269
Regular Advisor

Re: Sudden Halt on RP4440

Thanks for replying Mridul,

Here's a new crashinfo output from the server.
Mridul Shrivastava
Honored Contributor

Re: Sudden Halt on RP4440

Hi,

The first crash has dump time Thu Feb 28 04:42:29 2008. After that, you mentioned that you disabled one CPU (however, we still don't know which one is the culprit). But the second crash has dump time Sat Mar 8 03:56:25 2008, and this time too I can see that all four CPUs were active:

Number of CPU's : 4
Disabled CPU's : 0

So actually none of them were disabled. I am just curious to know how you disabled them.

Still, this output is more or less the same as the previous one:

cpu hpa spu_state
--- --- ---------
0 0xfffffffffe780000 SPU_ENABLED
1 0xfffffffffe781000 SPU_ENABLED
2 0xfffffffffe788000 SPU_ENABLED
3 0x0 SPU_ENABLED



I could find details for the other three CPUs, i.e. 0, 1 and 2, so we are missing information for CPU # 3. I suspect that it could be the culprit. Could you please run cstm on it? Who knows, we may find some trace there.

Best of luck.