HPE 9000 and HPE e3000 Servers

Prefect
Occasional Contributor

Panic Reboot

Hi

We have a two-node cluster running a single package.

Two days ago a problem occurred and the package moved from node 1 to node 2. Then, after 2 hours, node 2 restarted.


When I reviewed the system logs I found the following:

Node 2 has the following in the shutdown log:


01:46 Tue Nov 21 2006. Reboot after panic: , isr.ior = 0'a617fe80.c0000000'c02000c


The ts99 file contained the following:

HP-UX NSS1 B.11.11 U 9000/800 796833775

Return Errors:

----------------- Processor 0 HPMC Information - PDC Version: 45.11 ------

Timestamp = Mon Nov 20 23:38:15 GMT 2006 (20:06:11:20:23:38:15)

HPMC Chassis Codes

Chassis Code Extension
------------ ---------
0xe800035c00e00000 0x00000000007fd16c
0x57000f7300e00000 0x8040004000000000
0xf600105e00e00000 0x000000003f900000
0x140003b200e00000 0x000000000000000b
0x5600109b00e00000 0x000000000002a024


General Registers 0 - 31
00-03 0000000000000000 00000000018ce000 0000000000000000 0000000041ec2040
04-07 000000000d696380 00000000007fd138 0000000000b95fb8 0000000000b95fa0
08-11 0000000000b95fb0 8000000000000000 0000000000000018 0002aa8be5b5b4ed
12-15 0000000000f66500 00000000018d1268 0000000000000000 0000000000000000
16-19 0000000000000000 0000000000000000 0000000000000000 000000000d697280
20-23 0000000000000001 ffffffffa0030000 ffffffffa003000c 0000000000000000
24-27 0000000041e29c00 000000000000040e 0000000041e235c0 0000000000cb7578
28-31 ffffff00ffffffff 0000000000ac9870 0000000000ac98a0 00000000018ce000



Control Registers 0 - 31
00-03 0000000055b6f955 0000000000000000 0000000000000000 0000000000000000
04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000
08-11 000000000000e35f 000000000000ef26 00000000000000c0 0000000000000027
12-15 0000000000000000 0000000000000000 000000000002a000 c400000000000000
16-19 0002aa8beec04bee 0000000000000000 00000000007fd16c 000000000ec01073
20-23 00000000a617fe80 c00000000c02000c 000000ff0804ff1b 0000000000000200
24-27 00000000018ce000 0000000000000000 00000000018ce000 000000004004e848
28-31 000000000000000b 0002aa8beeaf68f3 000000000016e850 000000001938e290


Space Registers 0 - 7
00-03 0000000007e16c00 00000000034f2800 000000000cf9ac00 0000000000000000
04-07 0000000000000000 000000000cf9ac00 0000000005652400 0000000000000000


IIA Space (back entry) = 0x0000000000000000
IIA Offset (back entry) = 0x00000000007fd170
Check Type = 0x20000000
Cpu State = 0x9e000000
Cache Check = 0x00000000
TLB Check = 0x00000000
Bus Check = 0x00000000
Assists Check = 0x00000000
Assist State = 0x00000000
Path Info = 0x00000000
System Responder Address = 0x0000000000000000
System Requestor Address = 0x0000000000000000



Floating Point Registers 0 - 31
00-03 0800002000000000 0000000000000000 0000000000000000 0000000000000000
04-07 000000000000000a 00016b49d2f1a9fc 3ff0000000000005 00000000000003e8
08-11 3f8605816058160c 40f6b49000000000 0000000000000000 0000000000002710
12-15 0000000000000000 0000000000000000 0000000000000000 0000000000000000
16-19 0000000000000000 0000000000000000 0000000000000000 0000000000000000
20-23 0000000000000000 0000000000000000 00000000000003e8 4057400000000000
24-27 0000000000000000 3fe661333ee03145 3f90ecdbf5257500 3ff00e37d84d0f36
28-31 3febdac1ea8bddf6 21552df500000017 0000000021552de9 4177301ff8590b29


PIM Revision = 0x0000000000000001
CPU ID = 0x0000000000000014
CPU Revision = 0x0000000000000032
Cpu Serial Number = 0x44d48618543f0107
Check Summary = 0x8040004000000000
SAL Timestamp = 0x0000000045623c67
System Firmware Rev. = 0x00000b4b0000119f
PDC Relocation Address = 0x000000003f900000
Available Memory = 0x00000001ffe00000
CPU Diagnose Register 2 = 0x3212026000002228
MIB_STAT = 0x0040000000200000
MIB_LOG1 = 0x0000000000555500
MIB_LOG2 = 0x0000800000000000
MIB_ECC_DATA = 0x1010a6c41010aac0
ICache Info = 0x0000000000000000
DCache Info = 0x0000000000000000
Sharedcache Info1 = 0x0000000000000000
Sharedcache Info2 = 0x0000000000000040
MIB_RSLOG1 = 0x0000000000000004
MIB_RSLOG2 = 0x0000010000000000
MIB_RQLOG = 0x02081800bfff1510
MIB_REQLOGa = 0x8000000000000200
MIB_REQLOGb = 0x01000aa400000000
Reserved = 0x0000000000000000
Cache Repair Detail = 0x0000000000000000

PIM Detail Text:



-------------- Memory Error Log Information --------------

No errors logged for this bus

------------ I/O Module Error Log Information ------------

IO Subsystem Log Entries

Found 3 PCI Comp errors
Found 1 PCI Bus error
------------------------------------------------

Detail display of IO subsystem log entries
------------------------------------------

PCI Component Error information

PCI Component Error 1
--- Section Header ---
GUID
data1 0xe429faf6
data2 0x3cb7
data3 0x11d4
datat4 0xbc a7 0 80 c7 3c 88 81
REVISION 0x0200
ERROR_RECOVERY_INFO 0x80
SECTION_LENGTH 0x00000188
VALIDATION_BITS 0x0000000000000023
PCI_COMP_ERROR_STATUS 0x0000000000573000
PCI_COMP_INFO 0x0000000000000000 0x2312107704000300
Vendor Id/Device Id: 0x2312/1077
Base Class/Sub Class/Program Interface: 0x03/0/4
Segment/Bus/Device/Function: 0x0/41/4/0
PCI_COMP_MEM_NUM 0
PCI_COMP_IO_NUM 0
PCI_COMP_REGS_DATA_PAIR
Address Data
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
PCI_COMP_OEM_DATA_STRUCT
--- Section Header ---
GUID
data1 0x4f7d86a
data2 0x598b
data3 0x4a0a
data4 0xaa 62 ff 70 73 46 67 4d
LENGTH 232
PHYSICAL_LOCATION 0x000000ffff03ff85
REGISTRATION_NUMBER 0x0000000000000009
CONFIG_REGISTERS_DATA
Offset Size Data
0 8 0xc330014723121077
8 8 0x000040200c040003
16 8 0xa003000400004001
24 8 0x0000000000000000
32 8 0x0000000000000000
40 8 0x12c7103c00000000
48 8 0x00000044a0000000
56 8 0x0040010000000000
76 8 0x29634120004e5407
0 0 0x0000000000000000
0 0 0x0000000000000000
0 0 0x0000000000000000

End of PCI Component Error Information for Error 1

PCI Component Error 2
--- Section Header ---
GUID
data1 0xe429faf6
data2 0x3cb7
data3 0x11d4
datat4 0xbc a7 0 80 c7 3c 88 81
REVISION 0x0200
ERROR_RECOVERY_INFO 0x80
SECTION_LENGTH 0x00000188
VALIDATION_BITS 0x0000000000000023
PCI_COMP_ERROR_STATUS 0x0000000000573000
PCI_COMP_INFO 0x0000000000000000 0x01a7101404000300
Vendor Id/Device Id: 0x1a7/1014
Base Class/Sub Class/Program Interface: 0x03/0/4
Segment/Bus/Device/Function: 0x0/40/1/0
PCI_COMP_MEM_NUM 0
PCI_COMP_IO_NUM 0
PCI_COMP_REGS_DATA_PAIR
Address Data
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
PCI_COMP_OEM_DATA_STRUCT
--- Section Header ---
GUID
data1 0x4f7d86a
data2 0x598b
data3 0x4a0a
data4 0xaa 62 ff 70 73 46 67 4d
LENGTH 232
PHYSICAL_LOCATION 0x000000ffff03ff85
REGISTRATION_NUMBER 0x000000000000000a
CONFIG_REGISTERS_DATA
Offset Size Data
0 8 0xe330014601a71014
8 8 0x0001402006040003
16 8 0x0000000000000000
24 8 0x6220414140414140
32 8 0x0001fff1a000a000
40 8 0x0000000000000000
48 8 0x0000008000000000
56 8 0x000300ff00000000
128 8 0x0033400800c39007
136 8 0x0020002000200020
0 0 0x0000000000000000
0 0 0x0000000000000000

End of PCI Component Error Information for Error 2

PCI Component Error 3
--- Section Header ---
GUID
data1 0xe429faf6
data2 0x3cb7
data3 0x11d4
datat4 0xbc a7 0 80 c7 3c 88 81
REVISION 0x0200
ERROR_RECOVERY_INFO 0x81
SECTION_LENGTH 0x00000188
VALIDATION_BITS 0x0000000000000023
PCI_COMP_ERROR_STATUS 0x0000000000491000
PCI_COMP_INFO 0x0000000000000000 0x01a7101404000300
Vendor Id/Device Id: 0x1a7/1014
Base Class/Sub Class/Program Interface: 0x03/0/4
Segment/Bus/Device/Function: 0x0/60/1/0
PCI_COMP_MEM_NUM 0
PCI_COMP_IO_NUM 0
PCI_COMP_REGS_DATA_PAIR
Address Data
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000
PCI_COMP_OEM_DATA_STRUCT
--- Section Header ---
GUID
data1 0x4f7d86a
data2 0x598b
data3 0x4a0a
data4 0xaa 62 ff 70 73 46 67 4d
LENGTH 232
PHYSICAL_LOCATION 0x000000ffff02ff85
REGISTRATION_NUMBER 0x000000000000000a
CONFIG_REGISTERS_DATA
Offset Size Data
0 8 0x0230014601a71014
8 8 0x0001402006040003
16 8 0x0000000000000000
24 8 0x2220616140616160
32 8 0x0001fff1b000b000
40 8 0x0000000000000000
48 8 0x0000008000000000
56 8 0x000300ff00000000
128 8 0x0033600800c39007
136 8 0x0020002000200020
0 0 0x0000000000000000
0 0 0x0000000000000000

End of PCI Component Error Information for Error 3

End of PCI Component Error Information
PCI Bus Error information

PCI Bus Error 1
--- Section Header ---
GUID
data1 0xe429faf4
data2 0x3cb7
data3 0x11d4
data4 0xbc a7 0 80 c7 3c 88 81
REVISION 0x0200
ERROR_RECOVERY_INFO 0x84
SECTION_LENGTH 0x00000108
VALIDATION_BITS 0x00000000000007ef
PCI_BUS_ERROR_STATUS 0x0000000000b93000
PCI_BUS_ERROR_TYPE 0x0000000000000000
PCI_BUS_ID 0x0000000000000040
PCI_BUS_ADDRESS 0x0000000000400860
PCI_BUS_DATA 0x0000000000000000
PCI_BUS_CMD 0x000000000000000c
PCI_BUS_REQUESTOR_ID 0x0000000000400800
PCI_BUS_COMPLETER_ID 0x00000000fed24000
PCI_BUS_TARGET_ID 0x0000000000400860
PCI_BUS_OEM_ID 0x0000000000b320d8
Bus OEM Data
CEC Header:
--- OEM Data Header ---

GUID
data1 0x9fe64482
data2 0xa02d
data3 0x4ef7
data4 0xad e6 c6 63 59 62 53 99

--- OEM Data Body ---

CELL_NUMBER 0
SBA_NUMBER 0
ROPE_NUMBER 2
--- Mercury Info ---
ERROR_STATUS 0x0000020100000233
ERROR_MASTER_ID_LOG 0x0000000000000000
INBOUND_ERR_ADDRESS 0x0000000000000000
INBOUND_ERR_ATTRIBUTE 0x0000000000000000
COMPLETION_MESSAGE_LOG 0x0000000000000000
OUTBOUND_ERR_ADDRESS 0x80000000004008e0
ERROR_CONFIG 0x0000000000000030
STATUS_INFO_CONTROL 0x0000000000000000
FUNC_ID 0x03b00146122e103c
CAPABILITIES_LIST 0x0f00023700200002
AGP_COMMAND 0x0000000000000000
PCIX_CAPABILITIES 0x0013ff0000010007
OLR_CONTROL 0x00023f9d00032403
CLOCK_CONTROL 0x0000000000000008
BUS_MODE 0xa1aa64ed27357ce0

End of PCI Bus Error Information for Error 1

End of PCI Bus Error Information

Return Warnings:



Return Revisions:

FRU INFORMATION

Module Revision
------ --------
PA 8800 CPU Module 3.2 PA 8800 CPU Module 3.2
Board Info!
Format Version : 0x1 Language Code : 0x0
Mfg Date : Mfg Name : JABIL
Product Name : rp3440 SYSTEM BOARD
Serial Number : 52JAPE4524000806
Part Number : A7136-60001
Fru File Tp/Len : 0x1 Fru File :
Revision : A1 Eng Date Code : 4510
Artwork Rev : D Fru Info :


Could you please advise on the cause of the failure?
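
A small cross-check on the dump: the SAL Timestamp field looks like a plain Unix epoch value (seconds since 1970-01-01 UTC). That reading is an assumption, not something the ts99 output documents, and the helper name below is made up for illustration. A minimal Python sketch to test it against the human-readable Timestamp line of the HPMC record:

```python
from datetime import datetime, timezone

def decode_sal_timestamp(hex_value):
    """Interpret a hex string as Unix epoch seconds (assumed encoding)."""
    seconds = int(hex_value, 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# Value copied from the "SAL Timestamp" line of the PIM dump.
stamp = decode_sal_timestamp("0x0000000045623c67")
print(stamp.isoformat())  # 2006-11-20T23:38:15+00:00
```

The result agrees with "Timestamp = Mon Nov 20 23:38:15 GMT 2006" in the same record, which supports the epoch assumption and gives the HPMC time in GMT for comparison with the shutdown log.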
7 REPLIES
Steven E. Protter
Exalted Contributor

Re: Panic Reboot

Shalom,

This is a critical failure of an important component: an HPMC (High Priority Machine Check).

That points to a CPU, a memory module, or another critical component.

You need to have HP Hardware come out and do the final diagnosis.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Coolmar
Esteemed Contributor

Re: Panic Reboot

Looks like you have a hardware error. Did the server come back up? If so, are errors still being reported?
I would open a call with HP right away and give them all the information that you reported in this thread.
Alex Glennie
Honored Contributor

Re: Panic Reboot

It would be useful to know the h/w involved here ....

At a guess I'd say the node affected is an RP3XXX or RP4XXX system in which case a decode of the ts99 yields ...

starting analysis

Problem [0x13] CCO, Rope2 LBA, FE: LBA observed PERR# asserted while driving out PIO, DMA read split completion, or P2P write data

Possible Cause :
Bad PCI device(s)
Possible Fix :
Replace PCI device(s) hosted by rope 2 (slot 3).

Possible Cause :
Internal LBA error
Possible Fix :
Replace I/O card cage.

As stated already, this looks like a hardware issue; I advise contacting HP hardware support to confirm.
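
For anyone who wants to cross-check the decode above against the raw dump, here is a minimal Python sketch. It assumes that each offset-0 row under CONFIG_REGISTERS_DATA holds PCI configuration bytes 0x00-0x07 as one 64-bit value, with the Vendor ID in the lowest 16 bits, followed by the Device ID, Command and Status registers. That layout is an assumption, though the low 16 bits (0x1077 and 0x1014) do match the registered vendor IDs of QLogic and IBM. The function names are hypothetical, and the bit names are the standard PCI Status register error bits.

```python
# Offset-0 CONFIG_REGISTERS_DATA values copied from the three PCI component
# error records in the ts99 output, keyed by Segment/Bus/Device/Function.
RECORDS = {
    "Error 1 (0x0/41/4/0)": 0xC330014723121077,
    "Error 2 (0x0/40/1/0)": 0xE330014601A71014,
    "Error 3 (0x0/60/1/0)": 0x0230014601A71014,
}

# Error-related bits of the standard PCI Status register.
STATUS_BITS = {
    8: "Master Data Parity Error",
    11: "Signaled Target Abort",
    12: "Received Target Abort",
    13: "Received Master Abort",
    14: "Signaled System Error",
    15: "Detected Parity Error",
}

def decode_header(qword):
    """Split config bytes 0x00-0x07 into four 16-bit fields (assumed layout)."""
    return {
        "vendor": qword & 0xFFFF,
        "device": (qword >> 16) & 0xFFFF,
        "command": (qword >> 32) & 0xFFFF,
        "status": (qword >> 48) & 0xFFFF,
    }

def status_flags(status):
    """Return the names of the error bits set in a Status register value."""
    return [name for bit, name in sorted(STATUS_BITS.items()) if status & (1 << bit)]

for label, qword in RECORDS.items():
    h = decode_header(qword)
    print(f"{label}: vendor=0x{h['vendor']:04x} device=0x{h['device']:04x} "
          f"status=0x{h['status']:04x} flags={status_flags(h['status'])}")
```

If the layout assumption holds, the devices in Error 1 and Error 2 both have Detected Parity Error and Master Data Parity Error latched, while the device in Error 3 has no error bits set. That would be consistent with the PERR# reported in the decode, but it is no substitute for HP's own analysis.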
Bharath PSB
Advisor

Re: Panic Reboot

It is clear that a P2P write operation caused a fatal error on the LBA connected to rope 2. A normal restart of the partition would clear the logged errors. After the reboot, collect the PIM and error logs, which are needed for debugging the issue.

HP is coming up with a new solution called PCI Error Recovery to localize errors in situations like this. With it, the partition would not reboot; instead, only the affected rope would be deconfigured.
mits
Respected Contributor

Re: Panic Reboot

Alex,

It appears you have some special tool for decoding TS99. How did you get that tool?

Mits
Prashanth.D.S
Honored Contributor

Re: Panic Reboot

Hi,

Would it be possible for you to attach the actual ts99 output?

I suspect the issue is with CPU 0, but I would like to analyse the actual ts99 file; the pasted output is not enough.

Best Regards,
Prashanth
Michael Steele_2
Honored Contributor

Re: Panic Reboot

I'm kind of surprised no one asked you to investigate your crash dump, which is something HP will need to accurately diagnose the problem. Refer to /var/adm/crash or wherever your default savecrash directory has been designated. Also, it's not too late to try to recover the failed node's crash dump, especially if the node has been inactive (and it has, since the other node has the package).

http://docs.hp.com/en/J2237-90005/ch06s05.html?btnNext=next%A0%BB

The FRU (field replaceable unit) information also identifies a PA 8800 CPU module on an rp3440 system board; note the system board part number A7136:

FRU INFORMATION

Module Revision
------ --------
PA 8800 CPU Module 3.2 PA 8800 CPU Module 3.2
Board Info!
Format Version : 0x1 Language Code : 0x0
Mfg Date : Mfg Name : JABIL
Product Name : rp3440 SYSTEM BOARD
Serial Number : 52JAPE4524000806
Part Number : A7136-60001
Fru File Tp/Len : 0x1 Fru File :
Revision : A1 Eng Date Code : 4510
Artwork Rev : D Fru Info :