Integrity Servers

Re: Superdome(IA64) event.log (Event 100212, 100213)

 
SOLVED
Hyunchul Lee_1
Occasional Contributor

Superdome(IA64) event.log (Event 100212, 100213)

Superdome EMS Event (event.log)
HP ITRC recommended replacing the core I/O card in slot0/cell0, but the same EMS events are still occurring.
Please help analyze the events and advise on the likely cause.
The events are below.


>------------ Event Monitoring Service Event Notification ------------<

Notification Time: Wed May 17 04:31:19 2006

bsum03 sent Event Monitor notification information:

/system/events/cpe/cperrors is >= 1.
Its current value is MAJORWARNING(3).



Event data from monitor:

Event Time..........: Wed May 17 04:31:19 2006
Severity............: MAJORWARNING
Monitor.............: cpe_em
Event #.............: 100212
System..............: bsum03

Summary:
A Corrected OEM Platform Error was reported.


Description of Error:

A platform error was corrected by the firmware/hardware. The error
occurred in the System Bus Adapater Link Interface of cell (0). More
information is available in the Event Details section of this event.

Probable Cause / Recommended Action:

Contact your HP Support Representative to have the System Bus Adapter
Interfaces checked.

Additional Event Data:
System IP Address...: XXXX
Event Id............: 0x446a288700000000
Monitor Version.....: B.01.00
Event Class.........: CPE
Client Configuration File...........:
/var/stm/config/tools/monitor/default_cpe_em.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 3
Received within...: 1 day(s)
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: ia64 hp superdome server SD32A
EMS Version.....................: A.04.20
STM Version.....................: C.51.00
OS Version......................: B.11.23
System Serial Number............: SGH454900A
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/cpe_em.htm#100212

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v



Error Details:

Error Recovery Info : 0x81

Corrected Platform Error (REO OEM) Record:

Validation Bits: 0x000000000000006d Error Status: 0x0000000000041600
Requestor Id: Not valid Responder Id: 0x0000000000000001
Target Id: 0x0000000000000001 Bus Data: Not valid
OEM Component Id:0x00000000127b103c 000000000000000000

REO OEM data:

Cell Number: 0x0000000000000001 Pri Err: 0x0000000020000001
Sec Err: 0x0000000000000001 FE Err Enable: 0x0000000000fff4f6
Unc Err Enable: 0x0000000000000308 Cor Err Enable: 0x0000000000000001
Data A Syn: 000000000000000000 Data B Syn: 000000000000000000
Data C Syn: 000000000000000000 INB Cmd Stat: 000000000000000000
INB Perr Stat: 000000000000000000 INB IllCmd Stat:000000000000000000
Syn SW Err: 0x010004000000000c Syn Poison: 000000000000000000
Syn Improp: 000000000000000000 Syn MW Err: 000000000000000000
Syn Unrec: 000000000000000000


=============================================================================
Explanation(s):

Error Recovery Info : 0x81
* Error has been corrected
Error Status : 0x0000000000041600
* Bus Parity Error
* On Data Signal or Address transaction


>---------- End Event Monitoring Service Event Notification ----------<

>------------ Event Monitoring Service Event Notification ------------<

Notification Time: Wed May 17 09:01:20 2006

bsum03 sent Event Monitor notification information:

/system/events/cpe/cperrors is >= 1.
Its current value is MAJORWARNING(3).



Event data from monitor:

Event Time..........: Wed May 17 09:01:20 2006
Severity............: MAJORWARNING
Monitor.............: cpe_em
Event #.............: 100213
System..............: bsum03

Summary:
A Corrected Platform Error was reported.


Description of Error:

A platform error was corrected by the firmware/hardware. The error
occurred in the PCI card in slot (0) of cell (0). More information is
available in the Event Details section of this event.

Probable Cause / Recommended Action:

Contact your HP Support Representative to have the System Bus Adapter
Interfaces to the I/O Adapter(s) checked.

Additional Event Data:
System IP Address...: XXXX
Event Id............: 0x446a67d000000000
Monitor Version.....: B.01.00
Event Class.........: CPE
Client Configuration File...........:
/var/stm/config/tools/monitor/default_cpe_em.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 3
Received within...: 1 day(s)
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: ia64 hp superdome server SD32A
EMS Version.....................: A.04.20
STM Version.....................: C.51.00
OS Version......................: B.11.23
System Serial Number............: SGH454900A
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/cpe_em.htm#100213

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v



Error Details:

Error Recovery Info : 0x80

Corrected Platform Error (REO Rope Unit) Record:

Validation Bits: 0x0000000000000061 Error Status: 000000000000000000
Requestor Id: Not valid Responder Id: Not valid
Target Id: Not valid Bus Data: Not valid
OEM Component Id:0x00000000127c103c 000000000000000000

REO Unit data:

Cell Number: 0x0000000000000001 Rope NUmber: 000000000000000000
Pri Err: 000000000000000000 Sev Err: 000000000000000000
FE Err Enable: 0x0000000000000210 Unc Err Enable: 0x0000000000000007
Cor Err Enable: 0x0000000000000180 TLB DVI Err: 000000000000000000
TLB FetchToSyn: 0x000000000000000f TLB RtnInvSyn: 0x000000000000000f
TLB AccErrSyn: 000000000000000000 REGF PERR Syn A:000000000000000000
REGF PERR Syn B: 000000000000000000 TIDF PERR Syn: 000000000000000000
CDF PERR Syn: 000000000000000000 FIP Syn: 000000000000000000
Cdata PERR Syn: 0x00006cb400000000
CdataDiagRDSyn P: 0x00000000000077da
CdataDiagRdSyn D0:0xfff34afeffafb4d5
CdataDiagRdSyn D0:0xd6e77de7fb3ff9b5
Unexp MR Syn: 000000000000000000 RD Rtn Err Syn: 000000000000000000
RU FIFO Syn 1: 000000000000000000 RU FIFO Syn 2: 000000000000000000
RU FIFO Syn 3: 000000000000000000


=============================================================================
Explanation(s):

Error Recovery Info : 0x80
* Error has not been corrected
Error Status : 000000000000000000


>---------- End Event Monitoring Service Event Notification ----------<

Sameer_Nirmal
Honored Contributor

Re: Superdome(IA64) event.log (Event 100212, 100213)

Hi,

Are you saying these events are still occurring in spite of replacing the PCI card in slot0/cell0?
It is worth taking a look at:
MP SL logs
efi shell > errdump cpe
efi shell > info -b all

CPEs usually occur for I/O-related components such as SBAs and PCI adapters. Event 100213 points to the PCI card in slot0/cell0, and it reports the CPE as not corrected.
Hyunchul Lee_1
Occasional Contributor

Re: Superdome(IA64) event.log (Event 100212, 100213)

Thank you, Sameer Nirmal...
I have replaced the PCI card in slot0/cell0,
but that did not resolve the problem.

HPRC then suggested replacing the PCI card in slot0/cell1.

According to HPRC, the detailed log above shows that cell0 has no problem and that cell1 is the source of these events.

So I will replace the PCI card in cell1 next weekend.

Thank you for your reply.
Stefan Stechemesser
Honored Contributor
Solution

Re: Superdome(IA64) event.log (Event 100212, 100213)

Hi,

You should open a support call with HP Support for this problem.

The error is NOT related to Cell 0, but to Cell 1.

In the details of the EMS message you find:

REO OEM data:

Cell Number: 0x0000000000000001 Pri Err: 0x0000000020000001

This is a known bug in the EMS diagnostics: the description text always reports Cell 0. It will be fixed in the next diagnostic version.

The other thing is that this error is normally NOT caused by a PCI card. It looks more like a (correctable) single-wire error on the connection between the PCI card cage and cell 1.
HP Support will recommend the actions needed to troubleshoot this problem after analyzing all system logs.

This correctable error is normally not serious and will not crash the system, but troubleshooting should be done during the next available downtime.

best regards

Stefan
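
The mismatch described above (description text says cell 0, register dump says cell 1) can be checked mechanically on a saved event. The following is only a minimal sketch, assuming the event text has the same layout as the notifications pasted in this thread; `reported_cells` and `sample` are hypothetical names for illustration and are not part of EMS or STM.

```python
import re

def reported_cells(event_text):
    """Return (cell named in the description, cell in the register dump).

    Either value is None if the corresponding line is not found.
    """
    # Description text, e.g. "... Link Interface of cell (0)."
    desc = re.search(r"cell \((\d+)\)", event_text)
    # Register dump, e.g. "Cell Number: 0x0000000000000001"
    dump = re.search(r"Cell Number:\s*((?:0x)?[0-9a-fA-F]+)", event_text)
    desc_cell = int(desc.group(1)) if desc else None
    dump_cell = int(dump.group(1), 16) if dump else None
    return desc_cell, dump_cell

# Lines copied from event 100212 in this thread.
sample = """\
occurred in the System Bus Adapater Link Interface of cell (0). More
REO OEM data:
Cell Number: 0x0000000000000001 Pri Err: 0x0000000020000001
"""

desc_cell, dump_cell = reported_cells(sample)
if desc_cell != dump_cell:
    print(f"description says cell {desc_cell}, register dump says cell {dump_cell}")
```

A disagreement between the two values is a reason to raise the point with HP Support, not a diagnosis in itself.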
Sameer_Nirmal
Honored Contributor

Re: Superdome(IA64) event.log (Event 100212, 100213)

Hi Stefan,

Just seeking clarification here..

As far as these two events are concerned, event 100212 is limited to the REO OEM data, which I guess is (embedded?) hardware, the SBA link interface in this case. The CPE is corrected at this device level. In event 100213, the CPE is not corrected at the REO Unit (non-embedded?) level, which is slot0/cell1.

100212
The error
occurred in the System Bus Adapater Link Interface of cell (0).
REO OEM data
Error Recovery Info : 0x81
* Error has been corrected

100213
The error
occurred in the PCI card in slot (0) of cell (0)
REO Unit data
Error Recovery Info : 0x80
* Error has not been corrected

The details of event 100213 do point to a problem at slot0/cell1. Pinpointing the right (though malfunctioning) hardware is done through the event description, which is good for troubleshooting. Is it possible to verify the culprit with the OEM Component Id? Can any relation be made with the FRU data?
Stefan Stechemesser
Honored Contributor

Re: Superdome(IA64) event.log (Event 100212, 100213)

Please contact HP Support for such detailed questions.
Only one hint: if the second (PCI bus) error had been uncorrectable, the system would have crashed with an MCA (machine check abort). I don't know why the second set of registers was collected and displayed in this case, but Error Status: 000000000000000000, Pri Err: 000000000000000000 and Sev Err: 000000000000000000 suggest that:

There was no error on this component (PCI card). The decoding of the registers in the first message can only be done by HP support.

best regards

Stefan
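
The two observations in this hint can be written down as a small check. This is a rough sketch and not an official decoder: the bit layout of Error Recovery Info (bit 7 = field valid, bit 0 = corrected) is an assumption that merely agrees with the two decodings printed in the events above (0x81 "corrected", 0x80 "not corrected"), and the function names are hypothetical.

```python
def decode_recovery_info(value):
    """Split the Error Recovery Info byte into two flags.

    Assumed layout (not confirmed in this thread):
    bit 7 = field is valid, bit 0 = error was corrected.
    """
    return {"valid": bool(value & 0x80), "corrected": bool(value & 0x01)}

def all_zero(registers):
    """True if every register string in the mapping decodes to 0."""
    return all(int(text, 16) == 0 for text in registers.values())

# Values copied from event 100213 in this thread.
event_100213 = {
    "Error Status": "000000000000000000",
    "Pri Err": "000000000000000000",
    "Sev Err": "000000000000000000",
}

info = decode_recovery_info(0x80)
if not info["corrected"] and all_zero(event_100213):
    print("flag says 'not corrected', but no error bits are set in the record")
```

Under these assumptions, event 100213 carries a "not corrected" flag together with empty status registers, which matches the reading that no error was actually logged for the PCI card.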

Alan_152
Honored Contributor

Re: Superdome(IA64) event.log (Event 100212, 100213)

Is all your firmware up to date and matched? Download a copy of the SmartSetup cd and use it to make sure that the 'dome is completely up to date...