HPE 9000 and HPE e3000 Servers

superdome sd16a cpu problem

freedom_1
New Member

superdome sd16a cpu problem

Hi,

I have a Superdome SD16A that reports CPU errors. Do I need to replace the CPU, and how can I determine its physical location ( Cab 0 Cell 0 CPU 4 )?

Thank you.

CURRENT MONITOR DATA:

Event Time..........: Sun Apr 26 09:45:23 2009
Severity............: CRITICAL
Monitor.............: fpl_em
Event #.............: 1698
System..............: zhkf1

Summary:
Machine check type could not be determined.


Description of Error:

The Reporting Entity CPU experienced a trap that has caused an asynchronous
branch to the machine check handler, but CPU logs do not indicate that an HPMC,
LPMC or TOC has occurred. The data field will contain the CPU Check Summary.
This Check Summary is described in the return value description for
CpuProcessMachineCheck in PA-8800 CPU Library Application

Probable Cause / Recommended Action:


Contact HP Support. Save event list and Processor HPMC PIM for analysis by lab.
-


Additional Event Data:
System IP Address...: 133.224.202.13
Event Id............: 0x49f3bcb300000000
Monitor Version.....: A.01.00
Event Class.........: System
Client Configuration File...........:
/var/stm/config/tools/monitor/default_fpl_em.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 1
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: 9000/800/SD16A
EMS Version.....................: A.04.20
STM Version.....................: A.45.00
System Serial Number............: SGH443838R
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/fpl_em.htm#1698

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v


IPMI event hex: 0xf7800c9604e00000 000000000000000000
Time Stamp: Sun Apr 26 01:11:05 2009
Event keyword: ERR_CHECK_FALL_THROUGH
Alert level name: Fatal
Reporting vers: 1
Data field type: Status return from function call
Decoded data field:
Reporting entity ID: 4 ( Cab 0 Cell 0 CPU 4 )
Reporting entity Full Name: System Firmware
IPMI Event ID : 3222 (0xc96)



#[/var/opt/resmon/log]parstatus
Warning: No action specified. Default behaviour is display all.
[Complex]
Complex Name : Complex 2
Complex Capacity
Compute Cabinet (4 cell capable) : 1
Active GSP Location : cabinet 0
Model : 9000/800/SD16A
Serial Number : SGH443838R
Current Product Number : A6113A
Original Product Number : A6113A
Complex Profile Revision : 1.0
The total number of Partitions Present : 1

[Cabinet]
                 Cabinet   I/O       Bulk Power Backplane
                 Blowers   Fans      Supplies   Power Boards
                 OK/       OK/       OK/        OK/
Cab              Failed/   Failed/   Failed/    Failed/
Num Cabinet Type N Status  N Status  N Status   N Status     GSP
=== ============ ========= ========= ========== ============ ======
0   SD16A        4/ 0/ N+  5/ 0/ N+  4/ 0/ N+   2/ 0/ N      active

Notes: N+ = There are one or more spare items (fans/power supplies).
N = The number of items meets but does not exceed the need.
N- = There are insufficient items to meet the need.
? = The adequacy of the cooling system/power supplies is unknown.

[Cell]
                        CPU     Memory                                Use
                        OK/     (GB)                          Core    On
Hardware   Actual       Deconf/ OK/                           Cell    Next Par
Location   Usage        Max     Deconf    Connected To        Capable Boot Num
========== ============ ======= ========= =================== ======= ==== ===
cab0,cell0 active core  8/0/8   32.0/ 0.0 cab0,bay1,chassis3  yes     yes  0
cab0,cell1 active base  8/0/8   32.0/ 0.0 -                   no      yes  0
cab0,cell2 active base  8/0/8   32.0/ 0.0 -                   no      yes  0
cab0,cell3 active base  8/0/8   32.0/ 0.0 cab0,bay0,chassis3  yes     yes  0

[Chassis]
                                 Core Connected  Par
Hardware Location   Usage        IO   To         Num
=================== ============ ==== ========== ===
cab0,bay0,chassis0  absent       -    -          -
cab0,bay0,chassis1  absent       -    -          -
cab0,bay0,chassis2  absent       -    -          -
cab0,bay0,chassis3  active       yes  cab0,cell3 0
cab0,bay1,chassis0  absent       -    -          -
cab0,bay1,chassis1  absent       -    -          -
cab0,bay1,chassis2  absent       -    -          -
cab0,bay1,chassis3  active       yes  cab0,cell0 0

[Partition]
Par              # of  # of I/O
Num Status       Cells Chassis  Core cell  Partition Name (first 30 chars)
=== ============ ===== ======== ========== ===============================
0   active       4     2        cab0,cell0 Partition 0
zhkf1#[/var/opt/resmon/log]parstatus -C
[Cell]
                        CPU     Memory                                Use
                        OK/     (GB)                          Core    On
Hardware   Actual       Deconf/ OK/                           Cell    Next Par
Location   Usage        Max     Deconf    Connected To        Capable Boot Num
========== ============ ======= ========= =================== ======= ==== ===
cab0,cell0 active core  8/0/8   32.0/ 0.0 cab0,bay1,chassis3  yes     yes  0
cab0,cell1 active base  8/0/8   32.0/ 0.0 -                   no      yes  0
cab0,cell2 active base  8/0/8   32.0/ 0.0 -                   no      yes  0
cab0,cell3 active base  8/0/8   32.0/ 0.0 cab0,bay0,chassis3  yes     yes  0
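
For anyone who wants to script this check, here is a minimal Python sketch that pulls the location out of the "Reporting entity ID" line and the OK/Deconf/Max CPU counts out of a parstatus -C cell row. The function names and regular expressions are assumptions based only on the output pasted above; they are not part of any HP tool and may need adjusting for other firmware or parstatus versions.

```python
import re

# Pattern for the location inside an fpl_em "Reporting entity ID" line,
# e.g. "Reporting entity ID: 4 ( Cab 0 Cell 0 CPU 4 )".
ENTITY_RE = re.compile(r"Cab\s+(\d+)\s+Cell\s+(\d+)\s+CPU\s+(\d+)")


def parse_entity(line):
    """Return {'cab', 'cell', 'cpu'} as ints, or None if the line does not match."""
    m = ENTITY_RE.search(line)
    if m is None:
        return None
    cab, cell, cpu = (int(g) for g in m.groups())
    return {"cab": cab, "cell": cell, "cpu": cpu}


def parse_cell_cpus(row):
    """Return the cell location and its CPU OK/Deconf/Max counts from one row."""
    loc = re.match(r"\s*(cab\d+,cell\d+)\s", row)
    # The CPU column is the only field with three slash-separated integers.
    counts = re.search(r"\s(\d+)/(\d+)/(\d+)\s", row)
    if loc is None or counts is None:
        return None
    ok, deconf, maximum = (int(g) for g in counts.groups())
    return {"location": loc.group(1), "ok": ok, "deconf": deconf, "max": maximum}


if __name__ == "__main__":
    print(parse_entity("Reporting entity ID: 4 ( Cab 0 Cell 0 CPU 4 )"))
    print(parse_cell_cpus(
        "cab0,cell0 active core 8/0/8 32.0/ 0.0 cab0,bay1,chassis3 yes yes 0"
    ))
```

For the cell 0 row above, the deconf count comes back as 0, which matches the 8/0/8 column: all eight CPUs on that cell are still configured even though the event was logged.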
Khairy
Esteemed Contributor
Solution

Re: superdome sd16a cpu problem

Hi,

Take out the cell board and put it on a mat.

The memory DIMM slots are all on the right of the cell board.

The CPUs are on the left side. The placement is as follows.

+---------------------------------+
|cpu-0 cpu-1 memory-slots .......|
|cpu-2 cpu-3 memory-slots........|
+---------------------------------+

From what I see, this cell board contains 4 x dual-core processors. So I'm guessing the 4th processor is in the cpu-3 slot.
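
One way to sanity-check that guess is to work out the slot from the CPU number in the event. The sketch below is only an illustration: it assumes the "CPU 4" in the event is a zero-based core number and that cores are numbered contiguously per socket (cores 0-1 in cpu-0, 2-3 in cpu-1, and so on). Neither assumption is confirmed by HP documentation in this thread, and the function name is made up for the example.

```python
def socket_for_core(core_index, cores_per_socket=2, sockets=4):
    """Map a zero-based core number to a zero-based socket (slot) number.

    ASSUMPTION: cores are numbered contiguously per socket. Check the
    service manual for the real numbering before pulling any part.
    """
    total = cores_per_socket * sockets
    if not 0 <= core_index < total:
        raise ValueError(f"core index must be in 0..{total - 1}")
    return core_index // cores_per_socket


if __name__ == "__main__":
    # Core number taken from "Cab 0 Cell 0 CPU 4" in the event.
    print("cpu-%d" % socket_for_core(4))
```

Under those assumptions core 4 lands in the cpu-2 slot rather than cpu-3, so it is worth confirming the numbering with HP before replacing anything.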

Rgds

PS: I'd really appreciate it if you could assign points.

Michael Steele_2
Honored Contributor

Re: superdome sd16a cpu problem

Hi

If you read the event description...

Event 1698
Severity: CRITICAL
Event Summary: Machine check type could not be determined.
Event Class: System
Problem Description:
The Reporting Entity CPU experienced a trap that has caused an asynchronous branch to the machine check handler, but CPU logs do not indicate that an HPMC, LPMC or TOC has occurred. The data field will contain the CPU Check Summary. This Check Summary is described in the return value description for CpuProcessMachineCheck in PA-8800 CPU Library Application
Cause / Action:
cause: Contact HP Support. Save event list and Processor HPMC PIM for analysis by lab.
action: -
Automated Recovery: None
Event Generation Threshold: 1 occurrence


...you'll note that there is no HW failure.

Also, having been through a few failed Superdome cell boards, I wouldn't attempt the replacement yourself unless you know what you're doing. Right now you have one vPar affected. Incorrect replacement of the cell board, i.e., separating it into two unbolted units for an easy fit, will bring down the whole nPar.

Are you in a production environment?
freedom_1
New Member

Re: superdome sd16a cpu problem

Thanks a lot. It's a production environment, and I got the alarm three times in a week. I have replaced the CPU located in the cpu-2 slot. I hope that fixes the problem. I will reply if it works.
freedom_1
New Member

Re: superdome sd16a cpu problem

It has been more than a month since the CPU replacement, and the CPU is working well.

Thanks, all.