HPE 9000 and HPE e3000 Servers

Genesis_1
Occasional Advisor

superdome EMS report I/O link error

Hi, my Superdome's EMS has been reporting an I/O link error for months, but the system has been running normally. I suspect that the REO link cable is defective. The problem description is as follows:

FRU Physical Location: 0x00ffff01ffffff93
FRU Source = 9 (cell)
Source Detail = 3 (coherency controller)
Cabinet Location = 0
Cell Location = 1
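
The decoded fields above appear to line up with individual bytes of the FRU Physical Location value. The following is only a sketch of that observation: the byte layout (cabinet in the first byte, cell in the fourth, source and source detail in the two nibbles of the last byte) is an assumption inferred from this single sample, not a documented format, and `decode_fru_location` is a hypothetical helper name.

```python
def decode_fru_location(value: int) -> dict:
    """Split a 64-bit FRU physical location into fields.

    The layout is guessed from one sample event and is NOT taken
    from HP documentation.
    """
    raw = value.to_bytes(8, "big")
    return {
        "cabinet": raw[0],              # assumed: first byte
        "cell": raw[3],                 # assumed: fourth byte
        "source": raw[7] >> 4,          # assumed: high nibble of last byte
        "source_detail": raw[7] & 0xF,  # assumed: low nibble of last byte
    }


fields = decode_fru_location(0x00FFFF01FFFFFF93)
print(fields)
```

For the value in this event, the sketch yields cabinet 0, cell 1, source 9 and source detail 3, which matches the lines printed by EMS.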

RIN_ERR_PRI_MODE..........: 0x0000000000000008
REO input single wire error

CECC_DATA_MSB_0...........: 0x0000000000000383
CECC_DATA_LSB_0...........: 0xc064d030322e0c4e
CECC_DATA_MSB_1...........: 0x0000000000000383
CECC_DATA_LSB_1...........: 0xc064d030322e0c4e


>---------- End Event Monitoring Service Event Notification ----------<

>------------ Event Monitoring Service Event Notification ------------<

Notification Time: Sun Dec 16 11:42:00 2007

dds2 sent Event Monitor notification information:

/system/events/core_hw/core_hw is >= 1.
Its current value is SERIOUS(4).



Event data from monitor:

Event Time..........: Sun Dec 16 11:41:59 2007
Severity............: SERIOUS
Monitor.............: dm_core_hw
Event #.............: 85
System..............: dds2

Summary:
I/O link interface to cell controller recovered errors


Description of Error:

The cell controller (CC) chip has detected and corrected multiple errors
in data transferred to it from the I/O bus adapter (REO) chip to which it
is connected.

Probable Cause / Recommended Action:

The inbound I/O link cable is unreliable.
Contact your HP support representative to check the inbound I/O link
cable.

There may be a problem with the CC chip or cell board.
Contact your HP support representative to check the cell board.

There may be a problem with the I/O backplane.
Contact your HP support representative to check the I/O backplane.

Additional Event Data:
System IP Address...: 10.93.4.12
Event Id............: 0x47649e8800000000
Monitor Version.....: B.01.00
Event Class.........: System
Client Configuration File...........:
/var/stm/config/tools/monitor/default_dm_core_hw.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 3
Associated OS error log entry id(s):
None
Additional System Data:
System Model Number.............: 9000/800/SD32000
OS Version......................: B.11.11
STM Version.....................: A.29.00
EMS Version.....................: A.03.20
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/dm_core_hw.htm#85
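
Since the same event has been arriving about once a day, it can help to pull the `Name......: value` fields out of the saved notifications and tally them. This is a minimal sketch that assumes the notification text has been saved as plain text in the layout shown above; `parse_ems_fields` is a hypothetical helper, not part of EMS or STM.

```python
import re

# Matches lines such as "Severity............: SERIOUS":
# a field name, two or more dots, a colon, then the value.
FIELD_RE = re.compile(r"^\s*([A-Za-z#][A-Za-z0-9 #_/]*)\.{2,}:\s*(.*)$")


def parse_ems_fields(text: str) -> dict:
    """Return the dotted 'name....: value' pairs found in an EMS notification."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields


sample = """\
Event Time..........: Sun Dec 16 11:41:59 2007
Severity............: SERIOUS
Monitor.............: dm_core_hw
Event #.............: 85
System..............: dds2
Number of events..: 3
"""

event = parse_ems_fields(sample)
print(event["Event #"], event["Severity"], event["Event Time"])
```

Lines without the dotted separator, such as the free-text description, are ignored by this pattern.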



There is an error in the HPMC trace file:

10215: ------- Analyzing CC1, RIN_ERR_PRI_MODE CSR:
10216: RIN_ERR_PRI_MODE_CSR = 0x0000000000000008
10217: RIN_ERR_ENABLE_MASK CSR = 0x000000001fffffff
10218: RIN_FE_UPGRADE_CONFIG CSR = 0x0000000000000fc0
10219: RIN_DR_UPGRADE_CONFIG CSR = 0x0000000000000000
10220: Problem: (3)CC1, RIN Corr Err: Link one-bit failure in same position for
10221: 1 or more cycles. Corrected by HW.
10222: Possible Cause 1: RIO link cable connected to cell 1 has a poor
10223: connection or is defective. [12270208]
10224: Possible Fix 1: Reseat or replace RIO link cable.
10225: Possible Cause 2: RIO chip is defective.
10226: Possible Fix 2: Replace HIOB connected to cell 1.
10227: Possible Cause 3: CC chip on cell 1 or cell board is defective.
10228: Possible Fix 3: Replace Cell board 1.
10229:
10230:
10231: ------- Analyzing CC1, RIN_ERR_SEC_MODE CSR:
10232: RIN_ERR_SEC_MODE_CSR = 0x0000000000000008
10233: Problem: (3)CC1, RIN Corr Err: Link one-bit failure in same position for
10234: 1 or more cycles. Corrected by HW.
10235: Possible Cause 1: RIO link cable connected to cell 1 has a poor
10236: connection or is defective. [12270208]
10237: Possible Fix 1: Reseat or replace RIO link cable.
10238: Possible Cause 2: RIO chip is defective.
10239: Possible Fix 2: Replace HIOB connected to cell 1.
10240: Possible Cause 3: CC chip on cell 1 or cell board is defective.
10241: Possible Fix 3: Replace Cell board 1.
10242:
10243:
10244: Note: CC1, RIN GSM Hdr Log is NOT valid.
10245: Note: CC1, RIN Uncor. Hdr Log is NOT valid.
10246: Note: CC1, RIN Uncor. ECC Data Log is NOT valid.
10247: Note: CC1, RIN Uncor. ECC Cyc Log is NOT valid.
10248: Note: CC1, RIN FE Hdr Log is NOT valid.
10249: Note: CC1, RIN No Pres. Log is NOT valid.
10250:
10251: Note: CC1, RIN Cor. ECC Data MSB Log0 is valid.
10252: Note: CC1, RIN Single ECC Wire Log is valid.
10253:
10254: ------- Analyzing CC1 RIN_SGL_ECC_WIRE_LOG:
10255: RIN_SGL_ECC_WIRE_LOG CSR = 0x0000000004002000
10256: CC1 RIN block corrected single wire error in RIO link wire number 13.
10257: CC1 RIN block corrected single bit error in RIO link data row 2
------- Analyzing cell 1 RIO logs:
10260:
10261: Warning: RIO 0 Link PRIMARY_ERROR_LOG CSR connected to cell 1 not
10262: stored. - Analysis skipped.
10263:
10264: Note: CC 1, RIO 0, Rope Unit 0 RU_PRI_ERR_LOG CSR not stored. -
10265: Analysis skipped.
10266: Note: CC 1, RIO 0, Rope Unit 1 RU_PRI_ERR_LOG CSR not stored. -
10267: Analysis skipped.
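
To compare the raw CSR values with the analyzer's text, the set bits can be listed directly. The sketch below is only bit arithmetic; how the hardware assigns meaning to each bit is not stated in the trace, so any mapping from bit position to wire or row number is an assumption.

```python
def set_bits(value: int) -> list:
    """Return the positions of all 1 bits in value, lowest bit first."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


# RIN_SGL_ECC_WIRE_LOG CSR from the trace above
wire_log_bits = set_bits(0x0000000004002000)
# RIN_ERR_PRI_MODE CSR from the trace above
pri_mode_bits = set_bits(0x0000000000000008)

print(wire_log_bits)
print(pri_mode_bits)
```

For RIN_SGL_ECC_WIRE_LOG this gives bits 13 and 26. Bit 13 coincides with the "wire number 13" reported by the analyzer; whether bit 26 is what encodes "data row 2" is a guess that the trace itself does not confirm.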


What do you think? Thanks.
Phil uk
Honored Contributor
Solution

Re: superdome EMS report I/O link error

Hi,

Call HP and have them send a CE to site.
I would recommend that the CE run Scan-on-the-fly (SOTF) from the SuperDome Management Station (SMS). This may give more clues as to where the problem actually is, i.e. the cell board, REO, backplane or I/O backplane.
If SOTF finds errors, it may be necessary to arrange a complete outage on the machine, depending on what the problem turns out to be.
(Don't try to reseat the REO cable while the SuperDome is powered on, as you may damage the backplane or bend pins on it.)
Note that it is very unusual for REO cables to fail.
Regards,
Phil
Genesis_1
Occasional Advisor

Re: superdome EMS report I/O link error

Thank you, Phil. I'll take your recommendations. In addition, the problem has existed for several years and keeps occurring about once per day, but the Superdome is running normally. The HP CE didn't solve the problem during the warranty period. Maybe they didn't pay enough attention to it.
Genesis_1
Occasional Advisor

Re: superdome EMS report I/O link error

Hi Phil, I want to know how to use SOTF to diagnose the Superdome REO problem.
It's urgent. Thanks.
Phil uk
Honored Contributor

Re: superdome EMS report I/O link error

Hi,

It is a diagnostic test (called JUST) that you run from the SMS. How you run the tests also depends on what sort of SMS you have (Unix SMS or PC-based SMS).
It is quite detailed and should be run by HP CEs.
You also need to decode the output of the diagnostics to work out what the problem may be.
If you make a mistake in the sequence of steps for setting up the JUST tests (starting the correct daemons, etc.), you can cause all nPARs/vPARs to crash.

I would strongly suggest you get HP onsite to do this.

Regards,
Phil
Genesis_1
Occasional Advisor

Re: superdome EMS report I/O link error

Thanks. I know JUST, but I don't know how to use SOTF. My SMS is an A500, and the Superdome can be shut down.
Phil uk
Honored Contributor

Re: superdome EMS report I/O link error


If you know JUST and you CAN shut it down, then do the offline version; it is much better.

A500 SMS, so it must be a legacy 'Dome, I guess?

Log on to the SMS with the hduser account
(the password is HP proprietary, so if you know JUST then I guess you know the password)

ONLY do this if partitions are DOWN !!
run
# just -s eg, priv-01, priv-02 etc

Once at the JUST prompt on the SMS, select the tests you wish to run.

Don't forget to power off the whole machine (+ IOX chassis) at the breakers for 1 minute after you've run the test, then power it back on again.

Cheers,
Phil
Phil uk
Honored Contributor

Re: superdome EMS report I/O link error


...not to mention that you need to decode all of that output if it picks up errors 8-(

I still recommend you get HP to do it.
Genesis_1
Occasional Advisor

Re: superdome EMS report I/O link error

Is it Aclts?
reo_link_ac_test -dt :0:9:cc -rt :0:9:reo
right?
Phil uk
Honored Contributor

Re: superdome EMS report I/O link error


Is it Aclts?
>> This is not the password for the hduser account

reo_link_ac_test -dt :0:9:cc -rt :0:9:reo
>> I'm not familiar with this.
What does it do?
Where are you running this command from?