HPE 9000 and HPE e3000 Servers

fcmsutil reports massive errors

 
Stefan Schulz
Honored Contributor


Hi all,

I have an RP3440 that shows errors like the following in syslog.log:

Jun 13 16:07:22 servora vmunix: DIAGNOSTIC SYSTEM WARNING:
Jun 13 16:07:22 servora vmunix: The diagnostic logging facility is no longer receiving excessive
Jun 13 16:07:22 servora vmunix: errors from the I/O subsystem. 36 I/O error entries were lost.
Jun 13 16:07:22 servora vmunix: The diagnostic logging facility has started receiving excessive
Jun 13 16:07:22 servora vmunix: errors from the I/O subsystem. I/O error entries will be lost
Jun 13 16:07:22 servora vmunix: until the cause of the excessive I/O logging is corrected.

Logtool in STM points to an FC adapter at this hardware path:

Entry Type: I/O Error
Entry logged on Tue Jun 13 13:13:14 2006
Entry id: 0x448e9dca00000000

Device Path: 0/4/1/0
Product: Fibre Channel Interface
Product Qualifier: HP6795A_Tachyon_XL2
Logger: td

Running fcmsutil against this device, which is /dev/td1, then shows massive error counts in the stats:

server:/opt/fcms/bin# ./fcmsutil /dev/td1 stat
Tue Jun 13 14:59:30 2006
Channel Statistics

Statistics From Link Status Registers ...
Loss of signal              4194     Bad Rx Char        188881756
Loss of Sync            15855508     Link Fail            3442480
Received EOFa                  0     Discarded Frame            0
Bad CRC                        0     Protocol Error             0


Fortunately, this server is connected to an EMC CX300 through two identical Tachyon FC adapters, with EMC PowerPath and the Navisphere client installed, so we have no downtime.

I already checked the McData SAN switches, moved to another port on the switch, and changed the cable.

I still have no link on this FC adapter.

Before shutting down the server and replacing the card, I would like to check whether the fault really is in this FC adapter.

How can I be sure that this adapter is dead?
Is there a way to "restart" or "reactivate" this adapter in the hope that the error goes away?

Any help is highly appreciated.

Kind Regards

Stefan
No Mouse found. System halted. Press Mousebutton to continue.
Michael Steele_2
Honored Contributor

Re: fcmsutil reports massive errors

Use fcmsutil test:

fcmsutil device_file test remote-N-Port-ID

To get the remote N_Port ID:

fcmsutil /dev/td? devstat all | more

To find the /dev/td device file, use ioscan:

ioscan -funC fc

Here's the link to the fcmsutil man page:

http://docs.hp.com/en/B2355-60103/fcmsutil.1M.html
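
The steps above (list the remote ports with devstat, then send each one a test frame) could be chained roughly as follows. This is only a sketch: `nport_ids` is a made-up helper name, and it assumes that `devstat all` prints one header line per remote port of the form `Device Statistics for Nport_id 0x...`.

```shell
#!/bin/sh
# Sketch only: nport_ids is a hypothetical helper, not part of fcmsutil.
# It reads `fcmsutil <dev> devstat all` output on stdin and prints each
# remote N_Port ID, assuming header lines that look like
#   Device Statistics for Nport_id 0x610413
nport_ids() {
    awk '/Device Statistics for Nport_id/ { print $NF }'
}

# Usage on the HP-UX host (commented out so the helper can be tried anywhere):
#   for id in $(fcmsutil /dev/td1 devstat all | nport_ids); do
#       fcmsutil /dev/td1 test "$id"
#   done
```

Only the fcmsutil subcommands already named in this reply are used; the device file /dev/td1 in the usage comment is a placeholder for whatever ioscan reports.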
Support Fatherhood - Stop Family Law
Stefan Schulz
Honored Contributor

Re: fcmsutil reports massive errors

Hi Michael,

The FC adapter in question is the one at hardware path 0/4/1/0, with the device file /dev/td1.

Running fcmsutil /dev/td1 returns:

servora:/var/adm/syslog# fcmsutil /dev/td1

Vendor ID is = 0x00103c
Device ID is = 0x001029
XL2 Chip Revision No is = 2.3
PCI Sub-system Vendor ID is = 0x00103c
PCI Sub-system ID is = 0x00128c
Previous Topology = PTTOPT_FABRIC
Link Speed = UNINITIALIZED
Local N_Port_id is = 0x610c01
N_Port Node World Wide Name = 0x50060b00002d3ffd
N_Port Port World Wide Name = 0x50060b00002d3ffc
Driver state = AWAITING_LINK_UP
Hardware Path is = 0/4/1/0
Number of Assisted IOs = 451593751
Number of Active Login Sessions = 0
Dino Present on Card = NO
Maximum Frame Size = 2048
Driver Version = @(#) libtd.a HP Fibre Channel Tachyon TL/TS/XL2 Driver B.11.11.12 (AR1204) /ux/kern/kisu/TL/src/common/wsio/td_glue.c: Oct 11 2004, 14:45:41

But I have to admit that I didn't understand which N_Port_id to use for the test.

The FC adapter in question has the N_Port_id 0x610c01; the second FC adapter, which is doing the work right now, has the N_Port_id 0x610c13 and the device file /dev/td0 (as shown by fcmsutil).

The command fcmsutil /dev/td1 devstat all lists three other Nport_ids:

Device Statistics for Nport_id 0x610413
Device Statistics for Nport_id 0x610513
Device Statistics for Nport_id 0x610713


If I run fcmsutil test against them, I get the following:

servora:/var/adm/syslog# fcmsutil /dev/td1 test 0x610413
Error: Unable to login to Device at nport_id 0x610413
servora:/var/adm/syslog# fcmsutil /dev/td1 test 0x610513
Error: Unable to login to Device at nport_id 0x610513
servora:/var/adm/syslog# fcmsutil /dev/td1 test 0x610713
Error: Unable to login to Device at nport_id 0x610713

Doing the same with the other functioning adapter shows the following:

servora:/var/adm/syslog# fcmsutil /dev/td0 test 0x610513
Sent a Test frame of size 220 bytes to nport_id 0x610513
servora:/var/adm/syslog# fcmsutil /dev/td0 test 0x610413
Sent a Test frame of size 220 bytes to nport_id 0x610413


This is not surprising, as fcmsutil /dev/td1 shows "Driver state = AWAITING_LINK_UP".

It looks as if the FC adapter is OK but can't get a link to the switch. So my question is: is the FC adapter /dev/td1 good or bad?
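
Since the test frames can only succeed once the link is up, the driver state could be checked before running any test. A minimal sketch, assuming the `Driver state = ...` line format shown in the fcmsutil output above; `driver_state` is a hypothetical helper name, not an fcmsutil feature.

```shell
#!/bin/sh
# Sketch only: driver_state is a hypothetical helper, not part of fcmsutil.
# It reads `fcmsutil <dev>` output on stdin and prints the text after
# "Driver state =", assuming the line format quoted in this post.
driver_state() {
    awk -F'=' '/^[ \t]*Driver state/ {
        gsub(/^[ \t]+|[ \t]+$/, "", $2)   # trim blanks around the value
        print $2
        exit
    }'
}

# Usage on the HP-UX host (commented out so the helper can be tried anywhere):
#   state=$(fcmsutil /dev/td1 | driver_state)
#   [ "$state" = "AWAITING_LINK_UP" ] && echo "td1: link is not up, skipping test"
```

Comparing the value against the other adapter (/dev/td0) gives a reference for what a working port reports on this system.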

Thanks for your help so far.

Kind regards

Stefan
Michael Steele_2
Honored Contributor

Re: fcmsutil reports massive errors

a) Put a loopback plug on the FC port to rule out the fibre cable, or swap in a known-good cable.

b) Paste in the rest of the logtool report. How many errors did the FC adapter accumulate, and in what time frame?

c) Run fcmsutil /dev/td? clear_stat to zero out the counters, then see how they climb. Exercise the link with fcmsutil test.

d) I need to see the 'Loss of signal' and 'Bad Rx Char' counts. Obtain these with stat and devstat:

fcmsutil /dev/td? stat
fcmsutil /dev/td? devstat all
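
Step c) amounts to taking two snapshots of the stat counters and comparing them. A rough sketch of that comparison follows; `fc_counter` and `fc_delta` are hypothetical helper names, and the parsing assumes each counter name in the `stat` output is followed directly by its value, as in the statistics quoted in the first post.

```shell
#!/bin/sh
# Sketch only: fc_counter and fc_delta are hypothetical helpers.

# Print the number that follows counter name $1 in `fcmsutil <dev> stat`
# output read from stdin, e.g. "Bad Rx Char" or "Link Fail".
# The name is matched as a plain substring, so pass the full label.
fc_counter() {
    awk -v name="$1" '
        {
            pos = index($0, name)
            if (pos == 0) next
            rest = substr($0, pos + length(name))
            if (match(rest, /^[ \t]*[0-9]+/)) {
                val = substr(rest, RSTART, RLENGTH)
                gsub(/[ \t]/, "", val)
                print val
                exit
            }
        }'
}

# Print how much counter $1 grew between two saved stat outputs ($2, $3).
# A counter that is missing from a file is treated as 0.
fc_delta() {
    before=$(fc_counter "$1" < "$2")
    after=$(fc_counter "$1" < "$3")
    echo $(( ${after:-0} - ${before:-0} ))
}

# Usage on the HP-UX host (commented out; file names are placeholders):
#   fcmsutil /dev/td1 clear_stat
#   fcmsutil /dev/td1 stat > /tmp/td1.before
#   sleep 60
#   fcmsutil /dev/td1 stat > /tmp/td1.after
#   fc_delta "Bad Rx Char" /tmp/td1.before /tmp/td1.after
```

Running the same comparison against the working adapter over the same interval gives a baseline for how fast the counters grow on a healthy link.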

Stefan Schulz
Honored Contributor

Re: fcmsutil reports massive errors

Hi again,

Here is the requested information. I put it in a separate text file.

We already changed the FC cable and the port on the SAN switch, with no success. At the moment it looks to me as if the card itself is good but the module that converts the electrical signal to an optical one is dead.

Is there a way to check this? Can this module be replaced? (I could not find any hint about this on docs.hp.com.)

Kind regards

Stefan
Stefan Schulz
Honored Contributor

Re: fcmsutil reports massive errors

After some more investigation, I am sure it has to be the GBIC module. This "should" be replaceable from the outside, even while the server is running.

I just can't see how. There has to be a special trick, or I need a special tool for this.

Does anyone know how to replace the GBIC of an A6795A Tachyon FC adapter?

kind regards

Stefan
Ninad_1
Honored Contributor

Re: fcmsutil reports massive errors

Stefan,

I am not familiar with the card you mentioned, but as far as I am aware, no separate GBIC module is needed at the FC adapter end because it is built in. You just plug in the FC cable. A GBIC is needed at the SAN switch end, and that one can be pulled out.

Regards,
Ninad
Stefan Schulz
Honored Contributor

Re: fcmsutil reports massive errors

Hi,

Yes, it is built in, but I have also been told that it is replaceable. These laser modules seem to be the weak point of FC adapters, so they are normally easy to replace.

In our case the PCI card itself seems to be OK, but the laser module (GBIC) seems to be broken.

If I could replace this module from the outside, there would be no downtime.

Regards

Stefan
Michael Steele_2
Honored Contributor

Re: fcmsutil reports massive errors

From your logtool report:

Starting Date: Tue Jun 13 13:13:14 2006

Ending Date: Tue Jun 13 14:31:48 2006

In 78 minutes, logtool reported 100 errors for 0/4/1/0 (HP6795A_Tachyon_XL2, logger td), so this is the problem device. fcmsutil also reported 23,763 Bad_Rx_Char one minute after clear_stat on /dev/td1, along with 432 Link Fails on the same device. Verify in ioscan that /dev/td1 is 0/4/1/0, so that fcmsutil and logtool are pointing at the same adapter.
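
To put those figures on a common footing, they can be turned into rates. A small sketch of the arithmetic, taking the "one minute" interval as exactly 60 seconds (an approximation); `per_unit` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: per_unit is a hypothetical helper for the arithmetic above.
# Print count ($1) divided by an interval ($2), rounded to two decimals.
# The interval must be non-zero.
per_unit() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.2f\n", n / t }'
}

per_unit 100 78      # logtool entries per minute over the 78-minute window
per_unit 23763 60    # Bad_Rx_Char per second in the minute after clear_stat
per_unit 432 60      # Link Fails per second over the same minute
```

This prints 1.28, 396.05, and 7.20: roughly 1.3 logged errors per minute, close to 400 bad received characters per second, and about 7 link failures per second.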

Think about replacing the whole HBA and not just the GBIC:

http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=1029139
Sandman!
Honored Contributor

Re: fcmsutil reports massive errors

IMHO, your HBA might be fine. The problem could lie in the SAN fabric, in the storage layer, or in the cables (between server and switch, between switch and array, or both), which may be bad and need replacing. Could you attach the output of the following:

# ioscan -funC disk -H 0/4/1/0

~thanks