HBA (TL_adapter) problem

yyghp · ‎11-23-2004

One of our servers lost its connection to the SAN last night. I got this from syslog:

Nov 22 19:19:30 srs083 vmunix: 0/2/1/0: Unable to access previously accessed device at nport ID 0xb0100.
Nov 22 19:19:30 srs083 EMS [2456]: ------ EMS Event Notification ------ Value: "CRITICAL (5)" for Resource: "/adapters/events/TL_adapter/0_2_1_0" (Threshold: >= " 3") Execute the following command to obtain event details: /opt/resmon/bin/resdata -R 160956418 -r /adapters/events/TL_adapter/0_2_1_0 -n 160956429 -a
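
The nport ID in the vmunix message is a 24-bit Fibre Channel address. On a switched fabric it is conventionally read as three bytes: domain (the switch), area (usually the switch port) and port. A minimal Python sketch that splits the value so it can be matched against the switch; the function name `split_nport_id` is made up for illustration, and the domain/area/port reading assumes a switched fabric.

```python
def split_nport_id(nport_id: str) -> dict:
    """Split a 24-bit Fibre Channel N_Port ID into its three bytes."""
    value = int(nport_id, 16)  # accepts an optional "0x" prefix
    if not 0 <= value <= 0xFFFFFF:
        raise ValueError("N_Port ID must fit in 24 bits")
    return {
        "domain": (value >> 16) & 0xFF,  # switch domain ID
        "area": (value >> 8) & 0xFF,     # usually the switch port
        "port": value & 0xFF,            # AL_PA on a loop, 0 on a plain N_Port
    }


print(split_nport_id("0xb0100"))
```

For 0xb0100 this gives domain 11 (0x0b), area 1, port 0, which is the place to look on the switch side.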

# /opt/resmon/bin/resdata -R 160956418 -r /adapters/events/TL_adapter/0_2_1_0 -n 160956429 -a

CURRENT MONITOR DATA:

Event Time..........: Mon Nov 22 19:19:30 2004
Severity............: CRITICAL
Monitor.............: dm_TL_adapter
Event #.............: 40
System..............: srs083

Summary:
Adapter at hardware path 0/2/1/0 : Unable to open previously opened target

Description of Error:

lbolt value: 314842115

Unable to access previously accessed target
nport ID = 0xb0100

Probable Cause / Recommended Action:

An attempt to re-open a device which had been opened earlier
has failed.
There should be additional logging messages which will
allow diagnosis of the problem.

Additional Event Data:
System IP Address...: 10.125.20.83
Event Id............: 0x41a2821200000000
Monitor Version.....: B.01.00
Event Class.........: I/O
Client Configuration File...........:
/var/stm/config/tools/monitor/default_dm_TL_adapter.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 1
Associated OS error log entry id(s):
0x41a2821200000000
Additional System Data:
System Model Number.............: 9000/800/rp4440
OS Version......................: B.11.11
EMS Version.....................: A.04.00
STM Version.....................: A.45.00
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/dm_TL_adapter.htm#40

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v

Component Data:
Physical Device Path....: 0/2/1/0
Vendor Id...............: 0x0000103C
Serial Number(WWN)......: 50060B0000255C72

I/O Log Event Data:

Driver Status Code..................: 0x00000028
Length of Logged Hardware Status....: 0 bytes.
Offset to Logged Manager Information: 0 bytes.
Length of Logged Manager Information: 61 bytes.

Manager-Specific Information:

Raw data from FCMS Adapter driver:
00000001 12C41C03 00000001 00000001 000B0100 2F75782F 6B65726E 2F6B6973
752F544C 2F737263 2F636F6D 6D6F6E2F 7773696F 2F74645F 6465762E 63
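
The raw FCMS data above is 61 bytes, matching the "Length of Logged Manager Information" field. Judging only from the bytes themselves (this layout is an inference, not something taken from HP documentation), the first five 32-bit words look like a binary header whose last word is the nport ID, and the rest is plain ASCII. A small Python sketch to decode it; `decode_manager_info` and the five-word header length are assumptions for illustration.

```python
RAW = """
00000001 12C41C03 00000001 00000001 000B0100 2F75782F 6B65726E 2F6B6973
752F544C 2F737263 2F636F6D 6D6F6E2F 7773696F 2F74645F 6465762E 63
"""


def decode_manager_info(raw: str, header_words: int = 5):
    """Return (header words as hex strings, trailing bytes decoded as ASCII)."""
    data = bytes.fromhex("".join(raw.split()))
    split = header_words * 4  # assumed header size: 5 words of 4 bytes
    header = [data[i:i + 4].hex() for i in range(0, split, 4)]
    text = data[split:].decode("ascii", errors="replace")
    return header, text


header, text = decode_manager_info(RAW)
print(header)
print(text)
```

The ASCII part decodes to /ux/kern/kisu/TL/src/common/wsio/td_dev.c, which looks like the driver source file that logged the event rather than additional diagnostic data.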

I can still access the LUNs on our EVA5000, but I found something abnormal:

# spmgr display
Server: srs083 Report Created: Tue, Nov 23 08:46:24 2004
Command: spmgr display
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
Storage: 5000-1FE1-5004-5440
Load Balance: On Auto-restore: Off Balance Policy: Round Robin
Path Verify: On Verify Interval: 30
HBAs: td0 td1
Controller: P5849E1AAQE04W, Operational
P5849E1AAQD02S, Operational
Devices: c12t0d0 c12t0d1 c12t0d2 c12t0d3 c12t0d4 c12t0d5 c12t0d6
c12t0d7

TGT/LUN Device WWLUN_ID H/W_Path #_Paths
0/ 0 c12t0d0 6005-08B4-0010-102C-0000-9000-004F-0000 4
255/255/0/0.0
Controller Path_Instance HBA Preferred? Path_Status
P5849E1AAQE04W no
c4t0d1 td0 YES Active
c9t0d1 td1 YES Active

Controller Path_Instance HBA Preferred? Path_Status
P5849E1AAQD02S no
c8t0d1 td0 no Standby
c5t0d1 td1 no Standby

TGT/LUN Device WWLUN_ID H/W_Path #_Paths
0/ 1 c12t0d1 6005-08B4-0010-102C-0000-9000-0052-0000 4
255/255/0/0.1
Controller Path_Instance HBA Preferred? Path_Status
P5849E1AAQE04W no
c4t0d2 td0 no Standby
c9t0d2 td1 no Standby

Controller Path_Instance HBA Preferred? Path_Status
P5849E1AAQD02S no
c8t0d2 td0 YES Active
c5t0d2 td1 YES Active
...
...

* There are 4 paths in total to each LUN. They are supposed to be one "Active", one "Available" and two "Standby", but now two of them are "Active". There must be something wrong!
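
To compare path states across many LUNs without reading the report by eye, the per-LUN path blocks can be grouped by controller. A rough Python sketch, assuming the line layout shown in the report above (a two-field controller line followed by "device HBA preferred status" lines); `paths_by_controller` is a hypothetical helper and the sample is the first LUN from this post.

```python
import re
from collections import defaultdict

SAMPLE = """\
Controller Path_Instance HBA Preferred? Path_Status
P5849E1AAQE04W no
c4t0d1 td0 YES Active
c9t0d1 td1 YES Active

Controller Path_Instance HBA Preferred? Path_Status
P5849E1AAQD02S no
c8t0d1 td0 no Standby
c5t0d1 td1 no Standby
"""

# A path line: device file, HBA instance, preferred flag, path status.
PATH_RE = re.compile(r"^(c\d+t\d+d\d+)\s+(td\d+)\s+(\S+)\s+(\S+)$")


def paths_by_controller(block: str) -> dict:
    """Map each controller serial to a list of (HBA, path status) tuples."""
    result = defaultdict(list)
    controller = None
    for line in block.splitlines():
        line = line.strip()
        match = PATH_RE.match(line)
        if match and controller is not None:
            result[controller].append((match.group(2), match.group(4)))
        elif len(line.split()) == 2:
            # Two fields: controller serial plus its "preferred" flag.
            controller = line.split()[0]
    return dict(result)


for ctrl, paths in paths_by_controller(SAMPLE).items():
    print(ctrl, paths)
```

In the sample, both Active paths belong to the same controller and go through different HBAs (td0 and td1), while the other controller holds both Standby paths.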

So, could you please tell me what happened and what I can do? Is it a hardware problem with the HBA?
Thanks !

Steven E. Protter · ‎11-23-2004

I think you can call hardware support and have the adapter replaced. It would seem to be quite deceased.

You can check it out with cstm, mstm, or the X-based xstm.

You will find it non-functional and will want to arrange replacement. Since you have an alternate path, you can afford to wait a while.

Just to be sure, I'd make sure the fabric network is working and nobody re-zoned your fiber switch.

SEP

Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com

yyghp · ‎11-23-2004

Hi, I checked it with xstm and it looks fine now. And I don't think anyone re-zoned the switch.
But why does "spmgr" now show two "Active" paths?
Thanks !

Jeff Schussele · ‎11-23-2004

Hi,

Have you checked to see whether that nPortID is not a disk?
Run the following:

fcmsutil /dev/tdX devstat 0xb0100

to obtain info & stats on that nPortID

then to check it run:

fcmsutil /dev/tdX test 0xb0100 1024 Y

Replace X with the appropriate td value and Y with a count value. The 1024 is the packet size in bytes and must be a multiple of four.
You could have an intermittent disk error, which may indicate an imminent disk failure.
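
If you script the test, the size rule is easy to check up front. A small Python sketch that only builds the argument list for the command above; `fcmsutil_test_args` is a made-up helper, and the 220-byte cap is taken from the warning fcmsutil prints later in this thread, so treat it as an assumption for other drivers or versions.

```python
def fcmsutil_test_args(device, nport_id, size, count, max_size=220):
    """Build the argv for 'fcmsutil <device> test <nport_id> <size> <count>'."""
    if size <= 0 or size % 4 != 0:
        raise ValueError("packet size must be a positive multiple of four")
    if count <= 0:
        raise ValueError("count must be positive")
    # Cap the size at the maximum the tool reported it could send.
    size = min(size, max_size)
    return ["fcmsutil", device, "test", nport_id, str(size), str(count)]


print(" ".join(fcmsutil_test_args("/dev/td0", "0xb0100", 1024, 3)))
```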

Rgds,
Jeff

PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!

Mark Greene_1 · ‎11-23-2004

As Jeff wrote, fcmsutil is always the first thing to try when you think you have a problem with an HBA. If that all checks out OK, run "ioscan -fnC disk" and see if that changes anything, and run "insf -e" if necessary to regenerate the device files.

mark

the future will be a lot like now, only later

yyghp · ‎11-23-2004

Hi Jeff, here are the results:

# fcmsutil /dev/td0 devstat 0xb0100
Tue Nov 23 11:09:42 2004
Device Statistics for Nport_id 0x0b0100

Successful opens of the device 311691
Failed Open of previously opened device 10
PLOGIs sent to the device 311688
PLOGIs Timedout 10
PRLIs sent to the device 311678
PRLIs Timedout 0
Bad PRLI resps 10
PRLOs received 0
ADISCs sent to the device 0
ADISCs Timedout 0
Authentication failures 0
LOGOs sent to the device 311678
LOGOs Timedout 0
LOGOs received 0
Target resets sent 0
Target resets failed 0
Implicit Logouts on the device 0
Bad TPRLO resp 0

PLOGI Resps error statistics ...
LS_RJTs recvd for PLOGI/PDISC sent 0
Short PLOGI Resps recvd 0
Low supported version higher than FC-PH-3 0
High supported version lower than FC-PH 4.3 0
No Class3 support 0
PWWN authentication failure 0
NWWN authentication failure 0
PLOGI retries in state dvs_open_plogi_delay 0

I/O Statistics ...
Assisted I/O requests 997744
Timedout I/Os 0
No CDB available for I/O 0
2nd Level Error Recovery 0

I/O Completion Statistics ...
Good I/O completions 568174
Read underflows 429570
Link Failure During FCP_RSP 0
FCP_RSP Overflow 0
Outbound Error For FCP_CMND 0
No resource For IO 0
Channel transient conditions 0
Channel/Device not Online 0
Implicit aborts 0
I/Os aborted 0

I/O Inbound Error Statistics ...
PLDA Non-Compliance 0
Unassisted FCP_RSP 0
Unassisted FCP_DATA 0
Bad Unassisted FCP_DATA 0
Unassisted FCP_CMND 0
UA FCP With Bad OX_ID 0
UA FCP With Bad F_CTL 0
Bad FCP_XFER_RDY 0
FCP_XFER_RDY and SEST invalid 0
FCP_XFER_RDY in invalid state 0
Bad Length For FCP_RSP Frame 0
FCP_RSP in invalid state 0
Bad Category For FCP Frame 0
Bad data_ro In FCP_XFER_RDY 0
Late ABTS Responses Received 0
BA_RJT for ABTS received 0
Bad responses to ABTS 0
IO Underruns 0

Other I/O Event Statistics ...
Unassisted FCP_XFER_RDY 0
Retries For Resources 0
Host Programming Errors 0
I/O Overflow Errors 0
LKF On Outbound Sequence 0
ASN On Outbound Sequence 0
Frame TimeOut Errors 0
Unexpected OCMs for I/Os 0
New I/Os on ERQ At LDN 0
I/Os on SLL At LDN 0
LUP Events For I/O 0
ABTS Sent 0
ABTS Resent 0
Unaccepted ABTS 0
LDNs Before Sending ABTS 0
LUPs For Sending ABTS 0
IFCM While Aborting I/O 0
FCP_RSP While Aborting I/O Requests 0
RRQ sent 0
RRQ send failures 0
RRQ replies recvd 0
-------------------------------------------------------------
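
With this many counters it helps to pull out only the non-zero ones that look like errors. A rough Python sketch, assuming each counter line is "name, whitespace, integer" as in the output above; the keyword list (`fail`, `timedout`, `bad`) is my own guess at what counts as suspicious, and the sample is the first few lines of the devstat output.

```python
import re

SAMPLE = """\
Successful opens of the device 311691
Failed Open of previously opened device 10
PLOGIs sent to the device 311688
PLOGIs Timedout 10
PRLIs sent to the device 311678
PRLIs Timedout 0
Bad PRLI resps 10
"""

COUNTER_RE = re.compile(r"^(.+?)\s+(\d+)$")
SUSPECT_WORDS = ("fail", "timedout", "bad")  # assumption, not an HP list


def parse_counters(text: str) -> dict:
    """Parse 'counter name <value>' lines into a dict of name -> int."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_RE.match(line.strip())
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def nonzero_suspects(counters: dict) -> dict:
    """Keep non-zero counters whose name contains a suspect keyword."""
    return {
        name: value
        for name, value in counters.items()
        if value and any(word in name.lower() for word in SUSPECT_WORDS)
    }


print(nonzero_suspects(parse_counters(SAMPLE)))
```

On the sample this leaves three counters, each at 10, against more than 311,000 successful opens. When feeding it the full fcmsutil output, skip the date header line first, since it also ends in a number.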

# fcmsutil /dev/td0 test 0xb0100 1024 3
WARNING: Can send only up to a max of 220 bytes, continuing
Sent a Test frame of size 220 bytes to nport_id 0x0b0100
Sent a Test frame of size 220 bytes to nport_id 0x0b0100
Sent a Test frame of size 220 bytes to nport_id 0x0b0100

# fcmsutil /dev/td0 test 0xb0100 220 3
Sent a Test frame of size 220 bytes to nport_id 0x0b0100
Sent a Test frame of size 220 bytes to nport_id 0x0b0100
Sent a Test frame of size 220 bytes to nport_id 0x0b0100

So, how do I solve the two "Active" problem?
Thanks !

yyghp · ‎11-23-2004

Hi Mark, I ran:
# ioscan -fnC disk
# insf -e
but I still have the same issue: two "Active" paths in the "spmgr display" output. Is that normal?
I found that all mount points related to the EVA work fine.
So... ?

Alzhy · ‎11-23-2004

This is NORMAL behaviour in a SecurePath + EVA/HSG environment. I receive these messages quite often, but they have been nothing to be alarmed about (so far). And as you can see from your spmgr output, both your HBAs are fine.

* There are 4 paths in total to each LUN. They are supposed to be one "Active", one "Available" and two "Standby", but now two of them are "Active". There must be something wrong!

When you had 1 active, 1 available and 2 standby, your SecurePath configuration was NOT load balancing (meaning it did not use the bandwidth of the two HBAs concurrently). Someone probably set it to load balance, which is the way to go. That is why each of your LUNs now has 2 active paths, via 2 HBAs, to the EVA controller on which that LUN is served.

Note that on the EVA, each LUN can only be served by 1 EVA (HSV100/110) controller, AND SecurePath load balancing means that all of the HBAs on your server access the HSV controller to which a LUN is assigned. On the EVA end, controller/path preferencing should be disabled so that SecurePath manages which HSV controller to communicate with when accessing a LUN.
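
As a toy illustration of the "Round Robin" balance policy shown in the spmgr output: requests are handed to the Active paths in turn and the Standby paths are skipped. This is a conceptual Python model only, not how SecurePath is actually implemented; the `round_robin` function is hypothetical and the path list is copied from the first LUN in the report.

```python
from itertools import cycle

PATHS = [
    ("c4t0d1", "Active"),
    ("c9t0d1", "Active"),
    ("c8t0d1", "Standby"),
    ("c5t0d1", "Standby"),
]


def round_robin(paths, n_requests):
    """Return the path chosen for each of n_requests, cycling over Active paths."""
    active = [name for name, status in paths if status == "Active"]
    if not active:
        raise RuntimeError("no Active path available")
    picker = cycle(active)
    return [next(picker) for _ in range(n_requests)]


print(round_robin(PATHS, 4))
```

With only one Active path the same function always returns that path, which corresponds to the non-balanced layout with one Active and one Available path.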

Hakuna Matata.

yyghp · ‎11-23-2004

Thanks Nelson !
Yes, maybe the load balance setting made those two "Active", I will confirm this later.
But on the SAN side, we still have a connection problem in OVSAM (OpenView Storage Area Manager). The host icon in the diagram is blue instead of green, which indicates a bad connection:

Error: Cannot connect to host: srs083
Details: No HostAgent service contacted on host: 10.125.20.83
May not be started (check hosts logs)

So, what should I do ?
Thanks !

Alzhy · ‎11-23-2004

Check these 3 scripts (all in /sbin/rc3.d):

S076hostwatchdog
S790opendial
S800hostagent

for status and to check if all of the necessary server end processes are up.

Hakuna Matata.

yyghp · ‎11-23-2004

Hi Nelson:
yes, you are right, the two "Active" paths issue is because of the setting of "Load Balance" ! Thanks !

But please help me find out why OVSAM still shows the connection problem. Or do I need to do something to refresh the status of the OVSAM diagram?
Thanks again !

Alzhy · ‎11-23-2004

Simply do a start on the 3 RC scripts I mentioned. If the SMA still does not see the agent, that may very well be due to a change in IP addresses or some problem with the Java VM used by the agent.

You can try re-installing OVSAM or rebooting the client server.

Hakuna Matata.

yyghp · ‎11-23-2004

Hi Nelson, what do you mean by "for status and to check if all of the necessary server end processes are up" for the scripts S076hostwatchdog, S790opendial and S800hostagent?

Thanks !

Alzhy · ‎11-23-2004

/sbin/rc3.d/S076hostwatchdog {start|stop|restart|status}

/sbin/rc3.d/S076hostwatchdog start
...
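
To run the same action against all three agent scripts, the command lines can be generated rather than typed. A minimal Python sketch that only builds the commands; `rc_commands` is a made-up helper, and it assumes S790opendial and S800hostagent accept the same start/stop/restart/status arguments shown for S076hostwatchdog.

```python
RC_DIR = "/sbin/rc3.d"
SCRIPTS = ["S076hostwatchdog", "S790opendial", "S800hostagent"]
ACTIONS = {"start", "stop", "restart", "status"}


def rc_commands(action: str) -> list:
    """Return one [script_path, action] argv per agent script."""
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return [[f"{RC_DIR}/{script}", action] for script in SCRIPTS]


for argv in rc_commands("status"):
    print(" ".join(argv))
    # On the host itself, each argv could be passed to subprocess.run(argv).
```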

Hakuna Matata.

yyghp · ‎11-23-2004

Hi Nelson, do you mean I should reboot the agent host srs083, the one with the connection problem, or the OVSAM server (SAN appliance)?
Thanks !

Alzhy · ‎11-23-2004

Yes your client host and most likely your SMA Windows boxen.

Hakuna Matata.

yyghp · ‎11-24-2004

Thanks Nelson !
Restarting the agent did help ! Now, I can reach the server from OVSAM.
But this morning I found that the error was logged to syslog.log again last night:

Nov 23 17:05:43 srs083 vmunix: 0/2/1/0: Unable to access previously accessed device at nport ID 0xb0100.
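
To see how often the message recurs, and whether it is always the same adapter and nport ID, the syslog lines can be tallied. A short Python sketch, assuming the message format shown in this thread; `tally_reopen_failures` is a hypothetical helper and the sample holds the two occurrences quoted so far.

```python
import re
from collections import Counter

SAMPLE = """\
Nov 22 19:19:30 srs083 vmunix: 0/2/1/0: Unable to access previously accessed device at nport ID 0xb0100.
Nov 23 17:05:43 srs083 vmunix: 0/2/1/0: Unable to access previously accessed device at nport ID 0xb0100.
"""

MSG_RE = re.compile(
    r"^(\w{3}\s+\d+) [\d:]+ \S+ vmunix: (\S+): "
    r"Unable to access previously accessed device at nport ID (0x[0-9a-fA-F]+)"
)


def tally_reopen_failures(log_text: str) -> Counter:
    """Count matching messages per (day, hardware path, nport ID)."""
    tally = Counter()
    for line in log_text.splitlines():
        match = MSG_RE.search(line)
        if match:
            tally[match.groups()] += 1
    return tally


for key, count in tally_reopen_failures(SAMPLE).items():
    print(key, count)
```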

However, I can access the data on LUNs without any problem, and OVSAM still can reach the server, everything looks ok.

How can I stop that message ? Thanks again !

Alzhy · ‎11-24-2004

You can't stop it. Again, these are "normal" messages, because aside from those EVAs you probably have NSRs/MDRs presenting your tape library on your SAN. At times you will also encounter messages complaining about excessive errors from diagnostics. Most can be traced to your LUNs migrating from one controller to another.

However, you may be able to minimize it. Have you read the SecurePath release notes, which suggest turning off EMS on the hardware paths relating to EVA devices? It supposedly reduces the amount of diagnostic and oftentimes erroneous messages coming out of STM/EMS...

Most "classical admins" who have no experience with StorageWorks SANs will often suggest HBA replacement, which is wrong. Remember, SecurePath/StorageWorks are, to a certain degree, not yet totally "friendly" to the HP-UX environment.
HTH.

Hakuna Matata.

yyghp · ‎11-24-2004

Cool ! Thanks again !
