Re: trying to find cause of syslog errors...

Jenni Wolgast · ‎09-27-2007

After a week of escalating this with HP support I still cannot get an explanation for this so I'm hoping that maybe someone here might have encountered something like this before...

Last week in the middle of the afternoon, our rp7400 running HP-UX 11.11 started running really slow and disk util spiked to 100%... Nothing was running that should have caused this but it cleared up after about a half hour so we didn't worry about it too much. That night when we tried to restore some large files, the problem came back and a restore that normally takes 4 hours wasn't even half done after 4 hours. We have a VA7110 and normally when this starts happening it means we lost a disk. None of the error lights were on on any of the disks and armdsp showed everything was happy.

When I looked in syslog though there were all kinds of ugly errors, here's a chunk from this morning:

Sep 27 08:13:36 DEVUX vmunix:
Sep 27 08:13:36 DEVUX vmunix: SCSI: Write error -- dev: b 31 0x030200, errno: 126, resid: 8192,
Sep 27 08:13:36 DEVUX vmunix: blkno: 12057216, sectno: 24114432, offset: 3756654592, bcount: 8192.
Sep 27 08:13:36 DEVUX vmunix: SCSI: Async write error -- dev: b 31 0x030200, errno: 126, resid: 4096,
Sep 27 08:13:36 DEVUX vmunix: blkno: 2119788, sectno: 4239576, offset: 2170662912, bcount: 4096.
Sep 27 08:13:36 DEVUX vmunix: blkno: 1968948, sectno: 3937896, offset: 2016202752, bcount: 4096.
Sep 27 08:13:36 DEVUX vmunix: SCSI: Read error -- dev: b 31 0x030200, errno: 126, resid: 1024,
Sep 27 08:13:36 DEVUX vmunix: SCSI: Async write error -- dev: b 31 0x030200, errno: 126, resid: 4096,
Sep 27 08:13:36 DEVUX vmunix: blkno: 8, sectno: 16, offset: 8192, bcount: 1024.
Sep 27 08:13:36 DEVUX vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0x000000007666a000), from raw device 0x1f030200 (with priority: 0, and current flags: 0x40) to raw device 0x1f0d0200 (with priority: 1, and current flags: 0x0).
Sep 27 08:13:36 DEVUX vmunix:
Sep 27 08:13:36 DEVUX vmunix: LVM: VG 64 0x010000: PVLink 31 0x030200 Failed! The PV is still accessible.
Sep 27 08:13:41 DEVUX vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0x000000007666a000), from raw device 0x1f0d0200 (with priority: 1, and current flags: 0x0) to raw device 0x1f030200 (with priority: 0, and current flags: 0x80).
Sep 27 08:13:41 DEVUX vmunix: LVM: VG 64 0x010000: PVLink 31 0x030200 Recovered.
Sep 27 08:25:09 DEVUX syslog: CVSDM; INFORMATION Event Code=400; Description=348:FRONTEND_FC_ABTS_EVENT_EH This error code indicates that the Host sent a Fibre Channel ABTS (Abort Sequence) BLS frame to the abort an IO. The array will log this event for informational and debug purposes only. It does not necessarily indicate a problem with the array.; 0x17028a: 0x0070; frontend osPortCB: ABTSAbort-JCB; NPortID=0x010100,OXID=0x0070,LUN=0x4006; CDB=0x2a000519929000001000000000000000; et=10.0 qd=22 hwm=47 jl=0x00002000 dl=0x00002000 ro=0x00000000 js=1; enclosureId/slot/component/subcomponent : 0x00/0x00/0x70/0xff; controller tick : 23258665185585; serialNum/moduleId/processId : 00PR00D53003/0x47/0xffffffff; ; Hardware Address=1/2/0/0.1.7.255.0.0.0; FRU Location=M/C1.H1; Vendor ID=HP; Model ID=A6189B; Product S/N=00USE5030E5W; Latest information on this event at http://docs.hp.com/hpux/content/hardware/ems/RemoteMonitor.htm#400
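The CDB field in that CVSDM event can be decoded to see which I/O the host aborted. A minimal sketch, assuming the standard 10-byte SCSI CDB layout (opcode, flags, 4-byte LBA, group, 2-byte transfer length, control) and a 512-byte block size; the function name is made up for illustration:

```python
import struct

# Only the two 10-byte opcodes relevant here; anything else is reported as unknown.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}

def decode_cdb10(cdb_hex):
    """Decode the first 10 bytes of a CDB given as a hex string (optional 0x prefix)."""
    if cdb_hex.lower().startswith("0x"):
        cdb_hex = cdb_hex[2:]
    raw = bytes.fromhex(cdb_hex)
    # Big-endian: opcode, flags, LBA (32 bit), group, transfer length (16 bit), control
    opcode, flags, lba, group, blocks, control = struct.unpack(">BBIBHB", raw[:10])
    return {
        "opcode": opcode,
        "name": OPCODES.get(opcode, "unknown"),
        "lba": lba,
        "blocks": blocks,
    }

# CDB copied from the CVSDM event in the syslog excerpt
cdb = decode_cdb10("0x2a000519929000001000000000000000")
print(cdb["name"], "lba", hex(cdb["lba"]), "blocks", cdb["blocks"])
# Assumed 512-byte blocks: compare this against dl=0x00002000 in the same event
print("bytes", hex(cdb["blocks"] * 512))
```

If the 512-byte assumption holds, the transfer length lines up with the dl value in the same event, which suggests the aborted I/O was an ordinary 8 KB write rather than anything exotic.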

It isn't always the same PV and the problem seems to come and go, it usually shows up between 8 and 9 am then again mid-afternoon and again at night when we do restores... Any ideas or suggestions would be much appreciated, thanks!!!!
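Since the problem comes and goes at certain times of day and on different PVs, it can help to tally the PVLink messages per link and per hour. A rough sketch, assuming the message format is exactly as in the excerpt above; the function name is hypothetical and the two sample lines are copied from that excerpt:

```python
import re
from collections import Counter

# Matches the "PVLink ... Failed!/Recovered." lines as they appear in the excerpt
PVLINK = re.compile(
    r"^\w{3}\s+\d+\s+(\d{2}):\d{2}:\d{2}\s+\S+\s+vmunix: LVM: VG \d+ 0x[0-9a-fA-F]+: "
    r"PVLink \d+ (0x[0-9a-fA-F]+) (Failed|Recovered)"
)

def count_pvlink_events(lines):
    """Count PVLink events keyed by (minor number, hour of day, state)."""
    counts = Counter()
    for line in lines:
        m = PVLINK.match(line)
        if m:
            hour, minor, state = m.groups()
            counts[(minor, hour, state)] += 1
    return counts

SAMPLE = [
    "Sep 27 08:13:36 DEVUX vmunix: LVM: VG 64 0x010000: PVLink 31 0x030200 Failed! The PV is still accessible.",
    "Sep 27 08:13:41 DEVUX vmunix: LVM: VG 64 0x010000: PVLink 31 0x030200 Recovered.",
]

for key, n in sorted(count_pvlink_events(SAMPLE).items()):
    print(key, n)
```

Feeding it the whole syslog file instead of SAMPLE should show whether the failures cluster on one path or one time window, or are spread across all links.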

Tim Nelson · ‎09-27-2007

It looks like LVM is switching from one path to the other. If that is the case, I would guess a Fibre Channel issue. Check the local HBA status with fcmsutil, then check the switch port for errors, and finally consider the array's host adapter.

Without knowing your SAN topology, there could also be response time issues. Is this a flat SAN or edge/core?

How about the PV timeout settings? Are they set at 90 seconds or above?

Torsten. · ‎09-27-2007

The VA complains about the C1 hostport 1, so it looks like the fibre connection between the array and the server gets lost. For this reason, the PV (the LUN in the VA) gets disconnected via this path.

It may be a bad controller port or cable.
Have HP support check the VA logs.

Hope this helps!
Regards
Torsten.

__________________________________________________
There are only 10 types of people in the world -
those who understand binary, and those who don't.
__________________________________________________

Patrick Wallek · ‎09-27-2007

Some things I would check:

1) The controllers on the VA
2) Your fibre cards on the server
3) If you are going through a fibre switch, check the switch ports.

The fact that LUN switches are occurring indicates that there is a loss of communication somewhere.

Jenni Wolgast · ‎09-27-2007

ok, forgot an important piece of info, I made them come out and replace controller 1 on the VA on Friday since that CVSDM error was much more frequent even when we weren't noticing performance issues....

Jenni Wolgast · ‎09-27-2007

More info:
Server has two fiber cards that connect to a fiber switch then to the VA... Here is the result of the fcmsutil:

[root@DEVUX]:/home/root ->ioscan -funC fc
Class     I  H/W Path  Driver  S/W State  H/W Type   Description
=================================================================
fc        0  0/8/0/0   td      CLAIMED    INTERFACE  HP Tachyon XL2 Fibre Channel Mass Storage Adapter
                       /dev/td0
fc        1  1/2/0/0   td      CLAIMED    INTERFACE  HP Tachyon XL2 Fibre Channel Mass Storage Adapter
                       /dev/td1
[root@DEVUX]:/home/root ->cd /opt/fcms/bin
[root@DEVUX]:/opt/fcms/bin ->fcmsutil /dev/td0

Vendor ID is = 0x00103c
Device ID is = 0x001029
XL2 Chip Revision No is = 2.3
PCI Sub-system Vendor ID is = 0x00103c
PCI Sub-system ID is = 0x00128c
Topology = PTTOPT_FABRIC
Link Speed = 2Gb
Local N_Port_id is = 0x010100
N_Port Node World Wide Name = 0x50060b00002273af
N_Port Port World Wide Name = 0x50060b00002273ae
Driver state = ONLINE
Hardware Path is = 0/8/0/0
Number of Assisted IOs = 214402553
Number of Active Login Sessions = 1
Dino Present on Card = NO
Maximum Frame Size = 2048
Driver Version = @(#) libtd.a HP Fibre Channel Tachyon
TL/TS/XL2 Driver B.11.11.13 (AR0612) /ux/kern/kisu/TL/src/common/wsio/td_glue.c:
Sep 15 2006, 18:35:32

[root@DEVUX]:/opt/fcms/bin ->fcmsutil /dev/td1

Vendor ID is = 0x00103c
Device ID is = 0x001029
XL2 Chip Revision No is = 2.3
PCI Sub-system Vendor ID is = 0x00103c
PCI Sub-system ID is = 0x00128c
Topology = PTTOPT_FABRIC
Link Speed = 2Gb
Local N_Port_id is = 0x010200
N_Port Node World Wide Name = 0x50060b0000227389
N_Port Port World Wide Name = 0x50060b0000227388
Driver state = ONLINE
Hardware Path is = 1/2/0/0
Number of Assisted IOs = 2283341
Number of Active Login Sessions = 1
Dino Present on Card = NO
Maximum Frame Size = 2048
Driver Version = @(#) libtd.a HP Fibre Channel Tachyon
TL/TS/XL2 Driver B.11.11.13 (AR0612) /ux/kern/kisu/TL/src/common/wsio/td_glue.c:
Sep 15 2006, 18:35:32

John Guster · ‎09-27-2007

Run fcmsutil /dev/td0 stat -s to check whether there are any error counts.

Jenni Wolgast · ‎09-27-2007

fcmsutil /dev/td0 stat -s
Thu Sep 27 12:33:57 2007
Channel Statistics

Statistics From Link Status Registers ...
Loss of signal      0     Bad Rx Char         255
Loss of Sync        4     Link Fail           0
Received EOFa       0     Discarded Frame     0
Bad CRC             0     Protocol Error      0
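A single reading of these counters is hard to judge on its own, because they accumulate over time. What matters more is whether they climb while the slowdown is happening. A small sketch for comparing two readings, assuming the output format is as pasted above (two "name value" pairs per line after the "Link Status Registers" header); the function names are made up, and the second reading is invented purely to show the comparison:

```python
import re

# One "counter name" followed by its integer value; two such pairs share a line
PAIR = re.compile(r"([A-Za-z][A-Za-z ]*?)\s+(\d+)")

def parse_link_stats(text):
    """Return {counter name: value} for lines after the Link Status Registers header."""
    counters = {}
    in_section = False
    for line in text.splitlines():
        if "Link Status Registers" in line:
            in_section = True
            continue
        if in_section:
            for name, value in PAIR.findall(line):
                counters[name.strip()] = int(value)
    return counters

def increased(before, after):
    """Counters that grew between two readings, with the amount of growth."""
    return {k: after[k] - before[k] for k in after if k in before and after[k] > before[k]}

SAMPLE = """Thu Sep 27 12:33:57 2007
Channel Statistics

Statistics From Link Status Registers ...
Loss of signal 0 Bad Rx Char 255
Loss of Sync 4 Link Fail 0
Received EOFa 0 Discarded Frame 0
Bad CRC 0 Protocol Error 0
"""

first = parse_link_stats(SAMPLE)
later = dict(first)
later["Loss of Sync"] += 2  # hypothetical second reading, not real data
print(increased(first, later))
```

Taking one reading before the usual 8 am slowdown and one during it would show whether td0 or td1 is actually accumulating link errors at that time.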

Don Wilt · ‎09-27-2007

Not sure if this helps, but did you happen to add more disks to the VA7110? If your utilization falls below 50%, the VA goes into a 'recovery' mode (I think I'm using the correct term). If so, just create a LUN that puts the allocation over the 50% threshold. Hope this helps.

skt_skt · ‎09-27-2007

Paste the full output of

fcmsutil /dev/tdx stat
fcmsutil /dev/tdx vpd

Is the SAN switch a SPOF here? Or is it:
server -> connected through two FC cards -> connected to two different switches -> VA.

If it is going through two switches, I would suspect the VA itself more.

Find the disk/LUN which represents 0x030200. Let the SAN guys have a close look at the RAID group/disk group that 0x030200 belongs to.

See whether there are any events reported from the FC cards (/var/opt/resmon/log/event.log).

For event descriptions...
http://docs.hp.com/en/diag/ems/dm_TL_adapter.htm
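To map 0x030200 from the syslog lines to a device file name, the minor number can be split into its fields. This is only a sketch, assuming the usual legacy HP-UX disk minor layout (8-bit card instance, 4-bit target, 4-bit LUN, 8 option bits) and an 8-bit major in the top byte of the device number; the function name is hypothetical, and the result should be confirmed against ll /dev/dsk or lssf on the box:

```python
def decode_dev(dev):
    """Split a 32-bit device number into major/minor and a cXtYdZ style name."""
    major = (dev >> 24) & 0xFF
    minor = dev & 0xFFFFFF
    instance = (minor >> 16) & 0xFF   # card instance, the X in cXtYdZ
    target = (minor >> 12) & 0xF      # SCSI target, the Y
    lun = (minor >> 8) & 0xF          # LUN, the Z
    return {
        "major": major,
        "minor": minor,
        "name": "c%dt%dd%d" % (instance, target, lun),
    }

# The two device numbers from the "LVM: Performed a switch" message in syslog
for dev in (0x1F030200, 0x1F0D0200):
    info = decode_dev(dev)
    print(hex(dev), "major", info["major"], "->", info["name"])
```

Under that assumption the primary and alternate links are the same LUN seen through two different card instances, which is what you would expect for a PV with an alternate path.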

Jenni Wolgast · ‎10-01-2007

It ended up being another "magic" number that we hit... We had a problem before with a huge system slow down after adding a ton more disks because we hit the line where it tried to go from AutoRAID to RAID 1+0... This one appears to be a ceiling where we can only use a certain amount of our available space... We had a 550 Mb increase in used space which just happened to push us over some invisible limit... Once I purged some old files and got us back down to where we were before we started having issues, everything sped right back up to normal speed.... I really wish there was some published manual or something that lets you know the min and max amount of your available space you can use and have to have allocated before your system grinds to a halt....

Thanks to everyone for the suggestions though!
