System Administration

Can't identify which disk is failing

 
SOLVED
Waqar Razi
Regular Advisor

Can't identify which disk is failing

In the syslog, I notice the following error messages:

Jun 11 11:37:20 HCR vmunix: LVM: VG 64 0x040000: PVLink 31 0x0e0700 Failed! The PV is still accessible.
Jun 11 11:35:02 HCR su: + tty?? root-ccuser
Jun 11 11:37:40 HCR vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0x0000000072d4c000), from raw device 0x1f0d1000 (with priority: 1, and current flags: 0x0) to raw device 0x1f0e1000 (with priority: 0, and current flags: 0x80).
Jun 11 11:37:40 HCR vmunix: LVM: VG 64 0x040000: PVLink 31 0x0e0700 Recovered.
Jun 11 11:37:40 HCR vmunix: LVM: VG 64 0x040000: PVLink 31 0x0e1000 Recovered.
Jun 11 11:38:50 HCR vmunix: LVM: VG 64 0x010000: PVLink 31 0x0e0100 Failed! The PV is still accessible.
Jun 11 11:39:00 HCR vmunix: LVM: VG 64 0x010000: PVLink 31 0x0e0100 Recovered.
Jun 11 11:39:02 HCR su: + tty?? root-ccuser
Jun 11 11:48:45 HCR vmunix: LVM: VG 64 0x040000: PVLink 31 0x0d1000 Failed! The PV is still accessible.
Jun 11 11:50:02 HCR su: + tty?? root-ccuser
Jun 11 11:52:39 HCR above message repeats 7 times
Jun 11 11:53:45 HCR vmunix: LVM: VG 64 0x040000: PVLink 31 0x0d1000 Recovered.
Jun 11 11:53:45 HCR vmunix: LVM: VG 64 0x040000: PVLink 31 0x0e0700 Failed! The PV is still accessible.
Jun 11 11:54:02 HCR su: + tty?? root-ccuser
Jun 11 11:55:25 HCR vmunix: LVM: VG 64 0x010000: PVLink 31 0x0e0100 Failed! The PV is still accessible.
Jun 11 11:55:35 HCR vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0x0000000072982000), from raw device 0x1f0d0100 (with priority: 0, and current flags: 0x40) to raw device 0x1f0e0100 (with priority: 1, and current flags: 0x80).
Jun 11 11:55:45 HCR vmunix: LVM: VG 64 0x010000: PVLink 31 0x0e0100 Recovered.
Jun 11 11:54:02 HCR su: + tty?? root-ccuser
Jun 11 11:55:45 HCR vmunix: LVM: VG 64 0x010000: PVLink 31 0x0d0100 Failed! The PV is still accessible.
Jun 11 11:58:02 HCR su: + tty?? root-ccuser
Jun 11 11:58:50 HCR vmunix: LVM: VG 64 0x040000: PVLink 31 0x0d1000 Failed! The PV is still accessible.
Jun 11 11:59:10 HCR vmunix: LVM: VG 64 0x040000: PVLink 31 0x0e0700 Recovered.
Jun 11 11:59:10 HCR vmunix: LVM: VG 64 0x040000: PVLink 31 0x0d1000 Recovered.
Jun 11 12:03:02 HCR su: + tty?? root-ccuser
Jun 11 12:03:50 HCR above message repeats 5 times
Jun 11 12:04:14 HCR su: + tty?? root-root
Jun 11 12:05:02 HCR su: + tty?? root-ccuser
Jun 11 12:06:15 HCR vmunix: LVM: VG 64 0x040000: PVLink 31 0x0d1000 Failed! The PV is still accessible.
Jun 11 12:11:15 HCR vmunix: LVM: VG 64 0x040000: PVLink 31 0x0d1000 Recovered.
Jun 11 12:11:15 HCR vmunix: LVM: VG 64 0x040000: PVLink 31 0x0e0700 Failed! The PV is still accessible.

Can anyone please give me a clue? VG 64 0x040000 is vg04, which seems to be OK:

PV Name /dev/dsk/c13t0d7
PV Name /dev/dsk/c14t0d7 Alternate Link
PV Status available
Total PE 23036
Free PE 0
Autoswitch On
Proactive Polling On

PV Name /dev/dsk/c14t1d0
PV Name /dev/dsk/c13t1d0 Alternate Link
PV Status available
Total PE 23036
Free PE 18500
Autoswitch On
Proactive Polling On
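As a side note, on legacy HP-UX /dev/dsk device files the PVLink minor number in those messages encodes the cXtYdZ name directly. This is a sketch of a hypothetical helper (decode_pvlink is not an HP-UX command), assuming the usual 0xIITD00 minor layout; verify the mapping against ioscan output on your own system:

```shell
#!/bin/sh
# Hypothetical helper (assumption: legacy minor layout 0xIITD00, where
# II = controller instance, T = target nibble, D = device nibble):
# translate a PVLink minor such as 0x0e0700 into its cXtYdZ name.
decode_pvlink() {
    hex=${1#0x}                              # strip the 0x prefix
    inst=$((0x$(echo "$hex" | cut -c1-2)))   # controller instance -> cX
    tgt=$((0x$(echo "$hex" | cut -c3)))      # SCSI target -> tY
    dev=$((0x$(echo "$hex" | cut -c4)))      # device/LUN -> dZ
    echo "c${inst}t${tgt}d${dev}"
}
decode_pvlink 0x0e0700   # -> c14t0d7
decode_pvlink 0x0d1000   # -> c13t1d0
```

By that reading, 0x0e0700 is c14t0d7 and 0x0d1000 is c13t1d0, i.e. the paths shown in the vgdisplay output above.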


8 REPLIES
Mel Burslan
Honored Contributor
Solution

Re: Can't identify which disk is failing

Looks like one of the paths to these two disks (most likely LUNs) is intermittently failing and recovering. Since the alternate path is alive, the PVs stay available. Make sure nobody is changing the SAN zoning in your environment, as I presume these are LUNs on a SAN. Also make sure you do not have a bad fiber-optic cable, adapter, or slot on your fiber-optic switches.
________________________________
UNIX because I majored in cryptology...
Michal Kapalka (mikap)
Honored Contributor

Re: Can't identify which disk is failing

hi,

Check the syslog for EMS messages:

grep EMS /var/adm/syslog/syslog.log

ioscan -fn | grep NO_HW

Also check vgdisplay -v for stale extents.

I think this could be a problem with the SAN, HBA, cabling, or storage.

mikap
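Beyond grepping for EMS entries, it can help to tally the Failed/Recovered messages per PVLink to see which path flaps most. A minimal sketch, assuming the standard syslog format quoted above (a few sample lines are embedded so the filter is self-contained; point LOG at /var/adm/syslog/syslog.log on the real system):

```shell
#!/bin/sh
# Count Failed/Recovered events per PVLink minor number.
# Sample lines are embedded so the filter can be demonstrated on its own;
# on a live system set LOG=/var/adm/syslog/syslog.log instead.
LOG=/tmp/pvlink_sample.$$
cat > "$LOG" <<'EOF'
Jun 11 11:37:20 HCR vmunix: LVM: VG 64 0x040000: PVLink 31 0x0e0700 Failed! The PV is still accessible.
Jun 11 11:48:45 HCR vmunix: LVM: VG 64 0x040000: PVLink 31 0x0d1000 Failed! The PV is still accessible.
Jun 11 11:53:45 HCR vmunix: LVM: VG 64 0x040000: PVLink 31 0x0d1000 Recovered.
EOF
summary=$(awk '/LVM: VG .*PVLink/ {
    # the minor number is the second field after the literal "PVLink"
    for (i = 1; i <= NF; i++) if ($i == "PVLink") link = $(i + 2)
    ev = ($0 ~ /Failed/) ? "Failed" : "Recovered"
    n[link " " ev]++
}
END { for (k in n) print k, n[k] }' "$LOG" | sort)
echo "$summary"
rm -f "$LOG"
```

On the full log, a link with many more Failed counts than its partner is the one to chase.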
Waqar Razi
Regular Advisor

Re: Can't identify which disk is failing

Do you know any commands to check the health of the HBAs? Or anything else to check here, since the server is not here and we are monitoring it remotely.
Michal Kapalka (mikap)
Honored Contributor

Re: Can't identify which disk is failing

hi,

To check the HBA, see:

man fcmsutil

For example:

server:/#ioscan -fnC fc
Class I H/W Path Driver S/W State H/W Type Description
===================================================================
fc 0 0/0/2/1/0 td CLAIMED INTERFACE HP Tachyon XL2 Fibre Channel Mass Storage Adapter
/dev/td0
fc 3 0/0/3/1/0 td CLAIMED INTERFACE HP Tachyon XL2 Fibre Channel Mass Storage Adapter
/dev/td3
fc 1 0/0/9/1/0 td CLAIMED INTERFACE HP Tachyon XL2 Fibre Channel Mass Storage Adapter
/dev/td1
fc 2 0/0/10/1/0 td CLAIMED INTERFACE HP Tachyon XL2 Fibre Channel Mass Storage Adapter
/dev/td2
server:/#fcmsutil /dev/td0

Vendor ID is = 0x00103c
Device ID is = 0x001029
XL2 Chip Revision No is = 2.3
PCI Sub-system Vendor ID is = 0x00103c
PCI Sub-system ID is = 0x00128c
Topology = PTTOPT_FABRIC
Link Speed = 2Gb
Local N_Port_id is = 0x610613
N_Port Node World Wide Name = 0x50060b000060019b
N_Port Port World Wide Name = 0x50060b000060019a
Driver state = ONLINE
Hardware Path is = 0/0/2/1/0
Number of Assisted IOs = 941394616
Number of Active Login Sessions = 1
Dino Present on Card = NO
Maximum Frame Size = 2048
Driver Version = @(#) libtd.a HP Fibre Channel Tachyon XL2 Driver B.11.23.0512 $Date: 2005/09/20 12:22:47 $Revision: r11.23/1

mikap
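The field to watch in that output is "Driver state". A sketch of a filter that warns when it is anything other than ONLINE (check_state is a hypothetical helper; a captured sample is embedded here since fcmsutil exists only on HP-UX, where you would feed it from fcmsutil /dev/td0 ... /dev/td3 instead):

```shell
#!/bin/sh
# Hypothetical helper: scan fcmsutil-style output on stdin for the
# "Driver state" line and warn when the value is not ONLINE.
check_state() {
    awk -F= '/Driver state/ {
        gsub(/[[:space:]]/, "", $2)          # trim spaces around the value
        if ($2 == "ONLINE") print "OK: driver state ONLINE"
        else print "WARNING: driver state is " $2
    }'
}
# Embedded sample standing in for `fcmsutil /dev/td0` output:
state=$(check_state <<'EOF'
Topology             = PTTOPT_FABRIC
Driver state         = ONLINE
Hardware Path is     = 0/0/2/1/0
EOF
)
echo "$state"
```

Running it over all four /dev/tdN adapters quickly shows whether any HBA itself is unhappy, as opposed to the link beyond it.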


Mel Burslan
Honored Contributor

Re: Can't identify which disk is failing

This does not look like a permanent problem, so it will be very hard to diagnose. When it comes to checking fiber-optic gear on the server, as mentioned above, fcmsutil is your main tool from HP. For the rest, your hands and eyes are the only way to go in most regular data centers: plug and unplug cables and look for the red light. Places with thousands of servers might have invested in a fiber-optic signal-strength meter to diagnose marginally bad cabling, but I do not have that luxury. We operate on the "when in doubt, replace the cable" principle.

Good luck, as this seems to be a tough one for you.
________________________________
UNIX because I majored in cryptology...
Vishu
Trusted Contributor

Re: Can't identify which disk is failing

Hi,

It seems to be a problem with one of the paths, as the PV is failing on one path and recovering on the other, i.e. the alternate link. So it is most likely a problem with the fibre cable or SCSI path. Check at your storage end.
Basheer_2
Trusted Contributor

Re: Can't identify which disk is failing

Hi Razi,

ioscan -fnkCdisk | grep -iv CLAIMED
(finds the devices that are not CLAIMED)

or: ioscan -fnkCdisk | grep -i NO_HW

vgdisplay -v | grep -e "PV N" -e "PV Status"

This gives you all the PV names and their status.

vgdisplay -v | grep -e "PV N" -e "PV Status" | grep -vi Available

This should give you the one that is not available (i.e. a problem disk).

Once you find it with the commands above, run

pvdisplay on that disk
diskinfo on its raw device

and do a dd if=/dev/.. of=/dev/null bs=1024 count=2000, or do a full read of that disk without the count.
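That dd read test can be looped over every suspect path. A sketch (read-only, output discarded; a scratch file stands in for the raw device so the loop can be shown safely, and the /dev/rdsk paths in the comment are the ones from the vgdisplay output above, to be adjusted to your configuration):

```shell
#!/bin/sh
# Non-destructive read test per path: dd reads only, writes to /dev/null.
# On HP-UX, DEVICES would be the raw device files, e.g.
#   "/dev/rdsk/c13t0d7 /dev/rdsk/c14t0d7 /dev/rdsk/c13t1d0 /dev/rdsk/c14t1d0"
# A scratch file stands in here so the loop itself is demonstrable.
SCRATCH=/tmp/fake_disk.$$
dd if=/dev/zero of="$SCRATCH" bs=1024 count=2000 2>/dev/null
DEVICES="$SCRATCH"
for dev in $DEVICES; do
    if dd if="$dev" of=/dev/null bs=1024 count=2000 2>/dev/null; then
        status=passed
    else
        status=FAILED
    fi
    echo "$dev: read test $status"
done
rm -f "$SCRATCH"
```

A path that throws read errors here (check syslog at the same time) is the one to hand to the storage team.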
johnsonpk
Honored Contributor

Re: Can't identify which disk is failing

Hi,

Check the IO timeout value for your PVs, especially c14t0d7, c14t1d0, c14t0d1, c13t0d1, and c13t1d0:

#pvdisplay /dev/dsk/c14t0d7

and change the PV timeout according to your storage vendor's recommendation:

#pvchange -t 180 /dev/dsk/c14t0d7

EMC usually recommends 180 seconds as the PV timeout. It is good practice to set the IO timeout of SAN PVs to something more than 90 seconds.

Rgds
Johnson
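A sketch that prints the pvchange command for each suspect PV before anything is run (DRY_RUN=1 keeps it harmless; the device list and the 180-second value are assumptions to confirm against your storage vendor's recommendation first):

```shell
#!/bin/sh
# Dry-run sketch: print, then optionally run, pvchange for each PV.
# The device list and the 180 s timeout are assumptions - confirm both
# before setting DRY_RUN=0 on a real system.
DRY_RUN=1
for pv in /dev/dsk/c13t0d7 /dev/dsk/c14t0d7 /dev/dsk/c13t1d0 /dev/dsk/c14t1d0; do
    cmd="pvchange -t 180 $pv"
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $cmd"
    else
        $cmd
    fi
done
```

Printing first makes it easy to review the exact commands over a remote session before committing the change.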