Operating System - HP-UX

PVLinks not found and then recovered

 
dev44
Regular Advisor

PVLinks not found and then recovered

Hi,

We have both HP-UX and Linux RHEL 5.x running. There was a disk that died on the SAN and was rebuilt with the hot spare. HP-UX showed the PVLink failures and then recovered about two minutes later. Everything was fine with the filesystems, LVs, VGs, etc. The Linux systems died: they also reported the errors, did not recover, and had to be rebooted.
So, do you have any ideas of where I can look and what I can fix on the Linux side so that it won't die if this happens again? There must be a reason it didn't recover while the HP-UX systems did. Maybe a timeout or buffer in dm-multipath... or somewhere else?

Thanks,

syslog - HP-UX:

Sep 3 07:13:59 server vmunix: LVM: VG 64 0x500000: PVLink 31 0x060200 Failed! The PV is not accessible.
Sep 3 07:13:59 server vmunix: LVM: VG 64 0x500000: PVLink 31 0x140200 Failed! The PV is not accessible.
Sep 3 07:14:00 server vmunix: LVM: VG 64 0x100000: PVLink 31 0x0e0100 Failed! The PV is not accessible.
Sep 3 07:14:00 server vmunix: LVM: VG 64 0x100000: PVLink 31 0x0c0100 Failed! The PV is not accessible.
Sep 3 07:14:02 server vmunix: LVM: VG 64 0x600000: PVLink 31 0x100300 Failed! The PV is not accessible.
Sep 3 07:14:02 server vmunix: LVM: VG 64 0x600000: PVLink 31 0x0a0300 Failed! The PV is not accessible.
Sep 3 07:15:34 server vmunix: LVM: VG 64 0x500000: PVLink 31 0x060200 Recovered.
Sep 3 07:15:34 server vmunix: LVM: VG 64 0x500000: PVLink 31 0x140200 Recovered.
Sep 3 07:15:34 server vmunix: LVM: VG 64 0x100000: PVLink 31 0x0e0100 Recovered.
Sep 3 07:15:34 server vmunix: LVM: VG 64 0x100000: PVLink 31 0x0c0100 Recovered.
Sep 3 07:15:37 server vmunix: LVM: VG 64 0x600000: PVLink 31 0x100300 Recovered.
Sep 3 07:15:37 server vmunix: LVM: VG 64 0x600000: PVLink 31 0x0a0300 Recovered.

Messages file - linux:
Sep 3 07:14:47 linux kernel: SCSI error : <1 0 3 1> return code = 0x20000
Sep 3 07:14:47 linux kernel: end_request: I/O error, dev sdac, sector 173360919
Sep 3 07:14:47 linux kernel: device-mapper: dm-multipath: Failing path 65:192.
Sep 3 07:14:47 linux kernel: SCSI error : <1 0 3 1> return code = 0x20000
Sep 3 07:14:47 linux kernel: end_request: I/O error, dev sdac, sector 173360959
Sep 3 07:14:47 linux kernel: end_request: I/O error, dev sdac, sector 173360967
Sep 3 07:14:47 linux kernel: SCSI error : <1 0 3 1> return code = 0x20000
Sep 3 07:14:47 linux kernel: end_request: I/O error, dev sdac, sector 173360999

Linux dm-multipath config:
defaults {
        user_friendly_names yes
}

defaults {
        multipath_tool          "/sbin/multipath -v0"
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    multibus
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            /bin/true
        default_features        "0"
        path_checker            readsector0
        rr_min_io               100
        rr_weight               priorities
        failback                immediate
        no_path_retry           fail
        user_friendly_name      yes
}

devnode_blacklist {
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z]"
        devnode "^cciss!c[0-9]d[0-9]*"
}




whatever
11 REPLIES
Ivan Krastev
Honored Contributor

Re: PVLinks not found and then recovered

How is your VG organized? Maybe your LVs on HP-UX are mirrored, but the ones on the Linux box are not.

That may be the reason the HP-UX LVM recovered after the crash.
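
If it helps to rule this in or out, a quick way to compare might be something like the following (vgXX/lvolN is just a placeholder):

# HP-UX: a "Mirror copies" value greater than 0 means the LV is mirrored
lvdisplay /dev/vgXX/lvolN | grep -i "mirror copies"

# Linux LVM2: mirrored LVs show mimage sub-LVs in the devices column
lvs -a -o +devices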



regards,
ivan
TTr
Honored Contributor

Re: PVLinks not found and then recovered

> There was a disk that died on the SAN and was rebuilt with the hot spare

What kind of array was this disk a member of? If a disk dies, the hot spare kicks in, and it gets rebuilt, there should be no I/O interruption and no failures on the server side.

> PVLink 31 0x060200 Failed! The PV is not accessible. ...
> PVLink 31 0x060200 Recovered.

Unless you left out some more errors from the syslog about pv-links, this is NOT a multipath error. This is a simple path that failed and then it came back.

As for why the HP-UX server survived and the Linux one did not, it may have to do with the I/O demand on each server.

When a disk like this fails and there is NO I/O from the server, there will be NO hung processes, and the server can come back when the disk comes back.

If there are processes that hung because they were doing I/O when the disk failed, then most likely those processes will never come back even when the disk comes back, and the server needs rebooting.
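
A rough way to spot this on the Linux box next time, before deciding to reboot, might be to look for processes stuck in uninterruptible sleep (D state), which is the usual sign of I/O hung on a dead path:

# list processes in uninterruptible sleep (D state) together with the
# kernel function they are blocked in
ps -eo pid,stat,wchan:32,args | awk '$2 ~ /D/'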

As I said earlier, you need to dig deeper into what happened when the disk failed, which disk arrays failed, and how they impacted the servers.
Michael Steele_2
Honored Contributor

Re: PVLinks not found and then recovered

Hi

Please paste or attach the output of this report:

# run the Support Tools Manager (cstm) log tool non-interactively
cstm <<-EOF
runutil logtool
rs
EOF

Thanks
Support Fatherhood - Stop Family Law
dev44
Regular Advisor

Re: PVLinks not found and then recovered

Here is the start of the HP-UX syslog errors, in case this helps:

Sep 3 07:13:29 server vmunix: LVM: WARNING: VG 64 0x500000: LV 1: Some I/O requests to this LV are waiting
Sep 3 07:13:29 server vmunix: indefinitely for an unavailable PV. These requests will be queued until
Sep 3 07:13:29 server vmunix: the PV becomes available (or a timeout is specified for the LV).
Sep 3 07:13:30 server vmunix: LVM: WARNING: VG 64 0x100000: LV 3: Some I/O requests to this LV are waiting
Sep 3 07:13:30 server vmunix: LVM: WARNING: VG 64 0x100000: LV 1: Some I/O requests to this LV are waiting
Sep 3 07:13:31 server vmunix: LVM: WARNING: VG 64 0x500000: LV 2: Some I/O requests to this LV are waiting
Sep 3 07:13:32 server vmunix: LVM: WARNING: VG 64 0x600000: LV 1: Some I/O requests to this LV are waiting
Sep 3 07:13:37 server vmunix: LVM: WARNING: VG 64 0x500000: LV 3: Some I/O requests to this LV are waiting
Sep 3 07:13:37 server vmunix:
Sep 3 07:13:37 server vmunix: SCSI: Read error -- dev: b 31 0x060200, errno: 126, resid: 1024,
Sep 3 07:13:37 server vmunix: indefinitely for an unavailable PV. These requests will be queued until
Sep 3 07:13:37 server above message repeats 5 times
Sep 3 07:13:37 server vmunix: blkno: 8, sectno: 16, offset: 8192, bcount: 1024.
Sep 3 07:13:37 server vmunix: the PV becomes available (or a timeout is specified for the LV).
Sep 3 07:13:37 server above message repeats 5 times
Sep 3 07:13:37 server vmunix: LVM: WARNING: VG 64 0x600000: LV 2: Some I/O requests to this LV are waiting
Sep 3 07:13:37 server vmunix: indefinitely for an unavailable PV. These requests will be queued until
Sep 3 07:13:37 server vmunix: the PV becomes available (or a timeout is specified for the LV).
Sep 3 07:13:37 server vmunix: LVM: WARNING: VG 64 0x600000: LV 3: Some I/O requests to this LV are waiting
Sep 3 07:13:37 server vmunix:
Sep 3 07:13:37 server vmunix: SCSI: Read error -- dev: b 31 0x100300, errno: 126, resid: 1024,
Sep 3 07:13:37 server vmunix: indefinitely for an unavailable PV. These requests will be queued until
Sep 3 07:13:37 server vmunix: blkno: 8, sectno: 16, offset: 8192, bcount: 1024.
Sep 3 07:13:37 server vmunix: the PV becomes available (or a timeout is specified for the LV).
Sep 3 07:13:41 server vmunix: LVM: WARNING: VG 64 0x100000: LV 2: Some I/O requests to this LV are waiting
Sep 3 07:13:41 server vmunix: indefinitely for an unavailable PV. These requests will be queued until
Sep 3 07:13:41 server vmunix: the PV becomes available (or a timeout is specified for the LV).
Sep 3 07:13:51 server vmunix:
Sep 3 07:13:51 server vmunix: SCSI: Write error -- dev: b 31 0x0e0100, errno: 126, resid: 8192,
Sep 3 07:13:51 server vmunix: blkno: 2103624, sectno: 4207248, offset: 2154110976, bcount: 8192.
Sep 3 07:13:51 server vmunix: SCSI: Read error -- dev: b 31 0x0e0100, errno: 126, resid: 8192,
Sep 3 07:13:51 server vmunix: blkno: 27807704, sectno: 55615408, offset: 28475088896, bcount: 8192.
Sep 3 07:13:51 server vmunix: SCSI: Async write error -- dev: b 31 0x0e0100, errno: 126, resid: 1024,
Sep 3 07:13:51 server vmunix: blkno: 1183517, sectno: 2367034, offset: 1211921408, bcount: 1024.
Sep 3 07:13:51 server vmunix: SCSI: Write error -- dev: b 31 0x0e0100, errno: 126, resid: 1024,
Sep 3 07:13:51 server vmunix: blkno: 32286088, sectno: 64572176, offset: 33060954112, bcount: 1024.
Sep 3 07:13:51 server vmunix: LVM: WARNING: VG 64 0x100000: LV 4: Some I/O requests to this LV are waiting
Sep 3 07:13:51 server vmunix:
Sep 3 07:13:51 server above message repeats 3 times
Sep 3 07:13:51 server vmunix: indefinitely for an unavailable PV. These requests will be queued until
Sep 3 07:13:51 server vmunix: the PV becomes available (or a timeout is specified for the LV).
Sep 3 07:13:51 server vmunix:
Sep 3 07:13:51 server vmunix: SCSI: Async write error -- dev: b 31 0x0e0100, errno: 126, resid: 8192,
Sep 3 07:13:51 server vmunix: blkno: 5521728, sectno: 11043456, offset: 5654249472, bcount: 8192.
Sep 3 07:13:51 server vmunix: blkno: 5233480, sectno: 10466960, offset: 5359083520, bcount: 8192.
Sep 3 07:13:51 server vmunix: blkno: 2773608, sectno: 5547216, offset: 2840174592, bcount: 8192.
Sep 3 07:13:51 server vmunix: blkno: 2327616, sectno: 4655232, offset: 2383478784, bcount: 8192.
Sep 3 07:13:51 server vmunix: blkno: 2327656, sectno: 4655312, offset: 2383519744, bcount: 8192.
Sep 3 07:13:51 server vmunix: blkno: 2170960, sectno: 4341920, offset: 2223063040, bcount: 8192.
Sep 3 07:13:51 server vmunix: blkno: 2170968, sectno: 4341936, offset: 2223071232, bcount: 8192.
Sep 3 07:13:51 server vmunix: blkno: 1147664, sectno: 2295328, offset: 1175207936, bcount: 8192.
Sep 3 07:13:51 server vmunix: blkno: 1147696, sectno: 2295392, offset: 1175240704, bcount: 8192.
Sep 3 07:13:51 server vmunix: blkno: 676432, sectno: 1352864, offset: 692666368, bcount: 8192.
Sep 3 07:13:51 server vmunix: blkno: 677440, sectno: 1354880, offset: 693698560, bcount: 8192.
Sep 3 07:13:51 server vmunix: blkno: 228592, sectno: 457184, offset: 234078208, bcount: 8192.
Sep 3 07:13:51 server vmunix: blkno: 250832, sectno: 501664, offset: 256851968, bcount: 8192.
Sep 3 07:13:51 server vmunix: blkno: 249256, sectno: 498512, offset: 255238144, bcount: 8192.
Sep 3 07:13:51 server vmunix: blkno: 249240, sectno: 498480, offset: 255221760, bcount: 8192.
Sep 3 07:13:51 server vmunix: blkno: 171600, sectno: 343200, offset: 175718400, bcount: 8192.
Sep 3 07:13:51 server vmunix: blkno: 171632, sectno: 343264, offset: 175751168, bcount: 8192.
Sep 3 07:13:51 server vmunix: SCSI: Read error -- dev: b 31 0x0e0100, errno: 126, resid: 1024,
Sep 3 07:13:51 server vmunix: SCSI: Async write error -- dev: b 31 0x0e0100, errno: 126, resid: 8192,
Sep 3 07:13:51 server above message repeats 16 times
Sep 3 07:13:51 server vmunix: blkno: 8, sectno: 16, offset: 8192, bcount: 1024.
Sep 3 07:13:51 server vmunix: SCSI: Async write error -- dev: b 31 0x0e0100, errno: 126, resid: 8192,
Sep 3 07:13:51 server vmunix: blkno: 53792, sectno: 107584, offset: 55083008, bcount: 8192.
Sep 3 07:13:51 server vmunix: blkno: 18088, sectno: 36176, offset: 18522112, bcount: 8192.
Sep 3 07:13:51 server vmunix: SCSI: Write error -- dev: b 31 0x0e0100, errno: 126, resid: 1024,
Sep 3 07:13:51 server vmunix: SCSI: Async write error -- dev: b 31 0x0e0100, errno: 126, resid: 8192,
Sep 3 07:13:51 server vmunix: blkno: 1416, sectno: 2832, offset: 1449984, bcount: 1024.
Sep 3 07:13:51 server vmunix: SCSI: Async write error -- dev: b 31 0x0e0100, errno: 126, resid: 8192,
Sep 3 07:13:51 server vmunix: blkno: 27690664, sectno: 55381328, offset: 28355239936, bcount: 8192.
Sep 3 07:13:59 server vmunix: LVM: VG 64 0x500000: Lost quorum.
Sep 3 07:13:51 server vmunix:
Sep 3 07:13:59 server above message repeats 21 times
Sep 3 07:13:59 server vmunix: This may block configuration changes and I/Os. In order to reestablish quorum at least 1 of the following PVs (represented by current link) must become available:
Sep 3 07:13:59 server vmunix: <31 0x060200>
Sep 3 07:13:59 server vmunix: LVM: VG 64 0x500000: PVLink 31 0x060200 Failed! The PV is not accessible.
Sep 3 07:13:59 server vmunix: LVM: VG 64 0x500000: PVLink 31 0x140200 Failed! The PV is not accessible.
Sep 3 07:14:00 server vmunix: LVM: VG 64 0x100000: Lost quorum.
Sep 3 07:14:00 server vmunix: This may block configuration changes and I/Os. In order to reestablish quorum at least 1 of the following PVs (represented by current link) must become available:
Sep 3 07:14:00 server vmunix: <31 0x0e0100>
Sep 3 07:14:00 server vmunix: LVM: VG 64 0x100000: PVLink 31 0x0e0100 Failed! The PV is not accessible.
Sep 3 07:14:00 server vmunix: LVM: VG 64 0x100000: PVLink 31 0x0c0100 Failed! The PV is not accessible.
Sep 3 07:14:02 server vmunix: LVM: VG 64 0x600000: Lost quorum.
Sep 3 07:14:02 server vmunix: This may block configuration changes and I/Os. In order to reestablish quorum at least 1 of the following PVs (represented by current link) must become available:
Sep 3 07:14:02 server vmunix: <31 0x100300>
Sep 3 07:14:02 server vmunix: LVM: VG 64 0x600000: PVLink 31 0x100300 Failed! The PV is not accessible.
Sep 3 07:14:02 server vmunix: LVM: VG 64 0x600000: PVLink 31 0x0a0300 Failed! The PV is not accessible.
Sep 3 07:14:06 server inetd[19871]: registrar/tcp: Connection from localhost (127.0.0.1) at Thu Sep 3 07:14:06 2009
Sep 3 07:15:28 server inetd[19938]: registrar/tcp: Connection from server.tc.gc.ca (10.10.10.20) at Thu Sep 3 07:15:28 2009
Sep 3 07:15:34 server vmunix: LVM: VG 64 0x500000: Reestablished quorum.
Sep 3 07:15:34 server vmunix: LVM: VG 64 0x500000: PVLink 31 0x060200 Recovered.
Sep 3 07:15:34 server vmunix: LVM: VG 64 0x500000: PVLink 31 0x140200 Recovered.
Sep 3 07:15:34 server vmunix: LVM: VG 64 0x100000: Reestablished quorum.
Sep 3 07:15:34 server vmunix: LVM: VG 64 0x100000: PVLink 31 0x0e0100 Recovered.
Sep 3 07:15:34 server vmunix: LVM: VG 64 0x100000: PVLink 31 0x0c0100 Recovered.
Sep 3 07:15:37 server vmunix: LVM: NOTICE: VG 64 0x500000: LV 1: All I/O requests to this LV that were
Sep 3 07:15:37 server vmunix: waiting indefinitely for an unavailable PV have now completed.
Sep 3 07:15:37 server vmunix: LVM: NOTICE: VG 64 0x500000: LV 2: All I/O requests to this LV that were
Sep 3 07:15:37 server vmunix: LVM: NOTICE: VG 64 0x500000: LV 3: All I/O requests to this LV that were
Sep 3 07:15:37 server vmunix: LVM: NOTICE: VG 64 0x100000: LV 2: All I/O requests to this LV that were
Sep 3 07:15:37 server vmunix: LVM: NOTICE: VG 64 0x100000: LV 4: All I/O requests to this LV that were
Sep 3 07:15:37 server vmunix: LVM: NOTICE: VG 64 0x100000: LV 1: All I/O requests to this LV that were
Sep 3 07:15:37 server vmunix: LVM: NOTICE: VG 64 0x100000: LV 3: All I/O requests to this LV that were
Sep 3 07:15:37 server vmunix: LVM: VG 64 0x600000: Reestablished quorum.
Sep 3 07:15:37 server vmunix: LVM: VG 64 0x600000: PVLink 31 0x100300 Recovered.
Sep 3 07:15:37 server vmunix: LVM: VG 64 0x600000: PVLink 31 0x0a0300 Recovered.
whatever
dev44
Regular Advisor

Re: PVLinks not found and then recovered

Also, just to mention that there was I/O on all the systems. There were 4 HP-UX systems, 2 Linux, and a couple of Windows. Only the HP-UX systems recovered. They are the database servers. So I am just trying to find out why they recovered and nothing else did.
whatever
dev44
Regular Advisor

Re: PVLinks not found and then recovered

I can't find a lot of info on proactive polling, which we have turned on. So is it possible that proactive polling picked up the problem when both paths to the disk were lost, and buffered the I/O until the paths recovered?
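
Looking at the syslog wording again ("or a timeout is specified for the LV"), it may simply be default LVM behaviour rather than proactive polling: with no LV/PV timeout set, HP-UX LVM just held the I/O until the links came back. I was going to confirm the timeout settings with something like this (the vg/lv and disk names are placeholders):

# HP-UX: an "IO Timeout (Seconds)" of "default" means LVM queues I/O
# indefinitely while the PV is unavailable
lvdisplay /dev/vgXX/lvolN | grep -i timeout
pvdisplay /dev/dsk/cXtYdZ | grep -i timeout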
whatever
TTr
Honored Contributor

Re: PVLinks not found and then recovered

> There were 4 HP-UX systems, 2 Linux, and a couple of Windows. Only the HP-UX systems recovered.

Does this tell you something about HP-UX?

There are more failures here than just a single disk in the array. Either whole LUNs (whole array groups) failed in the disk array, or BOTH of your fibre switches or the entire disk array rebooted. I see multiple LVM volume groups missing disks.

Also, there is NO evidence of PV-links or any other form of path failover here. Either you don't have path redundancy, or both SAN switches rebooted at the same time, or the array went down. The amount of downtime, about 2 minutes (at least that's what you show), indicates that there may have been a SAN or array controller reboot.

You need to check your volume groups (VG) and identify all the LUNs in them with their paths, path redundancy etc.
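
A starting point for that inventory might be something like this (the VG name is a placeholder):

# HP-UX: list every PV link (primary and alternate) in each volume group
vgdisplay -v vgXX | grep -i "PV Name"
# and map the device files back to their hardware paths / HBAs
ioscan -funC disk

# Linux: show each multipath map with its underlying sd devices and path state
multipath -ll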

Also check the event logs of your SAN switches and disk array(s).
dev44
Regular Advisor

Re: PVLinks not found and then recovered

According to the SAN people, a disk in the array died and was rebuilt using the hot spare. I am Unix only and have no control of, or access to, the SAN other than the LUNs assigned to my systems.

The rebuild onto the hot spare should have been completely transparent to our systems, but it wasn't, and it caused downtime on a few of them. So, in case this happens again, we would like to have a fix in place for the Linux systems, i.e. a larger buffer, timeouts, etc.
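
For example, besides no_path_retry in multipath.conf, would the FC transport timeouts be worth looking at? I have not changed anything yet; these are just the values I was going to check, assuming the HBA driver exposes them on RHEL 5.x:

# how long the FC transport keeps a lost remote port before removing its devices
grep . /sys/class/fc_remote_ports/rport-*/dev_loss_tmo
# how long before outstanding I/O on a lost port is failed up to dm-multipath
# (may not be present with every RHEL 5.x HBA driver)
grep . /sys/class/fc_remote_ports/rport-*/fast_io_fail_tmo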
whatever
TTr
Honored Contributor

Re: PVLinks not found and then recovered

I would ask the SAN people for more details and to show you the event logs from the array. I think they are hiding something from you.
Sudeesh
Respected Contributor

Re: PVLinks not found and then recovered

If one disk in the array died and was rebuilt from the hot spare, why should you get a PVLink failure?

A PVLink failure will occur when there is a connectivity loss. The local FC HBA, the switch, or the link at the storage end can cause this.
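
If you want to rule out the local FC side, a quick check from the HP-UX host might be something like this (the device file depends on your HBA driver, e.g. /dev/td0 or /dev/fcd0):

# list the FC HBAs and their device files
ioscan -fnC fc
# show link state, topology and error counters for one HBA
fcmsutil /dev/td0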


I suspect the storage guys... :(
The most predictable thing in life is its unpredictability
Sudeesh
Respected Contributor

Re: PVLinks not found and then recovered

From the logs, I see that you actually lost a few PVs... please check with the SAN team why it happened.
The most predictable thing in life is its unpredictability