strange mirror behaviour

Mark J McDonald · ‎07-12-2007

Hi All

I replaced a powerfailed disk yesterday in a SureStore rack. It was a root disk and was mirrored. I ran vgcfgrestore, vgsync and the necessary mkboot steps. I checked each logical volume with lvdisplay -v, and all root LVs were showing current/current.

As a test I rebooted the machine and booted from the alternate path (the new disk), and everything looked good. So I gave it one final reboot to boot from the primary path.

When the box came up for the final time I checked the lvdisplays again, and the whole of lvol3 (/) was showing as stale on the primary path (not the disk we replaced). All other lv's were fine.

I ran an lvsync on lvol3 and all seems ok now. Why did this happen?
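For anyone who wants to automate the "check each lvdisplay" step, here is a minimal Python sketch that counts stale extent copies per physical volume. It assumes the "Logical extents" table that lvdisplay -v prints has one row per logical extent, with a PV, PE and status triple for each mirror copy. The function name, the sample text and the device names are hypothetical, so compare them against real output before relying on this.

```python
def stale_extents(lvdisplay_output):
    """Count stale logical-extent copies per physical volume."""
    counts = {}
    in_table = False
    for line in lvdisplay_output.splitlines():
        if "--- Logical extents ---" in line:
            in_table = True
            continue
        if not in_table:
            continue
        fields = line.split()
        # Data rows start with a numeric LE index; the header row does not.
        if not fields or not fields[0].isdigit():
            continue
        # After the LE index, each mirror copy is a (PV, PE, status) triple.
        copies = fields[1:]
        for i in range(0, len(copies) - 2, 3):
            pv, _pe, status = copies[i:i + 3]
            if status == "stale":
                counts[pv] = counts.get(pv, 0) + 1
    return counts


# Hypothetical sample in the assumed layout; device names are made up.
SAMPLE = """\
   --- Logical extents ---
   LE    PV1                PE1   Status 1 PV2                PE2   Status 2
   00000 /dev/dsk/c1t5d0    00100 stale    /dev/dsk/c2t5d0    00100 current
   00001 /dev/dsk/c1t5d0    00101 stale    /dev/dsk/c2t5d0    00101 current
"""

print(stale_extents(SAMPLE))
```

Feeding it the saved output of lvdisplay -v for each root LV after every reboot would show at a glance which PV holds the stale copies.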

Details
hpux 11.00, R390

A. Clay Stephenson · ‎07-12-2007

There is almost no way to know; the most likely explanation is that the primary disk went offline briefly. This isn't quite as lame as it sounds because during the boot process there is a significant period of time after the initial boot before / (normally lvol3) is accessed. The primary drive may have timed out during this interval of several tens of seconds but recovered by the time that the other file systems began to mount. In any event, I would consider replacing your primary drive soon.

I assume that you have installed the latest (or more accurately last) LVM/SCSI patches for 11.0 --- which are now several years old.

If it ain't broke, I can fix that.

chris huys_4 · ‎07-12-2007

Hi Mark,

Surestore, that's a long time since I heard that term. ;) I suppose you have an FC10 or SC10 setup.

Anyway, how did you notice that the disk was powerfailed? Through LVM powerfailed messages logged in syslog.log, or was the disk physically "powerfailed"?

If it's LVM powerfailed messages, maybe it was not the replaced disk that was causing them, but another hardware component on the same bus as the disks.

Also, it looks like your configuration has an external boot disk rather than the internal mirrored boot disk configuration that was more standard at that time.

If you do it externally, you need to be sure that everything is correctly "terminated". For that reason D- and R-classes especially, but also L-classes, mostly had their boot disks mirrored "internally" (unless it was an MC/ServiceGuard config).

Anyway, monitor your configuration. If you don't see (SCSI/LVM) errors popping up in /var/adm/syslog/syslog.log, and the disk logs in /var/stm/logs/os/raw*.cur.log don't grow steadily, you should be alright.
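As a rough way to do that monitoring, the following Python sketch filters a syslog text for lines tagged LVM: or SCSI:. The tag pattern and the sample lines are assumptions for illustration only (the sample is an invented stand-in, not captured from a real system), so adjust the pattern after looking at real entries in syslog.log.

```python
import re

# Assumed tags: kernel messages of interest are prefixed "LVM:" or "SCSI:".
SUSPECT = re.compile(r"\b(LVM|SCSI):")


def suspect_lines(log_text):
    """Return the log lines that carry an LVM: or SCSI: tag."""
    return [line for line in log_text.splitlines() if SUSPECT.search(line)]


# Invented stand-in lines, only to exercise the filter.
SAMPLE_LOG = """\
Jul 12 09:14:02 hostA vmunix: LVM: example powerfail message
Jul 12 09:14:05 hostA inetd[512]: example unrelated message
Jul 12 09:14:31 hostA vmunix: SCSI: example reset message
"""

hits = suspect_lines(SAMPLE_LOG)
print(len(hits))
```

Running it periodically and comparing the number of hits between runs is one simple way to spot errors that keep accumulating.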

Else, log a hardware call.

Greetz,
Chris

A. Clay Stephenson · ‎07-12-2007

Actually, external boot disks on an R box was the smart way to go because unlike their D-3XX cousins, the internal drives in an R-box are SE-SCSI and not hot-pluggable.

If it ain't broke, I can fix that.

Steven E. Protter · ‎07-12-2007

Shalom,

I've read the whole thread. It's a pretty old system, maybe even beyond hardware support.

I'd run cstm, mstm or xstm and exercise those disks.

I'm an adherent to the temporary timeout theory proposed by A. Clay, because I've seen it happen.

SEP

Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com

chris huys_4 · ‎07-12-2007

Pretty sure that if the disks had been internal, Mark wouldn't have seen the above phenomenon. ;)

Marcel Burggraeve · ‎07-12-2007

Can you maybe explain why you think things would be different with internal disks?
Just wondering since if you use correct cables and terminators there's no need to see any difference in behaviour between internal and external disks.
For systems like the (old) K-Class and R-Class it's a wise decision to use external hotswap disks since you cannot use those internally.

Mark J McDonald · ‎07-12-2007

It is a ServiceGuard configuration, and has been working fine for years.

It seems to be working fine again now. The SureStore is an HVD10.

I'll continue to monitor the logs and see how it goes.

Thanks for your ideas.

Mark

A. Clay Stephenson · ‎07-13-2007

I'll seriously consider any hypothesis concerning cables and termination when someone can explain just how termination/cabling could confine its problems to just one LVOL (and that completely) on one PV and leave the others untouched and then continue to work flawlessly since. On the other hand, it is quite possible (though rare) for a drive to go offline during the initial boot and recover in time to access the remaining file systems. Now that I know this is really a JBOD (though a fancier version than the Jamaica), I would very strongly consider replacing that drive as well.

If it ain't broke, I can fix that.

chris huys_4 · ‎07-16-2007

You see, Clay, that's the benefit of the internet. You don't need to own the problem, and thus your answers don't need to be as accurate as they would if you were supporting a customer as a support engineer. ;)

That said, normally my intuition is good, so a possible explanation why only lvol3 was affected and not the other lvols might be the following.

During bootup of a system, it's the root filesystem, i.e. lvol3, that gets accessed first.

In a shared-SCSI MC/SG configuration, where the vg00 boot disks are located on the same shared SCSI bus, the SCSI IDs of the disks behind vg00 begin to play a role when the systems are trying to get IO from the disks.

Normally you have something like the following setup.

systemA      vg00 prim        vg00 mir      systemB
             systemA          systemB
             scsi ID#5        scsi ID#0
init 7 +---------+----------------+---------+ init 6

init 7 +---------+----------------+---------+ init 6
             scsi ID#0        scsi ID#5
             vg00 mir         vg00 prim
             systemA          systemB

The above should be the correct setup. The disk with SCSI ID#5 is, I think, after SCSI ID#7 and SCSI ID#6, the one that has the highest priority when getting "served" for IO. SCSI ID#0 is probably the one that "will get served last".

Anyway, in that respect, it's not impossible that if a disk with a higher-priority SCSI ID than the primary vg00 disk is doing IO to system B, system A will time out on IO to its lower-priority SCSI disk and thus get stale extents only for that lvol.
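The priority ordering described above can be sketched in a few lines of Python. This is only an illustration of standard parallel SCSI arbitration order (ID 7 highest down to ID 0, with IDs 15 down to 8 ranking below that on a wide bus); the function names are hypothetical, and it says nothing about whether arbitration was actually what caused the stale extents here.

```python
def arbitration_order(wide=False):
    """SCSI IDs from highest to lowest arbitration priority."""
    order = list(range(7, -1, -1))        # 7, 6, ..., 0
    if wide:
        order += list(range(15, 7, -1))   # then 15, 14, ..., 8
    return order


def wins_arbitration(id_a, id_b, wide=False):
    """Return whichever of two contending IDs wins the bus."""
    order = arbitration_order(wide)
    return id_a if order.index(id_a) < order.index(id_b) else id_b


print(arbitration_order())     # [7, 6, 5, 4, 3, 2, 1, 0]
print(wins_arbitration(5, 0))  # 5
```

With initiators at 7 and 6, a disk at ID 5 is the highest-priority target and a disk at ID 0 is the lowest on a narrow bus, which matches the layout in the diagram.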

Greetz,
Chris
