
Point of having a root disk mirror ? (post mortem)

 
Mel Burslan
Honored Contributor

What is the point of having a root disk mirror ?

The question probably sounds stupid, but hear my story of the night. This time I got lucky: the root disk that died was on the standby member of a test cluster, so it did not affect anything. But when the same thing happened last time on the primary node of a production cluster, we got caught with our proverbial pants down, as the root mirror was not bootable. Fortunately, that was not the case in tonight's tragedy.

Okay, we have this two-member cluster. Both cluster members have the same hardware configuration and are connected to a SAN for shared storage. Root disks are mirrored. Mirror disks reside on an external SCSI bus to eliminate another single point of failure (i.e. an internal SCSI bus bust-out).

Obviously, the root disk, the actual one the system was last booted off of, had died sometime today. We got a call from the HP ISEE monitoring center saying that this server was reporting disk errors. I got on the case around 11 PM. I tried ssh from our central management host with trusted keys, as root, and my session hung in the middle.

This was not something new. I had the same thing when the production root disk had its meltdown a few months ago. I was still able to run one-off commands from this trusted node, for example:

ssh $HOST "ps -ef | grep $USERNAME"

and the results came back as expected. But commands of a more esoteric nature, like ioscan or make_tape_recovery, had trouble completing. Some did, some did not, for reasons unknown to me.
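
One way to keep a one-off remote command from hanging the whole session is to wrap it in a watchdog. The following is only a minimal sketch in POSIX shell: `run_with_timeout` is a hypothetical helper name, not an HP-UX utility, and it assumes the hung process still responds to SIGTERM (a local ssh client normally does; a process blocked inside the kernel on a dead disk may not, in which case this helper would wait as well).

```shell
# Hypothetical helper: run a command, but give up after a number of seconds.
# Usage: run_with_timeout SECONDS COMMAND [ARGS...]
# Returns the command's own exit status; that status is non-zero if the
# watchdog had to terminate the command.
# Note: plain POSIX sh has no "local", so these variable names are global.
run_with_timeout() {
    rwt_limit=$1
    shift

    # Start the real command in the background and remember its PID.
    "$@" &
    rwt_cmd_pid=$!

    # Watchdog: sleep, then send SIGTERM to the command if it is still alive.
    # Output is discarded so the watchdog never holds the caller's pipes open.
    (
        sleep "$rwt_limit"
        kill -TERM "$rwt_cmd_pid" 2>/dev/null
    ) >/dev/null 2>&1 &
    rwt_watchdog_pid=$!

    # Wait for the command (it either finishes or gets terminated).
    wait "$rwt_cmd_pid"
    rwt_status=$?

    # The command is done, so the watchdog is no longer needed.
    kill "$rwt_watchdog_pid" 2>/dev/null
    wait "$rwt_watchdog_pid" 2>/dev/null

    return "$rwt_status"
}
```

From the management host this could be used as `run_with_timeout 30 ssh "$HOST" ioscan -fnC disk`, so that a stuck ioscan costs thirty seconds instead of a hung terminal. The thirty-second limit is an arbitrary illustration.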

I had to reboot the server to the alternate disk to be able to rebuild the root disk.

My question is: why do I have a root mirror if I have to reboot every time I lose the root disk? Is it just for the convenience of restoring quickly after the reboot? Isn't the purpose of having a root mirror uninterrupted operation? Am I missing something here? 11 PM is not my favorite time of day to show up at work, especially on a Saturday night, so I might not be thinking straight, and I want somebody to correct me where my thinking goes off kilter.

As always, thanks in advance...
________________________________
UNIX because I majored in cryptology...
Thayanidhi
Honored Contributor

Re: Point of having a root disk mirror ? (post mortem)

Hi,
When you have a mirror and a disk fails, you don't need to reboot. You can replace and rebuild it online, provided the disks are hot-pluggable. Follow the link for a useful document about replacing LVM disks.

http://docs.hp.com/en/5991-1236/When_Good_Disks_Go_Bad.pdf

I still believe there is some other issue somewhere else blocking you. It could be SSH or something else. ioscan and make_tape_recovery may hang because the defective disk is still plugged into the server. Once you have identified the failed disk, follow the procedure to replace and rebuild it.

Regds
TT
Attitude (not aptitude) determines altitude.
Mel Burslan
Honored Contributor

Re: Point of having a root disk mirror ? (post mortem)

Thanks for the info. I am well aware of the document you linked. As a matter of fact, I have my own customized derivative of it for building and rebuilding root disks and their mirrors. But it still does not explain why I needed to reboot the system instead of carrying on with the mirror disk, which was perfectly healthy to boot and work from. In my opinion, the downtime I took to reboot the machine was not necessary. I should have been able to operate the machine non-stop and rebuild the failed drive on the fly. This is the second time this same thing has happened. When the mirror copy of the boot disk fails, i.e. the copy the system did NOT boot from, I am perfectly okay. I can rebuild it online.

Still looking for the answer why a reboot was necessary.
________________________________
UNIX because I majored in cryptology...
Tim D Fulford
Honored Contributor

Re: Point of having a root disk mirror ? (post mortem)

Hi

You should not HAVE to reboot to replace the root disk. However, I assume you are using HP's LVM MirrorDisk/UX.

The title "mirror disk" is really misleading: it does NOT mirror disks, it mirrors logical volumes. Admittedly, by mirroring all the LVs on one disk onto another (one by one) you can achieve the same result, but the name really implies that ALL of one disk is mirrored onto the other. This is not so!

Anyway, there are a number of gotchas that can cause problems:
1 - Both disks must be pvcreated as bootable disks.
2 - The LIF areas are not mirrored under LVM, as they exist OUTSIDE the LVM structure (part of my gripe about mirror disk). You must add this data to both disks (it is static, so not too onerous a task).
3 - You MUST mirror lvol1, lvol2 & lvol3 in that order (or /stand, swap, / respectively) on the disks.
4 - You must set up primary and alternate disks with "setboot" or at the ISL level.
5 - Make sure the LIF AUTO file is appropriate, e.g. "hpux -lq" is sufficient, but you can use others.
6 - Once you have done the above, it is recommended to run "lvlnboot -Rv" to ensure all of the above is registered properly.
7 - My favorite: test it. Yup, that is right, don't assume that because the above has been done correctly it will always work. V-Class servers had a "bug" where they would not boot off the mirror disk until it was patched.

Regards

Tim
-
Mel Burslan
Honored Contributor

Re: Point of having a root disk mirror ? (post mortem)

Again, thanks for the information, but everything you mentioned had been done previously. I was able to boot from this mirror disk after I had to shut the server down. It still does not tell me why I had to reboot. The logical volumes were mirrored with no stale extents, and the LIF areas contained the right information to boot from this disk. The lvlnboot commands for lvol1, 2 and 3 were run and were successful. If any one of them had been missing, I would not have been able to boot from this disk.
Still looking for the magical answer to "why was non-stop operation not an option?"
________________________________
UNIX because I majored in cryptology...
Alex Georgiev
Regular Advisor

Re: Point of having a root disk mirror ? (post mortem)

Mel, could you be missing patches? Could it be that the drive did not fail completely but only had media errors?

Frankly, only you are in a position to figure out why the mirroring didn't work as expected.

If this happens again you can try running tusc on the commands that hang, and see what system call makes them hang.

If you are certain that the broken disk is a problem, you can also try 'lvreduce -m 0' for certain (or all) LVs, to see if that eliminates the problem. Don't know what else to suggest... except that you should examine how your swap space is configured... and you should make sure that those external disks you are talking about don't have any performance problems.

As far as the benefits of mirroring go... I suppose you could have spent a couple of hours restoring from tape along with the reboot. But you didn't have to, did you? :-)

Hope that helps!
Mohanasundaram_1
Honored Contributor

Re: Point of having a root disk mirror ? (post mortem)

Hi Mel,

My understanding is that you need to reboot the server to boot from the alternate disk. If you had a disk to replace it with, you could have done a hot plug.

But I guess you are asking about the scenario where you do not have the replacement drive. MirrorDisk is not meant to be a "fail-over", at least not for the root mirror.

This is the reason why commands like ioscan hang. You have to reboot the system from the available root mirror for the system to work without hiccups. This has been my experience as well.

Then why mirror? Because you still have a root disk to boot from. Without this mirror you may have to recover from Ignite, which may take around 2 hours to restore, if you manage to find the Ignite tapes quickly.

I may be wrong. But I thought of sharing my experience. I hope someone really answers your query - "WHY REBOOT?"

With regards,
Mohan.
Attitude, Not aptitude, determines your altitude
A. Clay Stephenson
Acclaimed Contributor

Re: Point of having a root disk mirror ? (post mortem)

I can only say that I have replaced tens of boot disks and/or boot mirrors, and never once have I shut down to do it, and never once were the users even aware that the disk had failed. Without knowing more about your configuration, I can't explain why you were not able to continue to operate.

Your failed boot disk should have been a complete non-event.
If it ain't broke, I can fix that.
Steven E. Protter
Exalted Contributor

Re: Point of having a root disk mirror ? (post mortem)

After this little story, I'm posting the procedure I used.

In short, I had a D320 mirrored this way under 11i. At the time, I was running my wife's website, http://tehillimsongs.com, and http://www.hpux.ws off the box. I thought it was funny running an HP-UX site off an HP-UX box, right?

The box was fully mirrored, its internal 4.3 GB disks mirrored to an equally sized set of disks on an HP-6000 disk array.

I still have the array; it's holding down some papers in my tiny apartment. One day one of its 4.3 GB disks failed on me. I was booted off these disks, because the machine was experimental and I wanted to make sure the mirror worked.

It did.

My wife, who is quite pushy for someone who pays zero bucks for a website (I know there are fringe benefits to keeping her site up), pretty much goes ballistic when her website goes down.

Tongue-in-cheek aside, just like any other customer, she doesn't tolerate her website being down. Due to proper mirror configuration, it was a non-event. Since the HP-6000 array can't be worked on while the system is running, I failed her site over to a Linux box (custom DNS configuration) and kept her running.

Similar scenarios got my servers at my previous job through several problems over the years without anybody noticing it except me and operations. Those stories just are not as much fun to tell.

Note that lifls and other utilities can and should be used to verify that the configuration will work.

Lastly, to be sure, you have to test it.

The names of the disks have been changed to protect the innocent.

Procedure---
pvcreate -B /dev/rdsk/c1t0d0 #use real disk

mkboot -l /dev/rdsk/c1t0d0
mkboot -a "hpux -lq (;0)/stand/vmunix" /dev/rdsk/c1t0d0 # use real disk


# mkboot -b /usr/sbin/diag/lif/updatediaglif -p ISL -p AUTO -p HPUX -p PAD -p LABEL /dev/rdsk/c?t?d?

If you are running 64-bit OS:

# mkboot -b /usr/sbin/diag/lif/updatediaglif2 -p ISL -p AUTO -p HPUX -p PAD -p LABEL /dev/rdsk/c?t?d?


vgextend /dev/vg00 /dev/dsk/c1t0d0 # use real disk
lvextend -m 1 /dev/vg00/lvol1 /dev/dsk/c1t0d0 # use real disk; repeat for the other lvols

lvlnboot -r /dev/vg00/lvol3 # root fs /
lvlnboot -s /dev/vg00/lvol2 # swap
lvlnboot -d /dev/vg00/lvol2 # dump
lvlnboot -b /dev/vg00/lvol1 # boot (/stand)
lvlnboot -R # update the boot information on the disks
lvlnboot -v # display it to verify
setboot # display the current boot paths
setboot -a 52.1.0 # alternate boot path: second disk

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Torsten.
Acclaimed Contributor

Re: Point of having a root disk mirror ? (post mortem)

Hi Mel,

Because of your activity in this forum, I know you are not inexperienced.

The root cause of the trouble is that your disk dies hard. Remember "Halloween": Michael stands up again and again.

Once the disk has a failure, you'll get some messages in syslog like "powerfailed". But in rare cases the disk then wakes up again and the system tries to resync the VG. The disk fails and wakes up again, your system tries to resync it again, and so on. You end up with a lot of pending write requests, all of them waiting for the timeout. This slows down your system extremely, especially if you run disk-related commands like "ioscan". Have a look at your syslog file and you will find many entries.
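
A quick way to see whether a disk is flapping like this is to count the matching syslog entries. What follows is a minimal sketch only: `count_disk_events` is a hypothetical helper name, the default pattern is merely a guess that covers both the "powerfailed" and "POWER-FAILED" spellings quoted in this thread, and the exact message wording on a given system is an assumption to check against the real log.

```shell
# Hypothetical helper: count the lines in a log file that match a pattern,
# ignoring case.
# Usage: count_disk_events LOGFILE [PATTERN]
# The default pattern matches "powerfailed" as well as "POWER-FAILED".
count_disk_events() {
    cde_logfile=$1
    cde_pattern=${2:-'power.*fail'}
    # -c prints the number of matching lines, -i ignores case.
    grep -ci -e "$cde_pattern" "$cde_logfile"
}
```

On HP-UX the log usually lives at /var/adm/syslog/syslog.log, so a call might look like `count_disk_events /var/adm/syslog/syslog.log`, or `count_disk_events /var/adm/syslog/syslog.log lbolt` to look for lbolt messages instead. A count that keeps climbing while the system is sluggish would fit the fail-and-wake-up pattern described above.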

The only way out is to pull the disk or do a "pvchange -a n ...", if you are on an appropriate patch level.

This does not happen every time, but I've seen it in rare cases.

Hope this helps!
Regards
Torsten.

__________________________________________________
There are only 10 types of people in the world -
those who understand binary, and those who don't.

__________________________________________________
No support by private messages. Please ask the forum!

If you feel this was helpful please click the KUDOS! thumb below!   
Sameer_Nirmal
Honored Contributor

Re: Point of having a root disk mirror ? (post mortem)

Hi Mel,

The system should stay up and running even if the boot disk fails.
I do agree with what Torsten has explained.

When the boot disk fails and the PV timeout expires, you see a SCSI lbolt error along with POWER-FAILED in the syslog. The only impact should be a short pause equal to the PV timeout (60-120 seconds), which normally goes unnoticed by the users.
Now, if the disk has died completely (offline), the system should operate normally, redirecting all I/O to the mirror disk. The trouble starts if the dying disk has intermittent problems that cause it to come back online and make LVM do its job, i.e. resync the stale extents. The system becomes slow to respond in that case. No doubt commands like ioscan would go into a loop because of the intermittent nature of the disk's functioning. If the failing disk responds, even intermittently, there is no way to stop LVM from accessing it unless you have LVM OLAR (installed with patches) and use pvchange -a N. This might be exactly what is happening in your case.
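
The detach step can be wrapped so that it is reviewed before it is run. This is a hedged sketch, not a tested procedure: `detach_pv` and the `APPLY` switch are hypothetical names, the device path is whatever the caller supplies, and the `pvchange -a N` form only works on systems carrying the LVM online-replacement patches mentioned above.

```shell
# Hypothetical helper: show (or, with APPLY=1, actually run) the command that
# detaches a physical volume from LVM so that LVM stops sending I/O to it.
# Usage: detach_pv /dev/dsk/cXtYdZ
detach_pv() {
    dp_pv_path=$1
    if [ -z "$dp_pv_path" ]; then
        echo "usage: detach_pv /dev/dsk/cXtYdZ" >&2
        return 2
    fi
    if [ "${APPLY:-0}" = "1" ]; then
        # Real run: needs root and an LVM patch level that supports -a N.
        pvchange -a N "$dp_pv_path"
    else
        # Dry run (the default): only print the command for review.
        echo "pvchange -a N $dp_pv_path"
    fi
}
```

Calling `detach_pv /dev/dsk/c1t0d0` prints the command; `APPLY=1 detach_pv /dev/dsk/c1t0d0` would execute it. Defaulting to a dry run is a deliberate choice here, since picking the wrong device file on a half-dead system is an easy mistake at 11 PM.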

I guess you booted from the mirror after physically removing the failed boot disk, then rebuilt the mirror with a new disk, and the system is happy again.
Bill Hassell
Honored Contributor

Re: Point of having a root disk mirror ? (post mortem)

After many years of taking down-system calls about broken disks, I can tell you that there are failure modes that cannot be isolated from the OS. For every 100 disks that fail in a clean way, there are a few that produce hangs just as Mel describes. I just went through a similar scenario. The root mirror (not the primary disk) failed and network commands started failing. lanscan, bdf (with NFS mounts), nslookup, etc. all hung. Most everything else worked (telnet, the Informix raw database, even logins were normal). Just certain network commands hung.

This was a fully-patched system with the latest HWE and QPK patches. Since the mirror was external, the solution was easy: pull the bad disk out, and within seconds, all networking commands started working. Now why would a bad disk affect certain networking tasks? No idea. Since anything electronic can fail in millions of ways, it is really difficult to account for all possibilities. Adding extra electronics and code to check the interface cards would slow down every disk access (more system overhead). Similarly, the disk may respond to most commands but fail to complete a certain handshake, or bad status values are returned and the driver trips over the crazy bits.

So the answer is that all disks need to be removable to prevent the rare possibility of a hang situation. This isn't the only possible failure mode: new fibre disks, especially when going through a switch, can also create stability problems. It seems that the only consistent way to get past these problems is to disconnect the defective disk. As you might imagine, remote systems can be a big challenge as you try to describe the correct disk to remove. Consider putting detailed labels on all the remote equipment and keeping photos locally so you can direct remote sysadmins.


Bill Hassell, sysadmin
Andy Torres
Trusted Contributor

Re: Point of having a root disk mirror ? (post mortem)

Regarding your original question: if I understand your story correctly, you don't like that you had to reboot from your mirrored root volume after the primary died. You are in a cluster, but I can't tell whether it failed over to the second node successfully or not (assuming ServiceGuard). I'll assume it did, and that your users saw no interruptions.

So, to answer your question of "Isn't the purpose of having a root mirror uninterrupted operation?", I'll say simply: no. ServiceGuard, or another cluster product, is what you use for uninterrupted operation. Mirroring your root volume is a way to simplify the recovery process.

I hope I understood your question correctly, and I helped you out.

P.S. 11PM isn't so bad. It happens. Life of a SysAdmin. :-)
Mel Burslan
Honored Contributor

Re: Point of having a root disk mirror ? (post mortem)

Well, all this reminds me of a part of a movie that I am not a very big fan of, due to the acting and the script, namely "I, Robot". There was the recording of the professor who said something like "There is always something unexplained in science, call it a fluke, call it ghosts in the machine, like nobody knows why robots tend to stick together when they are left alone instead of standing away from each other. There is always an unexplained piece of code inside every machine."

Yes, I did not have any downtime, as this server was the standby server in the cluster, but losing the root disk in the same scenario on the primary server in the production cluster was very heart-wrenching. At that time, we also found that the boot LIF was corrupted. This was the main reason for my question. Otherwise, although I had to take the server down to release the hung processes and rebuild the new disk, it was a non-issue.

Given the complexity of the 9000 series machines, I am writing these two events off to the gremlins. And again, thanks for all the responses.
________________________________
UNIX because I majored in cryptology...
Ranjeet A.
Advisor

Re: Point of having a root disk mirror ? (post mortem)

Hi All,
After a long discussion (the post mortem), I wonder why the HP moderator didn't respond. Or is there anyone alive under that name? Or does it mean we (customers) don't have alternatives?
Is HP responsible for this forum and the discussions? Shame...

Ranjeet