Operating System - HP-UX

Paula J Frazer-Campbell
Honored Contributor

root mirror powerfail

The Friday challenge :-
From my sister company.

Details.
N Class - 11.00 - FC10

All disks configured with alt path.

Fault:-
In syslog - LVM: vg[1]: pvnum=0 dev_t=XXXXXX is POWERFAILED.
This was the mirror of the /stand Lvol.

Yes the disk requires changing.

When this occurred the server stopped and did not carry on running on the primary disk, and a server reboot was required.

The IT director will accept a disk change but is demanding to know what occurred.

It is not patched to latest levels and I believe that is why this fault occurred but a definitive answer is being demanded.

This I believe cannot be given.

Does anyone have experience of this fault??

Paula



This occurred this AM and
If you can spell SysAdmin then you is one - anon
Patrick Wallek
Honored Contributor
Solution

Re: root mirror powerfail

I have had this happen on a D330 that I've got, Paula. I had one of the vg00 mirrors die on a Friday afternoon at 4PM (naturally!). I had to wait to get the replacement disk until Monday, because I wasn't on 24x7 maintenance. I wasn't really worried. It is mirrored after all.

I got in Monday, the box was non-responsive. I wound up having to reboot from the mirror.

I got the original bad disk replaced. Well, 2 days later the other VG00 disk died. I was not too happy.

I think my system died, because both VG00 disks happened to be in their last days of life. I never have precisely figured out what happened.

So, I guess the point of the story is check your other VG00 disks and see if they are showing signs of problems and be ready to replace them too.
Krishna Prasad
Trusted Contributor

Re: root mirror powerfail

I did have this same problem.
However in my case the system did eventually come back.

The system was hung for about an hour to an hour and a half. We were lucky because the machine was an SAP application Server and SAP has a way to switch all the users to a different machine quickly. This gave us the chance to do more digging before quickly re-booting.

I think what happened to both of us is that the drive was suffering repeated I/O hangs without failing completely, so the system kept trying to access it and timed out on some of the calls that touched the stale extents.

I also agree that the system should switch over to the alt. link much more quickly and stop trying to use a failed device.

Positive Results requires Positive Thinking
Paula J Frazer-Campbell
Honored Contributor

Re: root mirror powerfail

Hi Guys

Both good points.

Thanks. I will assign points at the end, as I want to keep the rabbit away at the moment.


Paula

If you can spell SysAdmin then you is one - anon
James R. Ferguson
Acclaimed Contributor

Re: root mirror powerfail

Hi Paula:

Have a look at this thread for some similar issues and things to check. The Knowledge base document I cited would not seem to apply to N-servers, but the information is worthwhile.

http://forums.itrc.hp.com/cm/QuestionAnswer/1,,0x0b5d0b0717d1d5118ff40090279cd0f9,00.html

Regards!

...JRF...
Roger Baptiste
Honored Contributor

Re: root mirror powerfail

Paula,

<<This was the mirror of the /stand Lvol.>>

How soon was this error noticed? In my experience, we only find out about this when the system seems hung, when some lvol cannot be accessed, or when users all over the place call to say their system is "very slow" ;-)

The point I am trying to make is that if these messages trigger an ITO alert or some other similar alert, one has the chance to lvreduce the bad disks and hold on grimly till a replacement comes.

<>

you bet.

<>

Logically, it should not do this. Hey, that's why we have the mirrors. But since this is root vg00, I guess a lot of pending I/Os to the mirror result in I/O hanging and ultimately lock up the system.

<>

Uh oh. For one, the disk got conked. There is no insurance against acts of god. Even with the best of setups, one is always open to mechanical failures.

Always follow this up with HP and see whether they have the usual patch ;-) and see whether you can set up alerts to catch these messages. (That reminds me to do the same!)
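
For anyone without ITO, a rough idea of what such an alert could look like is sketched below in Python. It is only a sketch under assumptions: the pattern is modelled on the single syslog line quoted at the start of this thread, the function name find_powerfailed is made up for illustration, and the default log path is the usual HP-UX syslog location, which may differ on your system.

```python
import re
import sys

# Pattern modelled on the one syslog line quoted in this thread; the exact
# message format on other releases or patch levels is an assumption.
POWERFAIL_RE = re.compile(r"LVM: vg\[(\d+)\]: pvnum=(\d+)\b.*\bis POWERFAILED")


def find_powerfailed(lines):
    """Return (vg_index, pv_number) pairs for every POWERFAILED line."""
    hits = []
    for line in lines:
        match = POWERFAIL_RE.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2))))
    return hits


if __name__ == "__main__":
    # Default path is an assumption; pass another file as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/adm/syslog/syslog.log"
    with open(path, errors="replace") as log:
        for vg, pv in find_powerfailed(log):
            print(f"ALERT: vg[{vg}] pvnum={pv} reported POWERFAILED")
```

Run from cron and piped to mail, something like this would at least flag the failing mirror before users start calling. It does not tell you which device file is behind the pvnum; that still has to be looked up by hand.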

-raj
Take it easy.
James R. Ferguson
Acclaimed Contributor

Re: root mirror powerfail

Hi (again) Paula:

How long did you wait before concluding that your catatonic server required a reboot?

It seems to me that this is a case of a hung I/O for which we want to wait.

In addition to 'pvtimeout' there is a logical volume timeout. Note this from the man pages for 'lvchange':

/begin_quote/

The lvchange command can also be used to change the timeout value for a logical volume. This can be useful to control how long an IO request will be retried (for a transient error, like a device timeout), before giving up and declaring a pending IO to be failed. The default behavior is for the system to continue to retry an IO for a transient error until the IO can complete. Thus, the IO will not be returned to the caller until the IO can complete. By setting a non-zero IO timeout value, this will set the maximum length of time that the system will retry an IO. If the IO cannot complete before the length of time specified by the IO timeout, then the IO will be returned to the caller with an error. The actual duration of the IO request may exceed the logical volume's maximum IO timeout value when the underlying physical volume(s) have timeouts which either exceed the logical volume's timeout value or are not an integer multiple of the logical volume's timeout value (see pvchange(1M) for details on how to change the IO timeout value on a physical volume).

/end_quote/
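
To make the last sentence of that quote concrete, here is a small Python sketch of how the two timeouts might interact. It rests on a simplified model that is my assumption, not documented LVM internals: a stuck I/O is retried in whole physical volume timeout periods until the logical volume timeout has elapsed. The function name worst_case_io_wait and the example values are made up for illustration.

```python
import math


def worst_case_io_wait(lv_timeout, pv_timeout):
    """Estimate seconds before a stuck I/O is returned with an error.

    Simplified model (an assumption): the I/O is retried in whole
    physical-volume timeout periods until the logical-volume timeout
    has elapsed. An lv_timeout of 0 means the default behaviour
    described in the man page, i.e. retry until the I/O completes.
    """
    if pv_timeout <= 0:
        raise ValueError("pv_timeout must be positive")
    if lv_timeout == 0:
        return math.inf
    # Round the LV timeout up to the next whole multiple of the PV timeout.
    return math.ceil(lv_timeout / pv_timeout) * pv_timeout


if __name__ == "__main__":
    for lv, pv in [(60, 30), (60, 45), (20, 30), (0, 30)]:
        print(f"lv={lv}s pv={pv}s -> {worst_case_io_wait(lv, pv)}s")
```

Under this model an lvol timeout of 60 with a pv timeout of 45 would wait 90 seconds, and an lvol timeout of 0 would wait forever, which matches the hang described in this thread. Treat the numbers as illustration only and check the real behaviour against the lvchange and pvchange man pages for your release.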

Regards!

...JRF...
Paula J Frazer-Campbell
Honored Contributor

Re: root mirror powerfail

Hi James

The fault occurred at 23:31 last evening, and when the sysadmin came in this AM at 07:30 he found that only a partial service was running.

The server had gone past the timeout and was running, although their customers could not connect via X.25 to the database.

He found that a reboot cleared the problem and the disk is now not reporting errors - although I have advised him still to change it.

Paula
If you can spell SysAdmin then you is one - anon