High Availability

 
Oscar Gatson
Occasional Contributor

High Availability

The customer I work for has a large deployment of HA/ServiceGuard across their network.
Since deploying this system, they've encountered numerous disk failure problems within the system. One site's system is set up as follows:
1. 2 (two) D390 or D370 processors with 4 internal hot-swappable disk drives, which mirror each other.
2. An external disk array, which serves as additional data storage for both processors.
3. One processor's primary root internal disk is mirrored by the other's secondary root disk and vice versa.
4. Running HP-UX 10.20.

In the case of a cluster failure on the primary processor, the secondary processor
should assume primary responsibilities until the primary processor is fixed.

We've run into situations where this has happened and the secondary processor did not assume the primary role, thus resulting in a custom application outage.

Also, we've had several instances in which disks would fail or crash, either internally or in the external disk array. I'm very skeptical that these are really disk failures. Are there specific software problems or patches that we need to look at for this particular problem? Is there firmware needed for the disks?
Stefan Farrelly
Honored Contributor

Re: High Availability


To delve into this further you need to establish what you mean by a disk failure.
Have your disks been spinning down (PV lost - Powerfail), or have you been getting SCSI errors on them? SCSI resets? Lots of errors in logtool?
What made you decide the disks have been faulty?

If a disk lets go in a big way, it can stop SG from failing over. I've seen it happen, especially when a disk fails in such a way that all it does is constantly send out SCSI errors or SCSI bus resets. This affects the whole server, and if the disk is connected to both nodes in a cluster it will affect both. These types of disk failures aren't that common, but they do happen, and the only solution is to remove or power off the disk in question ASAP to stop the SCSI errors/resets so you can regain control of the system (in the meantime it will grind to a halt, load will shoot up, etc.). I've also seen this problem happen with SCSI controllers, but mainly on Nikes, nothing else.
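
As a rough way to tell these failure types apart, the kernel messages in a copy of the system log (on HP-UX normally /var/adm/syslog/syslog.log) can be tallied by kind. The sketch below is only an illustration: the function name classify_lines and the three patterns are assumptions, since the exact message wording differs between releases and drivers, so adjust them to what your logs actually contain.

```python
import re
from collections import Counter

# Hypothetical patterns, not taken from any HP documentation: check them
# against real messages in your own syslog before trusting the counts.
PATTERNS = {
    "powerfail": re.compile(r"POWERFAILED", re.IGNORECASE),
    "scsi_reset": re.compile(r"SCSI.*reset", re.IGNORECASE),
    "scsi_error": re.compile(r"SCSI.*error", re.IGNORECASE),
}


def classify_lines(lines):
    """Count log lines per failure category (first matching pattern wins)."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break
    return counts
```

To use it, pass any iterable of log lines, for example an open file object for a copy of syslog.log. A log dominated by resets points towards the "disk or controller flooding the bus" case described above, while powerfail entries point towards disks dropping out.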
I'm from Palmerston North, New Zealand, but somehow ended up in London...
James R. Ferguson
Acclaimed Contributor

Re: High Availability

Oscar:

I would do several things if you haven't already.

1. Make sure your MC/ServiceGuard is a current version.

2. Install the latest online Diagnostics from the Support Plus CD. This adds the EMS and Predictive Support facilities.

3. Configure Predictive Support to run every night. You can use your internal modem to have it dial HP Support. You will be notified of any problems. EMS is virtually self-configuring when installed. Check your root account's mail for messages.

4. Have an engineer refresh your server with current PDC code and disk firmware.

5. Take a look at document #UXSGKBAN00000102
http://us-support.external.hp.com/cki/bin/doc.pl/sid=d775a0f61067f16a5c/screen=ckiSearchResults

This has a good discussion about dissecting failover failures.

Hopefully this helps.

...JRF...