
Single Points of Failure

 
MarkSyder
Honored Contributor

Single Points of Failure

Hi everybody.

Following my proposed bid for extra discs and mirrordisk, I've been asked what other single points of failure there are in the system. I came up with two:

1. PSU on the server

2. PSU on the disc array.

I made the point that both of these are quicker and easier to rectify than a disc crash. I was asked to check if there are any more. Points for any point of failure that I haven't thought of.

Mark Syder (like the drink but spelt different)
The triumph of evil requires only that good men do nothing
Peter Godron
Honored Contributor
Solution

Re: Single Points of Failure

Mark,
starting from the most extreme:
1. Grid failure
2. Server room failure
3. Rack cable failure
4. PSU server
5. CPU failure
On the disk and connection side:
1. PSU disk
2. I/O card failure
3. Cable failure
If on a SAN:
1. Network card failure
2. comms failure
The rule we use is to eliminate at least the top 80% of failures, i.e. standby PSU and UPS, dual cards and connections.
And hope nothing ever happens ;-)
Stephen Keane
Honored Contributor

Re: Single Points of Failure

SCSI controller card.
NIC card.

Do you have a spare console/keyboard/mouse or a UPS? You can always rlogin to the server, but if you need to use single-user mode a spare console is useful (you can always borrow one from another server).
Eric Antunes
Honored Contributor

Re: Single Points of Failure

Hi Mark,

From my experience:

- disk failures (the most common);
- admin or user error (oops);
- failures due to overheating; normally the disks are the first to fail

Eric Antunes

Each and every day is a good day to learn.
Jan van den Ende
Honored Contributor

Re: Single Points of Failure

Mark,

so trivial that it usually tends to be forgotten, but it is the most dangerous, and probably the most often occurring:

human error.

Can be subdivided into:

- ignorance
- clumsiness (from accidentally hitting a power switch to mistyping commands while privileged)
- sabotage (i.e. willingly executing any detrimental activity, for whatever reason)

The chances thereof can be diminished by restricting physical access, restricting privileges, limiting the scope of dangerous commands, ...
but in the end it somehow will come down on safe procedures, and plain & simple trust.

Proost.

Have one on me.
Don't rust yours pelled jacker to fine doll missed aches.
MarkSyder
Honored Contributor

Re: Single Points of Failure

Hi Peter.

Grid failure: as in National Grid?

Rack cable failure: does this mean a cable in the server cabinet?

Please excuse the absence of points at this stage - I'm concerned that people will stop reading the thread if it has a bunny on it too soon. Points to be awarded when I think the subject has been adequately covered.

Mark
The triumph of evil requires only that good men do nothing
melvyn burnard
Honored Contributor

Re: Single Points of Failure

In some cases the answer will depend upon the server you are using.
For example, the Superdome has N+1 fans, power supplies, etc. to rule out SPOFs, but it still has three: the backplane, the UGUY, and a third board that is completely passive - but if it WERE to fail.....

Older systems did not have N+1 fans etc.
Human "error" is certainly a SPOF, as is the actual application software you use ;-]



My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Stephen Keane
Honored Contributor

Re: Single Points of Failure

Common causes of failure that I have personally witnessed!

Someone nearby putting a JCB bucket through a buried power cable, resulting in a local power failure. Advice - UPS and/or generator.

The fire alarm going off, resulting in Halon gas being dumped in the computer room; due to H&S rules no one was allowed into the computer room for 24 hours - there wasn't a fire, it was a £5 sensor that had malfunctioned.

Availability/contactability of qualified staff in the event of a problem. Advice - have clear procedures (and practise them) and contact details (periodically check them).

The power went off, the generator kicked in, but the delay caused the systems to all crash. Advice - use a UPS as well as a backup generator for brown-outs.

The power went off, the computers were OK, they were on UPS, but the lights weren't. It's quite difficult operating in a large server room in the semi-dark. Advice - don't underestimate the power of a torch!

An operator called out support staff in the middle of the night because the computer had frozen. What had actually happened was that someone had typed CTRL-S on the console; it was a long trip to simply type CTRL-Q. Advice - a little education goes a long way.
Steve Lewis
Honored Contributor

Re: Single Points of Failure

One SPOF that affected my rp8400, but also applies to the 74xx and 54xx, is the GSP card. It went phut, the machine lost its cell configuration and I couldn't boot at all.
You may also consider the backplane. But, if you are that paranoid you should consider a 2nd server.

Peter Godron
Honored Contributor

Re: Single Points of Failure

Mark,
yes, grid failure as in the national grid. We lost power ONCE to the whole building during a storm when lightning struck nearby. Since then we have had standby generators on site!
The rack cable failure was caused by bad installation and a resulting short due to fraying.
Pete
Dave Unverhau_1
Honored Contributor

Re: Single Points of Failure

Mark,

Along the lines of what Peter was saying, it's good practice to distribute power inputs between source circuits wherever you have redundant power supplies. This helps insure against the outage of a single circuit taking down the system. Also, careful planning of the loading of each source circuit is important.

Education of the staff is important here, as well. When adding components to a system, empty outlets on the PDUs are like extra checks in the checkbook... "my account can't be overdrawn -- I still have checks!" ...don't overload the circuits.

Best Regards,

Dave
Romans 8:28
Florian Heigl (new acc)
Honored Contributor

Re: Single Points of Failure

I'm merely a hobbyist who works in an HA environment, but I'll try to collect a few things I have experienced:

Power !!! as in
- a 360V plug fiddled together with a modem cable; someone pulled it.
- LAN switches not plugged into the UPS
- systems only connected to a UPS
- having no generator (that's not so bad, just ensure you have time to shut down)

A reasonable system, without wasting money, would have two PSUs, each connected to a surge protector followed by a UPS.
The UPS can be shared by multiple systems (hence the surge protector).
It's two UPSs because here in Germany the national grid has a much lower chance of failure than a mid-range UPS, but in the event of a sudden failure UPSs have been known to fail or blow up inside after a moment.
The power cables must be secured at both ends to avoid accidental pull-outs;
also manually check their fit.

- I/O backplanes:
If you have two network adapters, FC adapters or such, put them into different I/O backplanes if the system supports it (>= N-Class). (Applies to the next item, too.)

- SCSI adapters:
Are your mirrored disks split over two separate controllers? They should be.
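A quick way to check this (a rough sketch; vg00/lvol3 is just an example, pick any mirrored volume):

    # show which physical volumes hold the extents (and mirror copies) of a logical volume
    lvdisplay -v /dev/vg00/lvol3 | grep /dev/dsk
    # map the device files to hardware paths to confirm the two halves of the
    # mirror really sit behind different controllers
    ioscan -funC disk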

- Filesystem corruption:
fsck /stand every few months - it's an unhappy event to notice corruption there only after patching.
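A minimal sketch of such a check, assuming the common layout where /stand lives on vg00's lvol1 (confirm your own layout with bdf first):

    bdf /stand                     # confirm which logical volume holds /stand
    umount /stand                  # /stand is only needed at boot / kernel-build time
    fsck -F hfs /dev/vg00/rlvol1   # full check against the raw device
    mount /stand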

- NFS non-bg mounts:
Any request to a failed server will lock the clients
(e.g. the NFS server's process table is full, so rpcbind is still up and the server appears to be running fine, but a client won't get any data out of it).
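For illustration, a hedged /etc/fstab entry (server name and paths are made up); the bg option makes a failed mount retry in the background instead of hanging the client at boot:

    # device                mount point  type  options            backup  pass
    nfsserver:/export/data  /data        nfs   bg,intr,timeo=600  0       0

Note that bg only helps at mount time; adding soft as well is a trade-off, since soft mounts can hand I/O errors back to applications.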

- LVM mirror / sdisk driver:
The following happened:
an FC array locked up with outstanding I/O from an application while one disk in vg00 was failing (motor problem; it kept dropping out of vg00 and coming back for hours).
Result (both after a reboot -h and after resolving the FC issue):
one failed disk for the bin and a corrupted mirror disk.
(Fun!)

- SCSI connection to the disk array?
Use PV links and dual-active controllers - btw: a PSU failure in the disk array can be trouble; what happens to your applications if the disks are gone?
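As a rough illustration of PV links (alternate paths) under HP-UX LVM, with made-up device files - the second device file reaches the same LUN through the other controller, and LVM fails over to it if the primary path dies:

    vgextend /dev/vg01 /dev/dsk/c8t0d1   # add the alternate path to a LUN already in vg01
    vgdisplay -v /dev/vg01               # the extra path shows up as an alternate link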

- LAN connection without APA

Organisational stuff that becomes a SPOF when something is down:

- No easy-to-find tape drive and tapes at hand.
If someone feels he'd better do a backup, he should be able to do it.

- Lack of authority and support.
The person in front of the system must be allowed to call for help and must have the numbers. If he can't reach anyone, he should be looking right at a sheet of paper with a few things on it:
the people responsible for the system
(not in an organizational sense, like who was billed for it, but the guy who knows what applications are on it);
there should always be at least two, and they *MUST* be aware that they are responsible for their decisions, otherwise they'll just put it off till the next day
the HP support handle
the HP telephone number
the location of the system documentation
a rough description of what he is authorized to do and what HP is authorized to do
(e.g. whether your org takes the risk of hot-plugging a SCSI tape for an emergency backup; otherwise HP won't do it)


(e.g. a billing system is better off down than running with lost data)

Things that kill a system anyway and sometimes don't even result in a TOC:
CPU failures (usually the CPU gets deconfigured, but if it has IRQs allocated: down she goes)
I/O backplane errors (bad soldering etc.) - happened on a few N-Classes over time
bad, bad, bad disk blocks in the paging area
(mediainit is your friend sometimes)
loose RAM sockets
(only saw that one in A180s)
faulty SCSI cabling (just don't have it)

Most important:
get enough authority to test problems and... downtime; maybe a whole weekend, maybe plan the test with a consultant from HP.

Do full backups and test them on a second system.
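A hedged sketch of that, assuming a local tape drive at /dev/rmt/0m (your device file and include list will differ):

    fbackup -f /dev/rmt/0m -i / -v             # full backup of the root graph to tape
    # then, on the second system:
    frecover -f /dev/rmt/0m -x -i /etc/passwd  # pull back a known file to prove the tape is readable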

Remove a power supply, pull out the whole vg00, fsck up the FC zoning to external storage.
Document what breaks everything, and what doesn't.
Practice.
Have a backup system at hand, i.e. only use external storage and have a small fallback system.
For example, if your production host is a fully redundant thing with 8 CPUs, how about having a J6700 used for all the testing, with a few spare SCSI/FC adapters in case HP doesn't have a part available.
Give it two disks for vg00, always patch it to the same level, and if disaster strikes, connect it to the external storage that keeps the applications and data.


I better stop now. :)
yesterday I stood at the edge. Today I'm one step ahead.
Kent Ostby
Honored Contributor

Re: Single Points of Failure

Mark -- Staying on a single machine can also constitute a single point of failure.

If, for instance, you were on certain types of boxes and had a CPU failure, it could prevent that box from booting (even if it had various other CPUs).

This can be overcome with ServiceGuard or a similar clustering product.

Best regards,

Oz
"Well, actually, she is a rocket scientist" -- Steve Martin in "Roxanne"
Steven E. Protter
Exalted Contributor

Re: Single Points of Failure

Within a server:
Power supply (older D-, K- and N-Class)
CPU in a single-CPU system

System board.

SCSI cable.

Boot disk in a non-mirrored system.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
TwoProc
Honored Contributor

Re: Single Points of Failure

Mark,
OK, a simple one - you got the extra disks and the MirrorDisk software, but did you get another controller card for the mirrored disks? I've seen many folks miss this step. When root is properly mirrored, the mirroring includes a separate path to the disk: card, cables, etc.
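For what it's worth, a condensed sketch of the classic PA-RISC vg00 mirroring steps, assuming the second disk (c2t6d0 here, a made-up address) hangs off the other controller; the real procedure repeats the lvextend for every logical volume in vg00:

    pvcreate -B /dev/rdsk/c2t6d0                    # make the second disk a bootable PV
    vgextend /dev/vg00 /dev/dsk/c2t6d0              # add it to vg00
    mkboot /dev/rdsk/c2t6d0                         # put the boot area on it
    mkboot -a "hpux -lq" /dev/rdsk/c2t6d0           # allow booting without quorum
    lvextend -m 1 /dev/vg00/lvol1 /dev/dsk/c2t6d0   # mirror /stand, then lvol2, lvol3, ...
    setboot -a <hardware path of c2t6d0>            # register the alternate boot path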

What about network cards? Got spares? I like to use HP's port aggregation software to handle this issue. It's cheap, yields greater throughput (if the demand is high enough), and supplies automatic failover.
We are the people our parents warned us about --Jimmy Buffett
Geoff Wild
Honored Contributor

Re: Single Points of Failure

On top of all of the above - don't forget the application!

i.e. use clustering software like MC/SG.


Also - have a redundant network: one LAN card to one switch and another LAN card to another switch.

Rgds...Geoff
Proverbs 3:5,6 Trust in the Lord with all your heart and lean not on your own understanding; in all your ways acknowledge him, and he will make all your paths straight.
Jan van den Ende
Honored Contributor

Re: Single Points of Failure

One out of our history:

Background:

A good fire alarm should leave ON the (light-fused) lighting mains, because any light there still is helps the fire brigade. Should they hose down that mains, the fuse simply blows.
But the heavy-fused 380V machine mains would be deadly.
So a good fire alarm should disconnect that.
What we did not know: in the original building design they had decided that this even had to be true in the computer room, yes, INSIDE the no-break area.

What happened:

In preparing for "Y2K" the building maintenance department decided they needed a test of the fire alarm system too.
Yes, they DID inform the fire brigade that an alarm WOULD be coming, and that it would be a test alarm.

Nobody ever thought of consulting the IT department...

We lost one complete site.
Luckily, the most vital apps were running on a multi-site cluster, and after 10 seconds of State Transition those apps continued.

AS400, Tru64, and WNT can not (/ could not at the time) be clustered over 7 km, and for those apps it really DID show the effects of a SPOF!!

Moral:

However well you think you are prepared for everything, reality will always hold some surprise!

Proost.

Have one on me.

Jan
Don't rust yours pelled jacker to fine doll missed aches.
Paul Cross_1
Respected Contributor

Re: Single Points of Failure

The previous poster mentioned the most obvious, I think. Any non-clustered environment IS a single point of failure.

However, the only thing I can add that hasn't been mentioned is that, in addition to mirroring root, you should perform a weekly near-online copy of your boot disk with dd.
A mirror is great for availability, but when one half gets corrupted, so does the other.
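A rough sketch of what that might look like, with made-up device files (the target disk must be at least as large as the boot disk, and the copy is best taken at a quiet time so the image is consistent):

    # whole-disk copy of the boot disk to a third, otherwise unused disk
    dd if=/dev/rdsk/c0t6d0 of=/dev/rdsk/c4t6d0 bs=1024k

    # or run it weekly from root's crontab, e.g. Sundays at 02:00
    0 2 * * 0 dd if=/dev/rdsk/c0t6d0 of=/dev/rdsk/c4t6d0 bs=1024k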

Managers tend not to like this idea because it requires 3 disks (primary, mirror, and root copy), which can get expensive. This also solves the human-error problem mentioned above.

This might seem a little paranoid, but it has saved much downtime for me in the past...
Florian Heigl (new acc)
Honored Contributor

Re: Single Points of Failure

That reminded me of one more thing:
there should be a classic 'boss' in the department who is in on almost anything major.

We had two old Clariion 4500 arrays; one was already emptied out, the other still running. Someone was told by someone to switch off 'the Clariion' and, to be sure, asked someone different, 'both of them?'. That someone had been tasked with shipping them out once they were no longer used and said 'sure, yes'. So someone turned both of them off, not even noticing the heavy access on one of them.
I guess you're noticing how many 'someones' that is.
Mindless delegation is a huge SPOF.
yesterday I stood at the edge. Today I'm one step ahead.
A. Clay Stephenson
Acclaimed Contributor

Re: Single Points of Failure

One that has not been mentioned yet is having only a single HVAC system. I don't even trust those systems which house redundant units within the same enclosure. I always go for N+1 separate HVAC units, and these should also be able to run automatically on standby power.

I also don't trust the large UPSs, but prefer that each cabinet have its own UPSs, which are in turn supplied with standby power. For components with N+1 power supplies, the power can be drawn from adjacent cabinets as well as the local cabinet. Surprisingly, I also never configure any of the automatic shutdown scripts for loss of power. I suspect that human/software errors in the shutdown monitors greatly exceed the risk that the generator will not start, assuming the generator tests itself weekly.

Another power-related SPOF is the "Panic Button". It's rather a challenge to locate and protect this switch so that one push shuts down all power (including cabinet-mounted UPSs). It should be easy enough for anyone to use, but hard enough to prevent accidental use.

If it ain't broke, I can fix that.
MarkSyder
Honored Contributor

Re: Single Points of Failure

Thanks everyone.

I think I've got enough now, but I'll leave the thread open.

Mark
The triumph of evil requires only that good men do nothing
Thierry Poels_1
Honored Contributor

Re: Single Points of Failure

hi,

something nobody mentioned: COOLING!
If your air conditioning drops dead, the computer room temperature can go sky-high in no time. Maybe you can restart it without server downtime, but it can cause issues with your servers in the long run.

And how about personnel: is every function covered by more than one person?

regards,
Thierry.
All unix flavours are exactly the same . . . . . . . . . . for end users anyway.
Steve Lewis
Honored Contributor

Re: Single Points of Failure

Along the same lines as the above:

humidifier / dehumidifier

Depending on your geographic location, you may be able to carry on for a few hours if one of them breaks.

This brings me on to another SPOF: your spare-parts supplier. I have had situations where spare parts were not in the country and had to be shipped in from the USA and Germany at considerable delay. So it helps to maintain relationships with a couple of resellers.



Marlou Everson
Trusted Contributor

Re: Single Points of Failure

Mark,

Make sure that the generator tanks are checked. We had a power failure and switched to the generator and then it ran out of diesel after about 10 minutes. I guess this is another procedural, human-error event.

We also had someone push the big red button that disconnects the power to the computer room. The button has a cover over it now.

We have 2 UPS systems. The alarm went off on one and a new employee (from a different department) did not know how to stop it so he turned the UPS off. The UPS room is now only accessible by a card key that we control.

Marlou