Single Points of Failure

01-20-2005 08:20 PM
Following my proposed bid for extra discs and MirrorDisk, I've been asked what other single points of failure there are in the system. I came up with two:
1. PSU on the server
2. PSU on the disc array.
I made the point that both of these are quicker and easier to rectify than a disc crash. I was asked to check if there are any more. Points for any point of failure that I haven't thought of.
Mark Syder (like the drink but spelt different)
Solved! Go to Solution.
01-20-2005 08:33 PM
Solution
Starting from the most extreme:
1. Grid failure
2. Server room failure
3. Rack cable failure
4. PSU server
5. CPU failure
On the disk and connection side:
1. PSU disk
2. I/O card failure
3. Cable failure
If on a SAN:
1. Network card failure
2. Comms failure
The rule we use is to eliminate at least the top 80% of failures, i.e. standby PSU and UPS, dual cards and connections.
And hope nothing ever happens ;-)
01-20-2005 08:34 PM
Re: Single Points of Failure
NIC.
Do you have a spare console/keyboard/mouse, or a UPS? You can always rlogin to the server, but if you need to use single-user mode a spare console is useful (you can always borrow one from another server).
01-20-2005 08:53 PM
Re: Single Points of Failure
From my experience:
- disk failures (the most common);
- admin or user error (oops);
- failures due to overheating; normally disks are the first to fail.
Eric Antunes
01-20-2005 08:55 PM
Re: Single Points of Failure
So trivial that it usually tends to be forgotten, but it is the most dangerous, and probably the most frequently occurring:
human error.
It can be subdivided into:
- ignorance
- clumsiness (from accidentally hitting a power switch to mistyping commands while privileged)
- sabotage (i.e. willingly executing any detrimental activity, for whatever reason)
The chances thereof can be diminished by restricting physical access, restricting privileges, limiting the scope of dangerous commands...
but in the end it comes down to safe procedures, and plain & simple trust.
Proost.
Have one on me.
01-20-2005 08:58 PM
Re: Single Points of Failure
Grid failure: as in National Grid?
Rack cable failure: does this mean a cable in the server cabinet?
Please excuse the absence of points at this stage - I'm concerned that people will stop reading the thread if it has a bunny on it too soon. Points to be awarded when I think the subject has been adequately covered.
Mark
01-20-2005 09:11 PM
Re: Single Points of Failure
For example, the Superdome has N+1 fans, power supplies, etc. to rule out SPOFs, but it still has three: the backplane, the UGUY, and a third board that is completely passive, but if it WERE to fail.....
Older systems did not have N+1 fans etc.
Human "error" is certainly a SPOF, as is the actual application software you use ;-]
01-20-2005 09:12 PM
Re: Single Points of Failure
Someone nearby putting a JCB bucket through a buried power cable, resulting in a local power failure. Advice: UPS and/or generator.
The fire alarm going off, resulting in Halon gas being dumped in the computer room; due to H&S rules no one was allowed into the computer room for 24 hours. There wasn't a fire: it was a £5 sensor that had malfunctioned.
Availability/contactability of qualified staff in the event of a problem. Advice: have clear procedures (and practise them) and contact details (check them periodically).
The power went off and the generator kicked in, but the delay caused the systems to all crash. Advice: use a UPS as well as a backup generator to cover brown-outs.
The power went off; the computers were OK because they were on UPS, but the lights weren't. It's quite difficult operating in a large server room in the semi-dark. Advice: don't underestimate the power of a torch!
An operator called out support staff in the middle of the night because the computer had frozen. What had actually happened was that someone had typed CTRL-S on the console; it was a long trip to simply type CTRL-Q. Advice: a little education goes a long way. (A small illustration follows below.)
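On that last point, a minimal illustration, assuming a standard HP-UX serial console with the usual device name (check your own setup): XON/XOFF flow control is what lets a stray CTRL-S freeze the display.

    # Show the current console settings; 'ixon' means CTRL-S/CTRL-Q
    # output flow control is active
    stty -a < /dev/console

    # Disable it so CTRL-S can no longer suspend console output (this
    # also disables deliberate use of CTRL-S/CTRL-Q paging)
    stty -ixon < /dev/console

Whether that trade-off is worth it is a judgment call; the cheaper fix is still a little education.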
01-20-2005 09:32 PM
Re: Single Points of Failure
You may also consider the backplane. But if you are that paranoid, you should consider a second server.
01-20-2005 09:55 PM
Re: Single Points of Failure
Yes, grid failure as in National Grid. We lost power ONCE to the whole building during a storm when lightning struck nearby. Since then we have standby generators on site!
The rack cable failure was caused by bad installation and a resulting short due to fraying.
Pete
01-20-2005 10:41 PM
Re: Single Points of Failure
Along the lines of what Peter was saying, it's good practice to distribute power inputs between source circuits wherever you have redundant power supplies. This helps guard against the outage of a single circuit taking down the system. Careful planning of the loading of each source circuit is also important.
Education of the staff is important here as well. When adding components to a system, empty outlets on the PDUs are like extra checks in the checkbook... "my account can't be overdrawn -- I still have checks!" ...don't overload the circuits.
Best Regards,
Dave
01-21-2005 12:37 AM
Re: Single Points of Failure
Power!!! As in:
- a 360V plug fiddled together with a modem cable; someone pulled it
- LAN switches not plugged into the UPS
- only the systems connected to the UPS (nothing else)
- having no generator (that's not so bad; just ensure you have time to shut down)
A reasonable setup, without wasting money, would have two PSUs, each connected to a surge protector followed by a UPS. The UPS can be shared by multiple systems (hence the surge protector). It's two UPSs because here in Germany the national grid has a much lower chance of failure than a mid-range UPS, and in the case of a sudden failure UPSs have been known to fail or blow up inside after a moment. The power cables must be secured at both ends to avoid accidental pull-outs; also check their fit manually.
- IO backplanes:
If you have two network adapters, FC adapters or the like, put them into different IO backplanes if the system supports it (N-Class or above). (This applies to the next item too.)
- SCSI adapters:
Are your mirrored disks split over two separate controllers? They should be.
- Filesystem corruption:
fsck /stand every few months; it's an unhappy event to notice corruption only after patching. (See the sketch at the end of this post.)
- NFS non-bg mounts:
Any request to a failed server will hang the clients (e.g. the NFS server has a full process table, so rpcbind is still up and apparently running fine, but a client won't get any data out of it).
- LVM mirror / sdisk driver:
The following happened: an FC array locked up with outstanding IO from an application while one disk in vg00 was failing (a motor problem; it dropped out of vg00 and came back, for hours). The result (both after reboot -h and after resolving the FC issue): one failed disk for the bin and a corrupted mirror disk. (Fun!)
- SCSI connection to the disk array:
Use PVlinks and dual-active controllers. By the way, a PSU failure in the disk array can be trouble: what happens to your applications if the disks are gone?
- LAN connection without APA
Organisational stuff that becomes a SPOF when something is down:
- No easy-to-find tape drive and tapes at hand.
If someone feels he'd better do a backup, he should be able to do it.
- Lack of authority and support.
The person in front of the system must be allowed to call for help and must have the numbers. If he can't reach anyone, he should be looking right at a sheet of paper listing:
- the persons responsible for the system (not in an organizational sense, like who was billed for it, but the people who know what applications are on it); there should always be at least two, and they *MUST* be aware they are responsible for their decisions, otherwise they'll just put it off till the next day
- the HP support handle
- the HP telephone number
- the location of the system documentation
- a rough description of what he is authorized to do and what HP is authorized to do (e.g. whether your org takes the risk of hot-plugging a SCSI tape for an emergency backup; otherwise HP won't do it) (e.g. a billing system is better off down than running with lost data)
Things that kill a system anyway and sometimes don't even result in a TOC:
- CPU failures (usually the CPU gets deconfigured, but if it has IRQs allocated: down she goes)
- IO backplane errors (bad soldering etc.); this happened on a few N-Classes over time
- bad, bad, bad disk blocks in the paging area (mediainit is your friend sometimes)
- loose RAM sockets (only saw that one in A180s)
- faulty SCSI cabling (just don't have it)
Most important:
- Get enough authority to test problems, and... downtime: maybe a whole weekend; maybe plan the test with a consultant from HP.
- Do full backups and test them on a second system.
- Remove a power supply; pull out the whole vg00; fsck up the FC zoning to the external storage. Document what breaks everything, and what doesn't.
- Practice.
- Have a backup system at hand, i.e. only use external storage, and keep a small fallback system. If your production host is a fully redundant thing with 8 CPUs, how about having a J6700 used for all the testing, with a few spare SCSI/FC adapters in case HP doesn't have a part available? Give it two disks for vg00, always patch it to the same level, and if disaster strikes, connect it to the external storage that holds the applications and data.
I better stop now. :)
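As promised above, a minimal sketch of a few of these checks, assuming HP-UX 11.x with the conventional vg00 layout (lvol1 = /stand on HFS); the device paths and the NFS server name are hypothetical:

    # Check the HFS boot filesystem (safest from single-user mode or
    # with /stand unmounted; never fsck a mounted, active filesystem)
    fsck -F hfs /dev/vg00/rlvol1

    # Verify that each vg00 logical volume is mirrored onto a second
    # physical volume; the two PVs should sit on different controllers
    lvdisplay -v /dev/vg00/lvol3 | grep /dev/dsk

    # In /etc/fstab, NFS mounts should use the bg option so a dead
    # server retries in the background instead of hanging the client:
    #   nfssrv:/export  /mnt/export  nfs  bg,nosuid  0  0

None of this removes the SPOFs by itself, but it catches the quiet misconfigurations before they matter.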
01-21-2005 12:59 AM
Re: Single Points of Failure
On certain types of boxes, for instance, a CPU failure could prevent the system from booting at all (even with various other CPUs present).
This can be overcome with ServiceGuard or a similar clustering product.
Best regards,
Oz
01-21-2005 01:23 AM
Re: Single Points of Failure
Power supply (older D, K and N-Class)
CPU in a single-CPU system
System board
SCSI cable
Boot disk in a non-mirrored system
SEP
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
01-21-2005 01:48 AM
Re: Single Points of Failure
OK, a simple one: you got the extra disks and the MirrorDisk software, but did you get another controller card for the mirrored disks? I've seen many folks miss this step. When root is properly mirrored, the mirroring includes a separate path to the disk: card, cables, etc.
What about network cards? Got spares? I like to use HP's Auto Port Aggregation software to handle this issue. It's cheap, yields greater throughput (if the demand is high enough), and supplies automatic failover.
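A minimal sketch of mirroring root onto a disk on a second controller, assuming HP-UX 11.x with MirrorDisk/UX; the device file and hardware path below are hypothetical, so substitute your own from ioscan -fnC disk:

    # Prepare a disk on the second controller as a bootable physical volume
    pvcreate -B /dev/rdsk/c2t6d0
    mkboot /dev/rdsk/c2t6d0

    # Add it to vg00 and mirror every root logical volume onto it
    vgextend /dev/vg00 /dev/dsk/c2t6d0
    for lv in /dev/vg00/lvol*
    do
        lvextend -m 1 $lv /dev/dsk/c2t6d0
    done

    # Rewrite the boot data reserved area and register the alternate path
    lvlnboot -R /dev/vg00
    setboot -a 0/2/0/0.6.0

With that in place, losing the primary controller (or the primary disk) still leaves a bootable path.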
01-21-2005 01:48 AM
Re: Single Points of Failure
i.e. use clustering software like MC/ServiceGuard.
Also, have a redundant network: one LAN card to one switch and another LAN card to another switch.
Rgds...Geoff
01-21-2005 02:06 AM
Re: Single Points of Failure
Background:
A good fire alarm should leave ON the (light-fused) lighting mains, because any light there still is helps the fire brigade; should they hose down that mains, the fuse simply blows. But the heavy-fused 380V machine mains would be deadly, so a good fire alarm should disconnect that.
What we did not know: in the original building design it was decided that this even had to be true in a computer room, yes, INSIDE the no-break area.
What happened:
In preparing for "Y2K", the building maintenance department decided they needed a test of the fire alarm system too. Yes, they DID inform the fire brigade that an alarm SHOULD be coming, and that it would be a test alarm. Nobody ever thought of consulting the IT department...
We lost one complete site.
Luckily, the most vital apps were running on a multi-site cluster, and after 10 seconds of State Transition those apps continued. AS400, Tru64 and WNT could not (at the time) be clustered over 7 km, and for those apps it really DID show the effects of a SPOF!!
Moral:
However well you think you are prepared for everything, reality will always hold some surprise!
Proost.
Have one on me.
Jan
01-21-2005 02:21 AM
Re: Single Points of Failure
The only thing I can add that hasn't been mentioned: in addition to mirroring root, perform a weekly near-online copy of your boot disk with dd.
A mirror is great for availability, but when one side gets corrupted, so does the other.
Managers tend not to like this idea because it requires three disks (primary, mirror, and root copy), which can get expensive. It also covers the human-error problem mentioned above.
This might seem a little paranoid, but it has saved me much downtime in the past... (a sketch follows below)
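A minimal sketch of that weekly copy, with hypothetical device names; since copying a live root can produce an inconsistent image, do it during a quiet window (or split the mirror first):

    # Raw whole-disk copy of the boot disk onto the spare; the LIF/boot
    # area comes across with it, so the copy is itself bootable
    dd if=/dev/rdsk/c0t6d0 of=/dev/rdsk/c4t6d0 bs=1024k

    # The copy carries the same LVM identifiers as the original, so keep
    # it detached (or on a powered-off bus) until you actually need it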
01-21-2005 02:28 AM
Re: Single Points of Failure
There should be a classic 'boss' in the department who is involved in almost anything major.
We had two old CLARiiON 4500 arrays; one was already emptied out, the other still running. Someone was told by someone to switch off 'the CLARiiON' and, to be sure, he asked someone different, 'both of them?'. That someone was tasked with shipping them out once they were no longer in use, and said 'sure, yes'. So someone turned both off, not even noticing the heavy accesses on one of them.
I guess you're noticing it's so many "someones".
Mindless delegation is a huge SPOF.
01-21-2005 02:34 AM
Re: Single Points of Failure
I also don't trust the large UPSs, but prefer that each cabinet have its own UPSs, which are in turn supplied with standby power. For components with N+1 power supplies, the power can be drawn from adjacent cabinets as well as the local cabinet. Surprisingly, I also never configure any of the automatic shutdown scripts for loss of power: I suspect that human/software errors in the shutdown monitors greatly exceed the risk that the generator will not start, assuming the generator tests itself weekly.
Another power-related SPOF is the "Panic Button". It's rather a challenge to locate and protect this switch so that one push shuts down all power (including cabinet-mounted UPSs). It should be easy enough for anyone to use, but hard enough to prevent accidental use.
01-23-2005 09:08 PM
Re: Single Points of Failure
I think I've got enough now, but I'll leave the thread open.
Mark
01-24-2005 12:14 AM
Re: Single Points of Failure
Something nobody mentioned: COOLING!
If your air conditioning drops dead, the computer room temperature can go sky-high in no time. Maybe you can restart it without server downtime, but it can cause issues with your servers in the long run.
And how about personnel: is every function covered by more than one person?
regards,
Thierry.
01-24-2005 01:07 AM
Re: Single Points of Failure
Humidifier / dehumidifier.
Depending on your geographic location, you may be able to carry on for a few hours if one of them breaks.
This brings me to another SPOF: your spare-parts supplier. I have had situations where spare parts were not in the country and had to be shipped in from the USA and Germany at considerable delay. So it helps to maintain relationships with a couple of resellers.
01-25-2005 07:26 AM
Re: Single Points of Failure
Make sure that the generator tanks are checked. We had a power failure and switched to the generator, and then it ran out of diesel after about 10 minutes. I guess this is another procedural, human-error event.
We also had someone push the big red button that disconnects the power to the computer room. The button has a cover over it now.
We have 2 UPS systems. The alarm went off on one and a new employee (from a different department) did not know how to stop it so he turned the UPS off. The UPS room is now only accessible by a card key that we control.
Marlou