
RP8240 2 Cell setup as 1 partition - SPOFs

 
MIDAS
Occasional Advisor


We have an RP8240 2-cell system with both cells set up as one partition.

We have been asked to identify single points of failure on the system.

It is attached to an EVA4000 - but we have that one covered (securepath, dual controllers, redundant paths etc).

It is also connected to a UPS, although there is only one of them, so that is one SPOF identified already.

The question is: if a processor or memory module were to fail on one of the cells, would the system continue to run (without a reboot or re-partitioning of the cells)?

We know the cells are connected to one system backplane - so potentially a SPOF here...
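As a rough illustration of the inventory described so far, the components can be tabulated with a redundancy count and anything with fewer than two instances flagged. This is only a sketch: the component names come from this post, but the counts of 2 for the storage side are an assumption standing in for "dual controllers" and "redundant paths", and `single_points_of_failure` is a hypothetical helper, not part of any HP tool.

```python
# Redundancy inventory based on this post. The counts for the storage side
# are assumptions (the post only says "dual controllers, redundant paths").
inventory = {
    "EVA4000 controllers": 2,
    "SAN paths (Secure Path)": 2,
    "UPS": 1,
    "system backplane": 1,
}


def single_points_of_failure(components):
    """Return the names of components that have no redundant partner."""
    return sorted(name for name, count in components.items() if count < 2)


if __name__ == "__main__":
    for name in single_points_of_failure(inventory):
        print("SPOF:", name)
```

Processors and memory are left out on purpose, since whether their failure takes the partition down is exactly the open question.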

Rgds
-Webteam
Torsten.
Acclaimed Contributor

Re: RP8240 2 Cell setup as 1 partition - SPOFs

Several failures (e.g. double-bit memory errors or a failed system/PCI backplane) can cause a system crash.

That's why there is the Serviceguard product for clustering applications.

see also

http://docs.hp.com/en/ha.html#Serviceguard

http://h71028.www7.hp.com/enterprise/cache/4174-0-0-0-121.html

Hope this helps!
Regards
Torsten.

__________________________________________________
There are only 10 types of people in the world -
those who understand binary, and those who don't.

__________________________________________________
No support by private messages. Please ask the forum!

MIDAS
Occasional Advisor

Re: RP8240 2 Cell setup as 1 partition - SPOFs

Thanks Torsten. We appreciate the suggestion of Serviceguard, but the system is running as one physical partition, so we can't cluster our apps. So are you saying that if one of the eight processors on the two boards (four on each), or one memory module in either cell, were to fail, the whole system would crash? Or is the rp8420 clever enough to isolate the damaged cell and limp along?
Joshua Scott
Honored Contributor

Re: RP8240 2 Cell setup as 1 partition - SPOFs

In the case of an unrecoverable memory fault caused by multiple failed bits on the memory cards themselves, there would of course be no way to recover, since the data would be corrupted.

The same goes for the processor: most processor failures are not in the processor itself but in the cache on the die. When you have a multi-bit fault in the cache, whatever data or instructions are stored there become corrupted, and the system cannot continue.

In both these cases the system does detect when this occurs, and issues a reset to protect your data from corruption.

Fortunately, these conditions are *extremely* rare on this system. Typically, the system will detect the first bit failure and deallocate that block of memory, which is what the Page Deallocation Table was designed to do. Likewise, when a single-bit error is detected in the cache on a processor, the system will attempt to deactivate the processor and continue. If the processor cannot be deactivated, the system will continue to send alerts until the processor is replaced.
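To make the single-bit versus multi-bit distinction concrete, here is a minimal sketch of a textbook SECDED code (extended Hamming, 4 data bits in an 8-bit word). It is not the actual ECC scheme used in this server's memory or cache, which is not stated in this thread; it only shows why one flipped bit can be corrected while two flipped bits can be detected but not repaired. The function names are made up for the illustration.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword.

    Index 0 holds the overall parity bit; indexes 1..7 are the usual
    Hamming positions, with parity bits at 1, 2 and 4.
    """
    code = [0] * 8
    data = [(nibble >> i) & 1 for i in range(4)]
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) & 1             # makes total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions holding a 1
    parity_error = sum(code) & 1            # 1 if an odd number of bits flipped
    if syndrome == 0 and not parity_error:
        status = "ok"
    elif parity_error:
        code[syndrome] ^= 1                 # single-bit error: flip it back
        status = "corrected"
    else:
        return "uncorrectable", None        # two bits flipped: detect only
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    one_bit = list(word)
    one_bit[5] ^= 1
    two_bits = list(one_bit)
    two_bits[6] ^= 1
    print(decode(word))
    print(decode(one_bit))
    print(decode(two_bits))
```

The "uncorrectable" branch corresponds to the case described above where the system has to reset rather than carry on with corrupted data.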

Josh
What are the chances...
MIDAS
Occasional Advisor

Re: RP8240 2 Cell setup as 1 partition - SPOFs

Thanks for the replies so far. Any ideas what would happen should an entire cell (i.e. the board) fail?
Joshua Scott
Honored Contributor

Re: RP8240 2 Cell setup as 1 partition - SPOFs

The entire cell board can't fail, but components on it could. The most common failures on the cell are:

Damaged memory board slot
Voltage regulator module (VRM) failure
Damaged CPU slot
Damaged connectors

If you are not making any changes, the only failures I've seen are the VRMs. VRMs are redundant, so if one fails, the cell can continue to operate, but if a failure is detected on boot, then the cell will halt.

Josh
What are the chances...