
rp8400 PCI power supply problems

 
Ryan_11
Advisor


Hi,
We have an rp8400 where both PCI power supplies failed at different times. HP replaced them, citing a service note about a defective diode in certain supplies that has since been fixed in an upgraded revision.

My question is: How come these two PCI supplies are not redundant? Why do they exist at all? When one of these fails one half of the chassis dies resulting in a dead system.

We have 5 bulk power supplies with one partition running over three cells.

When a PCI power supply fails the whole partition fails - this to me is a serious design flaw.
The PCI supplies are a single point of failure negating the redundancy of the multiple bulk power supplies?

When the failure occurred, I moved my DB back onto my old rp7410, and now I am absolutely terrified of moving my production DB back onto the rp8400 because of this issue.

Any thoughts?

Thanks in advance
Ryan
Iain Ashley
Trusted Contributor

Re: rp8400 PCI power supply problems

Hi Ryan,
The PCI power supplies are an acknowledged single point of failure for a partition with a single cardcage, NOT for the complex. There need to be two so that each PCI backplane is on a separate power domain. This allows one cardcage to be powered off and service performed on one partition while the other keeps running, oblivious. The bulk power supplies, however, power the entire complex, including the PCI power supplies, so they do provide redundancy.

The issue with the old revision of PCI power supplies has been resolved, and the rp7410 is based on the same cell technology, with the same PCI power supplies, so I can't see the advantage of moving your database. If you had a non-cell-based system such as an rp7400, you would be in the same boat, as a PCI failure would also leave you with a downed system.

If you are using three cells in a single partition, you should have cells 0 and 1 connected to separate cardcages. To get redundancy you will need two MPs, duplicate interfaces across the cages, and software to provide failover: for example, LVM or VxVM, with or without AutoPath, for filesystems, and Auto Port Aggregation software for LAN. If you do this, the system should continue without a problem.

Another HA solution would be to split your database and use Serviceguard to provide failover to your rp7410, or add another cell to the rp8400 and cluster across two partitions. This would allow you to perform upgrades or service with a minimal outage.

Regards
Iain Ashley
Ryan_11
Advisor

Re: rp8400 PCI power supply problems

Thanks Iain,

My apologies, I said my old machine was a rp7410, it is actually a rp7400.

Anyway, on the rp8400, we have 2 FC network cards (1 in each PCI cardcage) and 4 FC host adaptors (2 in each cardcage) that connect directly to an XP128. Auto and alternate pathing is enabled for all the VGs on this array. There is one partition on the system.
How can I check which cardcage cell 0,1 and 2 are connected to?

We only have one MP, however. Would this be why the machine reboots when power to one of the PCI power supplies is lost?

The rp7400 is destined for our Disaster recovery site, so I can't use it too much longer.

Thanks
Ryan
Iain Ashley
Trusted Contributor
Solution

Re: rp8400 PCI power supply problems

Hi Ryan

Cell 0 and cell 1 are connected to I/O chassis 0 and 1 respectively (unlike the Superdome, where you can move the RIO cables more or less at will). You can map your I/O back with the "rad" command:

rad -N

This will output in format.

If you look at the core I/O, you will see that although there are separate buses for root disks, they share an LBA. Now, while the LBA for the core I/O is located on the system backplane, it still interfaces with the SBA located on the corresponding I/O backplane. Thus if you lose that PCI PS you will also lose core I/O, and the system will crash and not come back up ... unless you have a second cell with a core I/O.

With a second core I/O (MP) you can mirror your root volume across, and the system will reboot as long as your alt paths are set correctly. At this point, the partition will still HPMC if you lose the primary core I/O. Full support for dual core I/O may be available in a future firmware release. It is already supported on the Superdome.

As I mentioned previously though, the issue with the PCI PS has been resolved with the new version, and PSUs generally have a very low failure rate.

So to get the best redundancy from your system, I would suggest a second MP and mirroring across. For your other storage, if you are using AutoPath, you can check your failover paths with the command:

autopath display

This will tell you what is going on with your storage. If you are not using AutoPath, you will have to manually ensure that all LUNs are available on HBAs in both card cages. Because you have four HBAs, perhaps for ease of admin, map half the LUNs to each HBA in one card cage and then map the same LUNs to its counterpart in the second card cage. This way you have only two paths to manage per LUN.
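
As a rough illustration of the manual check described above, here is a minimal Python sketch that flags any LUN lacking a path through one of the two card cages. The LUN names, HBA names and the (lun, hba, cage) inventory format are all hypothetical; on a real system you would build this list from your own path listing rather than from any particular tool's output.

```python
from collections import defaultdict

def luns_missing_a_cage(paths, cages=(0, 1)):
    """Return {lun: [cages with no path]} for LUNs not reachable via every cage.

    `paths` is an iterable of (lun, hba, cage) tuples (hypothetical format).
    """
    seen = defaultdict(set)
    for lun, _hba, cage in paths:
        seen[lun].add(cage)
    missing = {}
    for lun, have in seen.items():
        lacking = sorted(set(cages) - have)
        if lacking:
            missing[lun] = lacking
    return missing

# Hypothetical inventory: two HBAs per cage, half the LUNs on each pair.
example_paths = [
    ("lun0", "hba_a", 0), ("lun0", "hba_c", 1),
    ("lun1", "hba_b", 0), ("lun1", "hba_d", 1),
    ("lun2", "hba_a", 0), ("lun2", "hba_b", 0),  # both paths in cage 0
]

print(luns_missing_a_cage(example_paths))  # {'lun2': [1]}
```

In this made-up inventory, lun2 would go offline if card cage 0 lost power, even though it has two paths.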

For network, if you are not using it already, there is the Auto Port Aggregation software, which can provide multiple IPs, failover, load balancing and trunking.

Regards
Iain Ashley
Ryan_11
Advisor

Re: rp8400 PCI power supply problems

Thanks for your time Iain,

In your last post you say:
"With a second core io (MP) you can mirror your root volume across, and the system will reboot as long as your
alt paths are set correctly. At this point, the partition will still HPMC if you lose the primary core io. Full support for dual core io may be available in a future firmware release. It is already supported on the superdome."

Now I appreciate that the PCI power problem has been resolved and I am quite happy that we should not experience another failure - but my management are a bit harder to convince :)

If I understand correctly - even if we have a second MP (core io), and a PCI PSU fails, the system will still reboot?

I can already boot off my alt boot path, which has been mirrored (even if one of the PCI power supplies is powered off). All my data disks have one primary and 3 alt paths.
I am unfamiliar with the autopath command and cannot find it on my system.

If the system will still reboot after a failure, what advantage will the second MP give? Looks like the superdome is a better bet.

The problem I experienced with the Auto Port Aggregation software was that, because it uses a combined software MAC address, the machine could no longer talk through the firewall, which checks the IP/MAC combination. (The machine interfaces over an IPSEC line with a webserver to serve customer enquiries on our DB.)

Regards
Ryan

Iain Ashley
Trusted Contributor

Re: rp8400 PCI power supply problems

Hi Ryan,

It's unfortunate, but with any catastrophic hardware failure, Unix is likely to HPMC. This is to follow the number one rule of error handling: "Maintain data integrity". An HPMC allows the system to come back up with a known set of hardware again. It is up to the system architect/admin to make sure that the system is capable of doing so in as many situations as possible.

I am afraid I have misled you slightly, so I will try to put things straight. An rp8400 does not support dual active core I/Os. This means that if there are two MPs in a partition, only one will be active as the master on the core cell. If this MP fails, the system will come back up with the second MP as master, and its cell will become core. The Superdome does support dual active core I/O. However, this does not mean that it will not HPMC on loss of an HIOB (in all likelihood it would), but the core I/O is dual active, allowing root disks to remain on the core I/O interfaces.

What you are looking for is a fully fault-tolerant system, and you will not get this from any non-clustered environment on the market. You would need to go to something like a Tandem for that. With some careful planning, though, you can provide yourself with a fault-tolerant solution without the expense of a Tandem. Not knowing your environment, I am not in a position to advise in detail on what you should do specifically, but I can provide some guidelines.

Fault tolerance requires the elimination of SPOFs. Every box in the world has at least one, so that means two boxes. You could then use Serviceguard to cluster those boxes. Because you have problems with your firewall and MAC/IP spoofing, you could use something like an Alteon application switch, which provides load balancing as well as transparent layer 2 through 7 switching. My suggestion would be to speak to your local HP rep and nut out a solution within your environment's constraints.
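
To make the "every box has at least one SPOF, so you need two boxes" argument concrete, here is a small Python sketch. It assumes a deliberately simplified model in which the service needs at least one working member from each group of interchangeable components; the group names and component names (including "second_node") are hypothetical and do not describe any real configuration.

```python
def single_points_of_failure(groups):
    """Return the names of groups that have fewer than two distinct members.

    `groups` maps a role (e.g. "server") to the list of interchangeable
    components that can fill it. A role with only one component is a SPOF.
    """
    return sorted(
        name for name, members in groups.items() if len(set(members)) < 2
    )

# Hypothetical single-box setup: I/O paths are duplicated, the box is not.
one_box = {
    "server": ["rp8400"],
    "lan path": ["nic_cage0", "nic_cage1"],
    "storage path": ["hba_cage0", "hba_cage1"],
}

# Same setup with a second clustered node added (hypothetical name).
two_boxes = dict(one_box, server=["rp8400", "second_node"])

print(single_points_of_failure(one_box))    # ['server']
print(single_points_of_failure(two_boxes))  # []
```

The model ignores shared dependencies between groups (power, site, firewall), which is exactly the sort of thing to go through with your HP rep.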

Autopath is a command that is part of the AutoPathXP software package. It requires an additional license.

Regards
Iain Ashley
Ryan_11
Advisor

Re: rp8400 PCI power supply problems

Thank You
I am in the process of discussing our options further with HP, and you have provided me with some good points to raise.

Regards
Ryan