HPE 9000 and HPE e3000 Servers

Faulty processor on a 2 CPU rp2405

 
Gary Cooper_1
Esteemed Contributor

Our 2 CPU rp2405 is our main production server. It runs 24x365 and on the 1st of January, it mysteriously rebooted itself.

I spoke to HP and they said that one of the processors was faulty and wanted to arrange its replacement. As the server runs 24x7, it's pretty difficult to take it down for that sort of maintenance.

We've got a shutdown scheduled for May, but we'd like to get a feel for whether the problem is likely to occur again before then.

Are there any (online) diagnostics that I could run that would indicate whether we are likely to have a similar episode in the near future?

I've got Online Diagnostics installed (Dec 2006) and have had a look at xstm. For the processors, it allows me to 'Exercise' them (they were both OK), but the 'Diagnose' option is greyed out. When I double-click one of the CPUs it says that the diagnostic tool isn't installed, but I installed the Dec 2006 Online Diagnostics this morning!

So, my questions are:
1) How thorough a test is the 'Exercise' option on the CPU item?
2) What do I need to do to be able to run 'Diagnose' on the CPU?
3) How thorough a test would 'Diagnose' be?
4) I found a document entitled "Dynamic Processor Deallocation" (http://docs.hp.com/en/diag/dynamic.pdf), which implies that I can deallocate one of my processors so that it can't cause the system to panic and crash. The dynamic deallocation doesn't seem to have happened. How can I deallocate a processor manually? (It sounds like I can't if it's the monarch CPU, i.e. the processor upon which the HP-UX kernel is running.)
How do I tell which is the monarch CPU?

Urgent help would be greatly appreciated.

Thanks,

Gary
4 REPLIES
Gary Cooper_1
Esteemed Contributor

Re: Faulty processor on a 2 CPU rp2405

I'm sure there must be some bright sparks (or Olympians) out there that can answer this one...

Thanks in advance for your shared wisdom.

Gary
Patrick Wallek
Honored Contributor

Re: Faulty processor on a 2 CPU rp2405

Are both CPUs still active, or just one of the two?

If both are still active, then there is always a chance that something could trigger the error again. If only one processor is active, then the one that failed will definitely NOT cause you a problem again. However, if the other CPU fails for some reason then you are really in trouble since you now have zero active CPUs.

The only way I know of to "deallocate" a processor is via the BCH (Boot Console Handler) menu, which requires the server to be rebooted.
Andrew Rutter
Honored Contributor

Re: Faulty processor on a 2 CPU rp2405

Gary,

If the server is still running OK, and the exercise tests run OK, then it could be that the CPUs are actually fine. The reboot could have been caused by a hung process or by some I/O problem.

The exercise tests are quite good, although they run for only about 10 minutes and simply verify that the CPU is working. They do not reproduce the load of a server with many processes running.

The Diagnose option and other CPU tests, along with many other tests within STM, are password-protected and can only be run by HP engineers, unless you can obtain the password.

I would get HP in and get them to run tests on the server.

The only other diagnostic-type software is EMS (Event Monitoring Service), part of the STM diagnostics, which can be configured to mail/page when a problem is occurring.
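
If EMS is set up to write events to a text log as well as mailing them, a small script can pull out the CPU-related entries for review. This is only a minimal sketch: the keyword list is a guess (LPMC and HPMC are the PA-RISC low- and high-priority machine checks), the function name is made up for illustration, and the log location depends entirely on how your EMS monitors are configured.

```python
import re
from typing import Iterable, List, Sequence

# Hypothetical keyword list: adjust to whatever your EMS monitors actually log.
CPU_KEYWORDS = ("processor", "cpu", "lpmc", "hpmc")


def find_cpu_events(lines: Iterable[str],
                    keywords: Sequence[str] = CPU_KEYWORDS) -> List[str]:
    """Return the lines that mention any keyword, case-insensitively.

    `lines` can be an open file object or any iterable of strings.
    """
    pattern = re.compile("|".join(re.escape(k) for k in keywords),
                         re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]
```

To use it, open whichever log file your EMS configuration writes to and pass the file object to `find_cpu_events`. A growing number of hits between now and May would be a reason to bring the maintenance forward.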

It may be worth considering a move to a ServiceGuard configuration if it's this critical, to ensure you keep running. Waiting from January to May is a long while, with potential problems hanging over you.

Andy
Bill Hassell
Honored Contributor

Re: Faulty processor on a 2 CPU rp2405

Unlike in the sci-fi movie 2001: A Space Odyssey, predicting a CPU failure is virtually impossible. The reason is that any test that exercises the CPU enough to provoke the error will itself crash and reboot the machine. And you can't deallocate the processor on this system online, only by rebooting and interacting with the processor ROMs.

The rp2405 is a great computer but if you are running 24x7 with virtually no down-time allowed, you have the wrong configuration. This processor failure may have been due to a single CPU cache memory access that failed or any of thousands of intermittent failures. When and if a CPU goes bad, your system will be down for a long time (hours, maybe days). A hardware failure will prevent the system from booting up at all so you have to call HP, wait for the service engineer to arrive and repair the unit, then hope that the failure did not corrupt your data on the disk and possibly require a restore or even a reinstall.

For such a critical system, you need a second rp2405, a shareable disk storage cabinet, and MC/ServiceGuard software. With this configuration, a processor (or memory, LAN, etc.) failure will transfer the applications to the backup system. Repair of the failing system can then take place without interrupting the applications. Additionally, you can patch one system while the other one is running.
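
To put rough numbers on that argument, here is a back-of-the-envelope sketch. It assumes the two nodes fail independently, and it ignores both failover time and the shared disk cabinet as a single point of failure, so treat the result as an optimistic upper bound. The function names and the MTBF/MTTR figures in the usage note are hypothetical, not measurements from any rp2405.

```python
def single_node_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time one server is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def failover_pair_availability(node_availability: float) -> float:
    """Availability of two nodes where either one can run the applications.

    The service is down only when both nodes are down at the same time,
    which (assuming independent failures) has probability (1 - a) ** 2.
    """
    return 1.0 - (1.0 - node_availability) ** 2
```

For example, a node that fails on average every 990 hours and takes 10 hours to repair is up 99% of the time; under the assumptions above, a pair of such nodes would be up about 99.99% of the time.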

I would not worry about this one reboot but instead concentrate on what is really required for 24x7 operations.


Bill Hassell, sysadmin