Simon Hargrave
Honored Contributor

System Hanging

Hi, I have a K460 server, which currently has HP-UX 11.11 June 02 installed, and running Oracle 9i. It hasn't gone live yet, but periodically it just hangs. Console hangs, cannot connect to it or anything. The only thing to do is turn the key to restart it. What can I do to diagnose the problem?

There's no /var/adm/crash because it didn't crash as far as it's concerned, it's just as if the CPU hung and it couldn't dump core.

There are files in /var/tombstones, but I don't think they're valid. The last 5 in there are the same size, with the same contents. The timestamps of the files are the reboot times, but the timestamp within the file is always Oct 7, 1999. Does this mean that this server doesn't correctly support PIM? I've attached the ts99 file in case there's anything of any use there.
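
One way to confirm that the tombstone files really are byte-for-byte identical is to hash them. The following is only a minimal sketch, assuming Python is available on the box or that the files have been copied to a machine that has it. The function name `group_identical_files` is made up for illustration and is not part of any HP tool.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_identical_files(directory):
    """Return lists of file names in `directory` that share identical contents."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            # Hash the full contents so equal size alone is not mistaken for equality
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path.name)
    # Keep only digests that more than one file shares
    return [names for names in groups.values() if len(names) > 1]


if __name__ == "__main__":
    tombstones = Path("/var/tombstones")  # path taken from the post above
    if tombstones.is_dir():
        for names in group_identical_files(tombstones):
            print("identical contents:", ", ".join(names))
```

If every recent ts file lands in the same group, no new PIM data has been written since the oldest of them.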

Any ideas greatly appreciated, I'm obviously very nervous to go live on a server that just hangs periodically.

I am going to patch it to the latest bundles, but it can go several weeks between crashes so I'm looking for ideas in case the patching doesn't fix it.

Cheers


Sy

Oh, and I have noticed that fastboot etc is enabled, so I'm going to disable this and reboot see if any hardware errors are evident.
T G Manikandan
Honored Contributor

Re: System Hanging

Please check the /var/adm/syslog/syslog.log file for any errors.

Revert
Rajeev  Shukla
Honored Contributor

Re: System Hanging

And the other idea would be to do a TOC and make the crash dump happen when the system hangs.
The possibilities could be a CPU problem, an NFS problem (network problem), a memory problem and so on. So the next time the system hangs, turn the key to maintenance mode, go to the GSP prompt and do a TOC so that the system does a crash dump. Analyse the crash dump or ask HP to go through the crash dump.

Cheers
Rajeev
Ravi_8
Honored Contributor

Re: System Hanging

Hi,

Could you reseat the memory modules? It seems one module is not seated properly.
never give up
Stefan Farrelly
Honored Contributor

Re: System Hanging


What are the status codes on the display on the front of the server when it hangs? This is the fastest way to try to diagnose the problem. The slower way is to forcibly TOC it and send the dump to HP to analyse, but you can interpret from the status codes what the likely problem is.
I'm from Palmerston North, New Zealand, but somehow ended up in London...
Eugeny Brychkov
Honored Contributor

Re: System Hanging

"WARNING: Unsupported combination of memory carriers has been detected. Remove one memory carrier or install only memory carriers of the same type before booting": consult HP or the K-server user/service manual. Memory installation is not simple on Ks. You have to group modules by 4 and install them from largest to smallest, and then install the remaining modules (grouped by 2) in the same order: largest to smallest;
"WARNING: Link to HP-PB I/O Expansion Module failed at 8/12. Check cables and power to Expansion Module then reboot machine.... WARNING: Link to HP-PB I/O Expansion Module failed at 8/4. Check cables and power to Expansion Module then reboot machine": hardware problem;
etc, etc.
I have NOT seen such a BAD server before :o( But since the timestamp is 1999, maybe none of these are valid.
You should: restart server, and look which warnings it displays before loading IPL. Post them here.
Please note that if you have memory problems and you DISABLE fastboot, the server may fail to boot at all (it will not pass POST because of the discovered memory problems).
While HP-UX is running you can use STM (Support Tools Manager, stm) to see information on processors, memory, I/O controllers, disks, tapes, etc.
Please:
- attach the bootup warnings to your next reply (in PDC you can see these warnings by issuing the 'warning' command);
- attach the memory information log from stm;
- attach the processor PIM from stm
Eugeny
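
The memory ordering rule in the post above (groups of 4 first, largest to smallest, then the remaining groups of 2 in the same order) can be illustrated with a small sketch. This is not HP's procedure: it assumes every group consists of modules of the same size, which the post does not state, and the function name `plan_install_order` is hypothetical. The K-class service manual remains the authority for real installations.

```python
from collections import Counter


def plan_install_order(module_sizes_mb):
    """Return (groups in install order, leftover modules that could not be paired).

    Assumption (not from the thread): a group is made of same-size modules.
    """
    counts = Counter(module_sizes_mb)
    quads, pairs, leftover = [], [], []
    for size in sorted(counts, reverse=True):  # largest size first
        n = counts[size]
        quads.extend([(size,) * 4] * (n // 4))  # full groups of 4
        rest = n % 4
        pairs.extend([(size,) * 2] * (rest // 2))  # remaining groups of 2
        leftover.extend([size] * (rest % 2))  # single modules with no partner
    # Groups of 4 go in first, then groups of 2, each largest to smallest
    return quads + pairs, leftover


if __name__ == "__main__":
    order, leftover = plan_install_order([256] * 6 + [128] * 4 + [64] * 3)
    for group in order:
        print(group)
    print("unpaired:", leftover)
```

A non-empty leftover list means some module has no same-size partner and cannot be placed under this simplified reading of the rule.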
Simon Hargrave
Honored Contributor

Re: System Hanging

> I have NOT seen such a BAD server before :o( But since the timestamp is 1999, maybe none of these are valid.

I'm fairly comfortable that they're bogus errors from many many moons ago!

> You should: restart server, and look which warnings it displays before loading IPL. Post them here.

There are no warnings displayed at all.

I have now disabled FASTBOOT and enabled SELFTEST, but it still does not update the PIM etc. (the timestamp is still 1999). How can I force it to update this? I've noticed a CLEARPIM command; will this clear it and force a refresh? I've also tried booting up with the key in the "Service" position, thinking this may force an update, but it hasn't, and the old messages remain.
Simon Hargrave
Honored Contributor

Re: System Hanging

Actually, I just noticed that the PIM output is slightly different (a couple of diagnostic registers changed values). Presumably this is because either I booted with the "service" key position, or because I've changed fastboot and selftest. However, the timestamp is still exactly the same 1999 timestamp. I'm wondering if those WARNING messages are accumulated over time?

I'm going to do a CLEARPIM and reboot, see what it comes up with then.
harry d brown jr
Honored Contributor

Re: System Hanging

I'd use stm, like Eugeny has said, to "exercise" the machine.

live free or die
harry
Live Free or Die
Eugeny Brychkov
Honored Contributor

Re: System Hanging

PIM (processor internal memory) is updated only on a failure, hardware or software. It is updated with the CPUs' status at the time the failure occurred. After rebooting, the OS sees that the PIM data has changed and saves it to the /var/tombstones/ts99 file. All previous ts files are shifted: ts99->ts98, ts98->ts97, etc.
So if you're looking at the PIM output and do not see any valid timestamps (for any processor), then the PIM is not valid/old. You're right to clear the PIM memory: the whole output will then be zeros, so if anything happens you'll identify it by seeing non-zero values.
First of all, update the diagnostic software (from the support CD or available from www.hp.com), then the GR patch bundle and then the HWE patch bundle.
As was already proposed, when the system hangs, note the codes displayed on the server's LCD. Then do a TOC (the button is located on the Core I/O) and send the dump to HP.
Eugeny
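
The tombstone rotation described in the post above can be pictured with a short simulation. It is only an illustration of the naming scheme as stated in the thread (new data goes to ts99, older files shift down by one). It does not touch the filesystem, and `rotate_tombstones` is a made-up name. How the oldest file is handled, and how names below ts10 are written, are not stated in the thread, so both are assumptions here.

```python
def rotate_tombstones(files, new_pim, newest=99):
    """Return a new name -> contents mapping after one rotation.

    `files` maps names such as "ts99" to their contents. Every existing tsN
    becomes ts(N-1), and the new PIM data is stored as ts<newest>.
    Assumption: ts0 has nowhere to shift to, so this sketch drops it.
    Assumption: names below ts10 carry no zero padding.
    """
    rotated = {}
    for name, contents in files.items():
        index = int(name[2:])  # "ts98" -> 98
        if index > 0:
            rotated[f"ts{index - 1}"] = contents
    rotated[f"ts{newest}"] = new_pim
    return rotated


if __name__ == "__main__":
    before = {"ts99": "PIM from the previous failure", "ts98": "older PIM"}
    after = rotate_tombstones(before, "PIM from the latest failure")
    for name in sorted(after, reverse=True):
        print(name, "->", after[name])
```

Under this scheme, several consecutive ts files with identical contents suggest that the same stale PIM data was saved again on each reboot.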
Jim Butler
Valued Contributor

Re: System Hanging

Sy

Are you running any kind of root shells or tools such as sudo on the machine?

If so, verify that they are all built on your current box - don't bring binaries over.

Also, was your oracle installed new, or was it a backup/restore job? If the latter - then use the oracle installer to perform a re-link.

good luck
Man The Bilge Pumps!
Anthony deRito
Respected Contributor

Re: System Hanging

Dump core with a TOC. Make sure you have enough space to hold full memory contents. Have HP help analyze what is going on when the system hangs or use Q4 to diagnose yourself.
Simon Hargrave
Honored Contributor

Re: System Hanging

Okay, I'm in the process of updating the diagnostics software and patching the server, and then I'll wait for it to crash and do a TOC. However, I've already done a CLEARPIM and indeed all flags have cleared, yet all the WARNING messages against the CPUs are still there. I'm fairly sure that they can't be current messages (e.g. there are messages that say FASTBOOT is enabled and selftests are disabled, which is no longer true because I've disabled FASTBOOT and enabled SELFTEST!)

So, where are these WARNING messages stored, and how do I clear them?

Cheers Sy
Jim Butler
Valued Contributor

Re: System Hanging

go into gsp -
Ctrl-B from the console will get you in there.

Looking at the logs will clear it.

type HELP - for the menu.

be a little careful in there if it is your first time.
Man The Bilge Pumps!
Jim Butler
Valued Contributor

Re: System Hanging

I'm sorry - just looked and noticed you are on a k460

You may have to go into STM (or cstm if you don't like the GUI) and initialize your logs there.
Man The Bilge Pumps!
Anthony deRito
Respected Contributor

Re: System Hanging

Ctrl B then CL to clear logs if I remember correctly. For the T500 I think you have to go into SP mode before running CL but I don't think this is the case with K box.
Simon Hargrave
Honored Contributor

Re: System Hanging

I'm not seeing a CL command. Are you sure this should be available on a K class? Perhaps this is a GSP command on N, L class etc.? Also, I can't see where to access these errors from stm. I've highlighted the CPU and looked through all the associated logs, but can't find any of the warnings that are mentioned in the tombstone.