HPE 9000 and HPE e3000 Servers

Re: HP-UX 10.20 server crashed ... but why?

 
SOLVED
Rich Fink
Occasional Advisor

HP-UX 10.20 server crashed ... but why?

Hi all,

Weird problem here. Saturday night, one of our production servers (K460 running 10.20 - yeah, I know..) crashed. I was unable to log in remotely, even via the console. Just a blank screen.

Came in to the office after a 90 minute drive, and the server display showed:

INIT CBF7
TRAPS CPU0123

All I could find in the book on that error was "Entering PDC IO".

The system was hung, so I manually turned the key to 'standby', then back to 'service'. The system booted normally the first time, and has been running flawlessly since. (over 3.5 days now)

Thus far, I've been unable to determine the root cause. Syslog shows no entries for the preceding 17 hours. Dmesg shows nothing of interest, and nothing was written to /var/adm/crash. There is a ts99 file (attached), but unless I'm missing something, I don't see a reason for the crash, or failure to automatically reboot.

Any suggestions or ideas would be appreciated. Thanks.

-Rich
"UNIX is a user-friendly Operating System .. it's just picky about choosing its friends."
Bill Hassell
Honored Contributor

Re: HP-UX 10.20 server crashed ... but why?

Hi Rich,
You almost always need a crash dump to see what is happening. For 10.20, make sure you have the crash file configured in /etc/rc.config.d/savecrash:

SAVECRASH=1
SAVECRASH_DIR=/var/adm/crash

Note that unless you overwrite the dump area (often shared with swap), the crash dump may still be viable. Just run the command:

savecrash -z

and if it can find a clean crash dump, it will save it. Then you can run q4 to get a better idea of what happened.

For the K-class machines, this is an invaluable manual:

http://ftp.parisc-linux.org/docs/platforms/A2375-90004.pdf

In the back are all the chassis codes.

From all your ts99 HPMC chassis codes:

0x20b1 HPMC data cache parity fault in tag
0x5008 Processor Memory bus broad fault
0x5108 "
0x5208 "
0x5308 "
0x5408 "
0x5508 "
0x7d09 Single bit memory fault (HPMC)
0x7f14 1 = memory carrier number, 4 = SIMM pair slot number
0xcbf0 High Priority Machine Check occurred
0xcbfb Branching to the OS HPMC handler


So it looks like a fatal memory failure, carrier 1, slot 4 (both 4a and 4b)
Note that some memory problems can be logged, but if this occurred in part of the kernel space, a panic is the only choice. syslog can never tell you anything about a crash because the OS stops running. Only the panic code can do anything (like writing out the crash dump).
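
For anyone who wants to script the lookup, here is a minimal Python sketch of the decoding above. The descriptions are copied from the list in this post and are not a complete table; `describe` is a hypothetical helper name, and the `0x7fXY` rule (X = carrier, Y = SIMM pair slot) is an assumption generalized from the single `0x7f14` example, so check it against the manual before relying on it.

```python
# Descriptions copied from the list in this post; not a complete table.
CHASSIS_CODES = {
    0x20B1: "HPMC data cache parity fault in tag",
    0x7D09: "Single bit memory fault (HPMC)",
    0xCBF0: "High Priority Machine Check occurred",
    0xCBFB: "Branching to the OS HPMC handler",
}
# 0x5008 through 0x5508 share one description in the list above.
for code in range(0x5008, 0x5509, 0x100):
    CHASSIS_CODES[code] = "Processor Memory bus broad fault"


def describe(code):
    """Return a description for one chassis code (hypothetical helper)."""
    if code in CHASSIS_CODES:
        return CHASSIS_CODES[code]
    # Assumption: 0x7fXY encodes memory carrier X and SIMM pair slot Y,
    # generalized from the single 0x7f14 example in this post.
    if code >> 8 == 0x7F:
        carrier = (code >> 4) & 0xF
        slot = code & 0xF
        return "memory carrier %d, SIMM pair slot %d" % (carrier, slot)
    return "unknown (check the chassis code list in the manual)"


if __name__ == "__main__":
    for c in (0xCBF0, 0x20B1, 0x7D09, 0x7F14):
        print(hex(c), describe(c))
```

Anything not in the dictionary falls through to "unknown", which is deliberate: the manual remains the reference for codes not listed here.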

Since it is running now, use this command to look at the memory:

echo "selclass qualifier memory;info;wait;infolog" | cstm


Bill Hassell, sysadmin
Stefan Stechemesser
Honored Contributor
Solution

Re: HP-UX 10.20 server crashed ... but why?

Hi,

indeed this K460 had one or more single-bit errors on Memory Carrier 1, SIMM Pair 4A/4B, but:

The K460 has ECC protected memory and single bit errors are corrected in realtime and do not cause a system crash.

The cause of the crash was a data cache error of Processor 1 (counted 0,1,2,3).
The CPU caches of these old HP9000 servers are (in contrast to new systems) NOT ECC protected. => single-bit cache errors cannot be corrected and always lead to an HPMC and a system crash.

A cache error like this can occur sporadically on a CPU cache. I would do nothing unless the same cache error happens frequently.
You can recognize the cache error by:
1.) a valid and current timestamp in the ts99 file
2.) a chassis code beginning with 0x2...

In your case:
----------------- Processor 1 HPMC Information ------------------

Timestamp = Sun Apr 20 00:09:39 GMT 2008 (20:08:04:20:00:09:39)

HPMC Chassis Codes = 0xcbf0 0x20b1 <=== (the rest of the chassis codes can be ignored)
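
The two checks above could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, the field order of the bracketed timestamp (century, year, month, day, hour, minute, second) is an assumption read off the one example shown here, and the 24-hour window for "current" is an arbitrary choice. The input is the text of one "Processor N HPMC Information" section from the ts99 file.

```python
import re
from datetime import datetime


def parse_hpmc_timestamp(raw):
    """Parse 'CC:YY:MM:DD:hh:mm:ss'; the field order is an assumption."""
    cc, yy, month, day, hour, minute, second = (int(p) for p in raw.split(":"))
    return datetime(cc * 100 + yy, month, day, hour, minute, second)


def looks_like_cache_error(section, crash_time, max_age_hours=24):
    """Check 1: timestamp is valid and close to the crash time.
    Check 2: at least one chassis code begins with 0x2."""
    ts = re.search(r"Timestamp = .*\((\d\d(?::\d\d){6})\)", section)
    codes = re.search(r"HPMC Chassis Codes\s*=\s*((?:0x[0-9a-fA-F]+\s*)+)", section)
    if not ts or not codes:
        return False
    try:
        stamp = parse_hpmc_timestamp(ts.group(1))
    except ValueError:
        return False  # e.g. an all-zero timestamp is not a valid date
    age_seconds = abs((crash_time - stamp).total_seconds())
    recent = age_seconds <= max_age_hours * 3600
    # A four-digit code 0x2... has 2 in its top hex digit.
    has_cache_code = any(int(c, 16) >> 12 == 0x2 for c in codes.group(1).split())
    return recent and has_cache_code


if __name__ == "__main__":
    sample = (
        "Timestamp = Sun Apr 20 00:09:39 GMT 2008 (20:08:04:20:00:09:39)\n"
        "HPMC Chassis Codes = 0xcbf0 0x20b1\n"
    )
    # crash_time is whatever you know about when the box went down.
    print(looks_like_cache_error(sample, datetime(2008, 4, 20, 0, 30)))
```

Passing the approximate crash time in explicitly keeps the check usable on old ts99 files, where comparing against the current clock would always fail.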


best regards

Stefan
tkc
Esteemed Contributor

Re: HP-UX 10.20 server crashed ... but why?

this is a 4-cpu K460 system, i.e. cpu 0, 1, 2 & 3. cpu 1 is the cause of the problem. if you are not planning to replace the cpu for the moment, i would suggest you swap cpu 1 with cpu 2 or 3. in the future, should a similar crash occur again, you should see the problem move to cpu 2 or 3. that will confirm the problem you had recently was due to cpu 1.
Rich Fink
Occasional Advisor

Re: HP-UX 10.20 server crashed ... but why?

Hi all,

Sorry for the delay in replying - I was out of town for a few days.

Thanks for the decoding help and suggestions. We're still up and running (8 1/2 days), thankfully.

Since it only happened once, and is still running, I plan on leaving it alone until our next scheduled maintenance. Then I'll bring it down, swap out cpu1, and go from there. (we have plenty of spares) Of course, if she panics again before that, I'll swap out the cpu before the reboot.

Thanks again for the help, and the pointer to the manual online! Points to be assigned shortly.

-Rich