
Server down problem

 
yyghp
Super Advisor


One of our servers went down last night, and I had to reboot it this morning to get it working again.

Here's what I got from GSP:

GSP> sl


SL

Select Chassis Code Buffer to be displayed:
Incoming, Activity, Error, Current boot or Last boot? (I/A/E/C/L) e
e

Set up filter options on this buffer? (Y/[N])


The first entry is the most recent Chassis Code
Type + CR and CR to go up (back in time),
Type - CR and CR to go down (forward in time),
Type Q/q CR to quit.


Log Entry # 0 :
SYSTEM NAME: srs068rib
DATE: 12/02/2004 TIME: 08:00:02
ALERT LEVEL: 13 = System hang detected via timer popping

SOURCE: 1 = processor
SOURCE DETAIL: 1 = processor general SOURCE ID: 0
PROBLEM DETAIL: 4 = timeout

CALLER ACTIVITY: F = display_activity() update STATUS: 0
CALLER SUBACTIVITY: 00 = implementation dependent
REPORTING ENTITY TYPE: E = HP-UX REPORTING ENTITY ID: 00

0x78E000D41100F000 00000003 00000001 type 15 = Activity Level/Timeout
0x58E008D41100F000 0000680B 02080002 type 11 = Timestamp 12/02/2004 08:00:02
Type CR for next entry, Q CR to quit.



Log Entry # 1 :
SYSTEM NAME: srs068rib
DATE: 12/02/2004 TIME: 08:00:05
ALERT LEVEL: 2 = Non-Urgent operator attention required

SOURCE: 0 = unknown, no source stated
SOURCE DETAIL: 0 = unknown, no source stated SOURCE ID: FF
PROBLEM DETAIL: 0 = no problem detail

CALLER ACTIVITY: 6 = machine check STATUS: 2
CALLER SUBACTIVITY: 10 = implementation dependent
REPORTING ENTITY TYPE: 0 = system firmware REPORTING ENTITY ID: 00

0x0000002000FF6102 00000000 00000000 type 0 = Data Field Unused
0x5800082000FF6102 0000680B 02080005 type 11 = Timestamp 12/02/2004 08:00:05
Type CR for next entry, - CR for previous entry, Q CR to quit.
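The two data words on each "type 11 = Timestamp" line appear to encode the same date and time that the GSP prints next to them. A minimal sketch of that decoding follows, assuming a byte layout inferred only from the two entries above (years since 1900, zero-based month, then day, hour, minute, second), not from HP documentation. The function name `decode_timestamp_words` is hypothetical.

```python
from datetime import datetime


def decode_timestamp_words(word1: str, word2: str) -> datetime:
    """Decode the two 32-bit data words of a 'type 11 = Timestamp' line.

    Assumed layout (inferred from the two log entries, not documented):
      word1 = 00 00 YY MM   YY = years since 1900, MM = zero-based month
      word2 = DD hh mm ss
    """
    w1 = bytes.fromhex(word1)
    w2 = bytes.fromhex(word2)
    year = 1900 + w1[2]
    month = w1[3] + 1  # assumed zero-based, so 0x0B means December
    day, hour, minute, second = w2
    return datetime(year, month, day, hour, minute, second)


# Data words copied from Log Entry # 0 above.
print(decode_timestamp_words("0000680B", "02080002"))
```

Under that assumption both entries decode to the printed values (12/02/2004 08:00:02 and 08:00:05), which suggests the hang was logged about three seconds before the firmware entry.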


There are NO entries for the past 2 weeks in /var/adm/syslog/OLDsyslog.log
(syslog.log only has entries from the new start after I rebooted this morning.)

How can I find out what was going on with the server last night, and what should I do next?

Thanks !
RAC_1
Honored Contributor

Re: Server down problem

Check the board/cell connections and the CPU connections, and reseat them.

Also, from the GSP, check the power supply status. (I think the command is ps.)

Anil
There is no substitute to HARDWORK
Steven E. Protter
Exalted Contributor
Solution

Re: Server down problem

ALERT LEVEL: 13 = System hang detected via timer popping

Timer popping is usually a non-serious message that can be ignored. I've never seen a system hang from this.

Do you have crash save configured?

If so, there may be a crash in /var/adm/crash

If not, set up for the next one.

Change the first variable in /etc/rc.config.d/savecrash to 1
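To see what that first variable is currently set to before editing the file, something like the following could be used. It is only a sketch: it assumes the file uses plain shell-style NAME=value lines, and `first_assignment` is a hypothetical helper, not an HP-UX tool.

```python
import os
import re

# Matches a shell-style assignment such as NAME=value (assumed file format).
ASSIGNMENT = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)=(.*)$")


def first_assignment(text):
    """Return (name, value) of the first uncommented NAME=value line, or None."""
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        match = ASSIGNMENT.match(line)
        if match:
            # Drop any trailing comment and surrounding quotes from the value.
            value = match.group(2).split("#", 1)[0].strip().strip('"')
            return match.group(1), value
    return None


if __name__ == "__main__":
    path = "/etc/rc.config.d/savecrash"
    if os.path.exists(path):
        with open(path) as handle:
            print(first_assignment(handle.read()))
    else:
        print(path, "not found on this machine")
```

If the value printed is not 1, crash saving is not enabled the way the reply describes.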

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Sanjay_6
Honored Contributor

Re: Server down problem

Hi,

Take a look at this doc:

http://www1.itrc.hp.com/service/cki/docDisplay.do?docLocale=en_US&docId=200000072401028

The doc id is KBRC00012411.

/quote/

For the vast majority of cases, that error is a "bogus" error, although it can be somewhat corrected by the latest GSP firmware and latest diagnostics.

Superdomes, Keystones, Matterhorns, N- and L-class machines all seem to have this minor fault at times.

/Endquote/

A crash dump analysis, if you have a dump, may help. Try this doc on how to do a crash dump analysis, and forward the result to HP for more help:

http://www1.itrc.hp.com/service/cki/search.do?category=c0&mode=id&searchString=OZBEKBRC00000611&searchCrit=allwords&docType=Security&docType=Patch&docType=EngineerNotes&docType=BugReports&docType=Hardware&docType=ReferenceMaterials&docType=ThirdParty&search.x=18&search.y=7

The doc id is OZBEKBRC00000611.

Hope this helps.

Regds
Ryan McKlveen
Advisor

Re: Server down problem

I agree. I think the GSP information is somewhat misleading. I'd look at /etc/shutdownlog. If you have crash dumps configured as mentioned before, check for one. Is there anything in tombstones? You could send the ts99 file out for analysis. HTH
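A quick way to check all of the places mentioned so far in one pass might look like this. It is a sketch only: the paths are the usual HP-UX locations (/var/adm/crash, /etc/shutdownlog, /var/tombstones/ts99) and are assumptions about this particular system, and `find_artifacts` is a hypothetical helper.

```python
import os

# Usual HP-UX locations, given relative to the filesystem root (assumed).
CANDIDATES = ["var/adm/crash", "etc/shutdownlog", "var/tombstones/ts99"]


def find_artifacts(root="/"):
    """Return {path: directory listing or file size} for each candidate that exists."""
    found = {}
    for rel in CANDIDATES:
        path = os.path.join(root, rel)
        if os.path.isdir(path):
            found[path] = sorted(os.listdir(path))  # e.g. saved crash directories
        elif os.path.isfile(path):
            found[path] = os.path.getsize(path)  # size in bytes
    return found


if __name__ == "__main__":
    for path, info in find_artifacts().items():
        print(path, "->", info)
```

Anything missing from the output simply was not found. An empty list for the crash directory means no dump was saved.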
Scot Bean
Honored Contributor

Re: Server down problem

If this happens again and the system appears to be hung, consider doing a TOC (Transfer of Control) reboot.

The GSP command is "TC".

This will force a memory dump.