
Re: System unreachable

 
nchat504
Advisor

System unreachable

I have an rp4440 that has been in production for at least a year, and it has probably been up for that long. The issue is that today I was not able to telnet/ssh in or get a response at the remote console. When I logged on to the console, there was nothing in the live events, system, or event logs. However, I was able to ping the system. My last resort was to issue an RS, and the system rebooted and came up fine. Once it was back online, syslog had nothing significant and dmesg showed nothing. So how can I figure out what caused this system to get into that state? The customer wants a root cause and I do not know where else to look.

BTW, this happened last year on one of our rp3440 systems and we did not find anything then either. Please help!
Patrick Wallek
Honored Contributor

Re: System unreachable

After a reboot, dmesg will be useless as it would have been cleared out during the boot cycle.

However, you should have a look at /var/adm/syslog/OLDsyslog.log. That will be the file that was active when the problem occurred.

During reboot syslog.log gets moved to OLDsyslog.log, recreated and then syslogd starts.
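One way to narrow down when the box stopped logging is to look for long silences between consecutive entries in OLDsyslog.log. The following is a minimal sketch, not a supported tool: the function names and the two-day threshold are made up for illustration, and because syslog lines carry no year, the year has to be supplied as an assumption (a log that spans New Year would need extra handling). The sample lines are taken from the log excerpt posted in this thread.

```python
from datetime import datetime, timedelta

def parse_stamp(line, year):
    """Return the timestamp of a syslog line, or None for wrapped/continuation lines."""
    parts = line.split(None, 3)
    if len(parts) < 3:
        return None
    try:
        # The year is an assumption supplied by the caller; syslog does not record it.
        return datetime.strptime(
            f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
        )
    except ValueError:
        return None

def find_gaps(lines, year, threshold=timedelta(hours=12)):
    """Return (previous, current, gap) for consecutive entries further apart than threshold."""
    gaps = []
    prev = None
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue
        if prev is not None and stamp - prev > threshold:
            gaps.append((prev, stamp, stamp - prev))
        prev = stamp
    return gaps

sample = [
    "Jan 24 03:44:55 sapux083 xntpd[20125]: synchronized to 10.213.255.1, stratum=3",
    "Jan 27 08:51:51 sapux083 sshd[12353]: Accepted keyboard-interactive/pam",
    "Feb 1 12:58:11 sapux083 syslog: gethostbyaddr: lookup mismatch",
    "Feb 2 15:12:26 sapux083 syslog: gethostbyaddr: lookup mismatch",
]

for before, after, gap in find_gaps(sample, year=2010, threshold=timedelta(days=2)):
    print(f"{before} -> {after}  (silent for {gap})")
```

To run it against the real file, pass an open file handle for /var/adm/syslog/OLDsyslog.log instead of the sample list. A long silence that ends at the reboot gives at least a window for when the system stopped writing logs, even if it does not say why.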

When you connected to the console, were you able to ping other devices on the network? Did you check to see if the inetd daemon was running? Was the sshd daemon running? How busy was the system? Was there a high load at the time?

A ping indicates some modicum of network connectivity, but that's no guarantee that other things will work.
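To illustrate the difference between "answers ping" and "services are actually working", here is a rough sketch that could be run from another host the next time this happens. It checks whether a TCP connection to the ssh/telnet ports completes, and whether the SSH daemon sends its identification line. The function names and timeouts are illustrative assumptions; nothing here is specific to HP-UX.

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def ssh_banner(host, port=22, timeout=3.0):
    """Return the first bytes the server sends (the SSH identification line),
    or None if the connection fails or nothing arrives within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            data = conn.recv(256)
    except OSError:
        return None
    return data.decode("ascii", "replace").strip() or None

def probe(host, ports=(22, 23)):
    """Map each port to True/False depending on whether a connection completes."""
    return {port: port_open(host, port) for port in ports}
```

If the connection completes but no banner ever arrives, the network stack is still answering while the daemon behind the port is not being serviced, which is a different picture from a refused connection or a dead network path.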

At this point, there may not be much to look at to determine root cause of the problem.
nchat504
Advisor

Re: System unreachable

OLDsyslog did not have any clues either. Basically there weren't any entries since Feb. 2nd, and prior to that Jan. 24th. The last 25 lines looked like this:

Jan 23 16:55:56 sapux083 SAPPRD_00[26285]: Unable to open trace file sapstartsrv.log. (Error 13 Permission denied) [ntservsserver.cpp 1909]
Jan 24 03:35:19 sapux083 xntpd[20125]: synchronized to 10.213.255.2, stratum=3
Jan 24 03:35:51 sapux083 xntpd[20125]: synchronisation lost
Jan 24 03:37:27 sapux083 xntpd[20125]: synchronized to 10.213.255.1, stratum=3
Jan 24 03:43:51 sapux083 xntpd[20125]: synchronized to 10.213.255.2, stratum=3
Jan 24 03:44:23 sapux083 xntpd[20125]: synchronisation lost
Jan 24 03:44:55 sapux083 xntpd[20125]: synchronized to 10.213.255.1, stratum=3
Jan 27 08:51:51 sapux083 sshd[12353]: Accepted keyboard-interactive/pam for glyons from 10.213.39.145 port 2156 ssh2
Feb 1 12:58:11 sapux083 syslog: gethostbyaddr: D1CNI12285PL.coxnewscop.int. != 169.137.106.89
Feb 2 14:47:55 sapux083 syslog: gethostbyaddr: d1cls12231pl.coxohio.com. != 169.137.104.13
Feb 2 15:12:25 sapux083 syslog: gethostbyaddr: D1ADV7899P.coxnewscop.int. != 169.137.105.87
Feb 2 15:12:26 sapux083 syslog: gethostbyaddr: d1mkt10645p.coxnewscop.int. != 169.137.105.123

When I went to the console, I could not ping or type anything; it wasn't until I rebooted that I got anything to write to the console window. This system was not being used heavily at the time.
Michael Steele_2
Honored Contributor

Re: System unreachable

Hi

What's in /etc/opt/resmon/logs?
Support Fatherhood - Stop Family Law
Keith Bryson
Honored Contributor

Re: System unreachable

Hi there

One cause of this could be RAM/SWAP related. Were/are you running tight on RAM on this system and how much swap do you have configured? If you have perf tools installed you could possibly use the "extract" tool to look at processes that were running at the time. Do a trawl from the / [root] folder using "find" for any core or trace files that may have been created at the time of the "crash". Is this an Oracle RDBMS server? Which version of 11i are you using?
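For the core/trace file trawl, a rough sketch of the search is below. It walks a directory tree and reports regular files named core or core.* that were modified recently, so stale cores from months ago do not drown out anything from the time of the hang. The function name, the patterns, and the two-day window are assumptions for illustration only; adjust them to whatever your applications actually write.

```python
import fnmatch
import os
import stat
import time

def recent_core_files(root, max_age_days=2, patterns=("core", "core.*")):
    """Return (path, size_bytes, mtime) for regular files under root whose
    name matches one of patterns and which changed in the last max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    hits = []
    # onerror swallows unreadable directories instead of aborting the walk
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            if not any(fnmatch.fnmatch(name, p) for p in patterns):
                continue
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable
            if stat.S_ISREG(info.st_mode) and info.st_mtime >= cutoff:
                hits.append((path, info.st_size, info.st_mtime))
    return sorted(hits, key=lambda hit: hit[2])
```

Calling recent_core_files("/") as root covers the whole system; on a box with large filesystems it may be worth starting with the application and home directories first.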

Let us know how you get on with your investigation.

Keith
Arse-cover at all costs
nchat504
Advisor

Re: System unreachable

total 2816
-rw-r--r-- 1 root root 60961 Feb 5 17:19 api.log
-rw-r--r-- 1 root root 500031 Feb 5 06:36 api.log.old
-rw-r--r-- 1 root root 2086 Oct 12 2008 client.log
-rw-r--r-- 1 root root 500831 Oct 12 2008 client.log.old
-rw-r--r-- 1 root sys 17977 Feb 5 13:56 emsagent.log
-rw------- 1 root sys 3598 Feb 5 13:56 emsha.log
-rw-r--r-- 1 root root 276784 Feb 5 14:59 registrar.log
-rw-r--r-- 1 root root 28078 Jun 18 2009 reslog.html


Any file you are interested in me checking?
nchat504
Advisor

Re: System unreachable

-------------------Start Event--------------------
User event occurred at Fri Feb 5 06:41:32.197519 2010
Process ID: 7721 (/usr/sbin/stm/uut/bin/.../dm_chassis) Log Level: Error
The chassis code monitor (dm_chassis) cannot run on this machine. Either the machine does not generate chassis logs, or the machine is not supported by dm_chassis. Currently, the following set of machines are supported by dm_chassis:
superdome
S-class
-------------------End Event----------------------

-------------------Start Event--------------------
User event occurred at Fri Feb 5 06:46:32.682878 2010
Process ID: 7919 (/usr/sbin/stm/uut/bin/.../dm_chassis) Log Level: Error
The chassis code monitor (dm_chassis) cannot run on this machine. Either the machine does not generate chassis logs, or the machine is not supported by dm_chassis. Currently, the following set of machines are supported by dm_chassis:
superdome
S-class


Saw this in the api.log file. It was also present in the old api.log. Still checking the others.
nchat504
Advisor

Re: System unreachable

Keith,

Thanks for your reply. Here are some of the answers

===============
No perf tools installed, and no core files found using the find / -name core* command. The system is used as an application server for one of the production SAP/Oracle systems.

swapinfo -t
               Kb        Kb         Kb   PCT  START/       Kb
TYPE        AVAIL      USED       FREE  USED   LIMIT  RESERVE  PRI  NAME
dev       4194304         0    4194304    0%       0        -    1  /dev/vg00/lvol2
dev      26624000         0   26624000    0%       0        -    1  /dev/vg00/lvol10
reserve         -  18453120  -18453120
memory   26111764   1347344   24764420    5%
total    56930068  19800464   37129604   35%       -        0    -

Memory Information:
physical page size = 4096 bytes, logical page size = 4096 bytes
Physical: 33552384 Kbytes, lockable: 26060496 Kbytes, available: 29910084 Kbytes
nchat504
Advisor

Re: System unreachable

From the registrar.log file...

-------------------Start Event--------------------
Event 2960 occurred at Fri Feb 5 13:58:48.020756 2010
Process ID: 2713 (/etc/opt/resmon/lbin/registrar) Log Level: Error
process_time_event: Expired awaiting-reply object, socket=7
-------------------End Event----------------------

-------------------Start Event--------------------
Event 2937 occurred at Fri Feb 5 13:58:48.025938 2010
Process ID: 2713 (/etc/opt/resmon/lbin/registrar) Log Level: Error
abort_awaiting_reply_obj: socket=7: Connection aborted
-------------------End Event----------------------

-------------------Start Event--------------------
Event 2961 occurred at Fri Feb 5 13:58:48.026394 2010
Process ID: 2713 (/etc/opt/resmon/lbin/registrar) Log Level: Error
process_time_event: Expired contact object for monitor /usr/sbin/stm/uut/bin/tools/monitor/RemoteMonitor
-------------------End Event----------------------

-------------------Start Event--------------------
Event 2960 occurred at Fri Feb 5 13:59:04.015484 2010
Process ID: 2713 (/etc/opt/resmon/lbin/registrar) Log Level: Error
process_time_event: Expired awaiting-reply object, socket=7
----------------

Nothing else significant in the other files.
Michael Steele_2
Honored Contributor

Re: System unreachable

Oh, well, I guess you better install all online and diagnostics. This is your problem.

Verify with swlist | grep -i online

http://software.hp.com/portal/swdepot/displayProductInfo.do?productNumber=B6191AAE
Support Fatherhood - Stop Family Law
Keith Bryson
Honored Contributor

Re: System unreachable

Looking at your swapinfo/dmesg output, this is a 32GB server with 30GB (ish) of swap? If you run glance or top, how much RAM is reported as used? If you are hitting 90%+, I can't see that you have enough swap configured, as HP-UX wants to reserve the same amount of RAM in swap at all times.

Let us know.
Keith
Arse-cover at all costs
nchat504
Advisor

Re: System unreachable

System: sapux083 Fri Feb 5 17:59:07 2010
Load averages: 0.02, 0.01, 0.02
187 processes: 171 sleeping, 15 running, 1 zombie
Cpu states:
CPU  LOAD   USER   NICE    SYS   IDLE  BLOCK  SWAIT   INTR   SSYS
  0  0.01   0.0%   0.0%   0.0% 100.0%   0.0%   0.0%   0.0%   0.0%
  1  0.06   4.2%   0.0%   0.4%  95.4%   0.0%   0.0%   0.0%   0.0%
  2  0.00   0.0%   0.0%   0.2%  99.8%   0.0%   0.0%   0.0%   0.0%
  3  0.01   0.0%   0.0%   0.0% 100.0%   0.0%   0.0%   0.0%   0.0%
  4  0.01   0.0%   0.0%   0.0% 100.0%   0.0%   0.0%   0.0%   0.0%
  5  0.03   0.0%   0.0%   0.2%  99.8%   0.0%   0.0%   0.0%   0.0%
  6  0.01   3.4%   0.0%   0.8%  95.8%   0.0%   0.0%   0.0%   0.0%
  7  0.01   0.0%   0.0%   0.6%  99.4%   0.0%   0.0%   0.0%   0.0%
---  ----  -----  -----  -----  -----  -----  -----  -----  -----
avg  0.02   1.0%   0.0%   0.2%  98.8%   0.0%   0.0%   0.0%   0.0%

Memory: 6177968K (1233628K) real, 18980620K (3751632K) virtual, 22543108K free
Page# 1/38

CPU TTY   PID USERNAME PRI NI   SIZE    RES STATE  TIME %WCPU  %CPU COMMAND
  1   ?  6148 prdadm   155 20 14891M 81092K sleep  0:31 12.51 12.49 dw.sapPRD_D
  6   ? 15530 prdadm   154 20 14903M 93220K sleep  2:32  1.39  1.39 dw.sapPRD_D
  7   ?  2435 root     152 20   587M 96616K run    0:21  0.79  0.79 java
=============================================
using the top command
nchat504
Advisor

Re: System unreachable

sapux083 /var/adm/syslog swlist | grep -i online
B3929DA 3.5-ga15-04 HP OnLineJFS 3.5
OnlineDiag B.11.11.18.05 HPUX 11.11 Support Tools Bundle, Dec 2006
Keith Bryson
Honored Contributor

Re: System unreachable

11i v1 is going to make this a little more difficult. When was this server last patched? Have you applied the latest QPK/HWE/Feature bundles? Also, any idea what dbc_max_pct and dbc_min_pct figures you have in the kernel? (Use sysdef or SAM to find out.)

I can remember several issues with v1 and vhand (which were fixed with patches). I don't tend to rely on memory stats from "top", it's a shame you don't have glance installed. It's difficult for me to see the RSS figures for your SAP processes (it looks like 30Gb - but that may be shared memory).

I'd definitely consider making SWAP 1.5x RAM (at least).

Now, I wonder if they still have the evaluation copy of Glance somewhere.........

( 8 )
Keith
Arse-cover at all costs
Michael Steele_2
Honored Contributor

Re: System unreachable

Dec 2006

You have messages being sent and nothing listening, that is, if your firmware is up to date. If not, then you have nothing being sent and nothing listening.
Support Fatherhood - Stop Family Law
Dennis Handly
Acclaimed Contributor

Re: System unreachable

>My last resort was to issue an RS

If you want to know why a system was hung, you need to use TC to get a memory dump. (You first need to make sure crash dumps are enabled.)

>swapinfo -t

(It would be helpful next time to always use -tam.)

>Patrick: A ping indicates some modicum of network connectivity, but that's no guarantee that other things will work.

I've had that too, unfortunately. :-(

>Keith: as HP-UX wants to reserve the same amount of RAM in swap at all times.

As long as you have pseudo-swap enabled, that's a myth about device swap.
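As a worked check on that point, the arithmetic below uses only the figures from the swapinfo -t output posted earlier in this thread (all in KB). It is a sketch under the assumption that the "memory" line is the pseudo-swap contribution; the variable names are made up for illustration.

```python
# Figures in KB, copied from the swapinfo -t output posted in this thread.
device_swap = {"/dev/vg00/lvol2": 4194304, "/dev/vg00/lvol10": 26624000}
memory_avail = 26111764   # AVAIL on the "memory" line (assumed to be pseudo-swap)
memory_used = 1347344     # USED on the "memory" line
reserved = 18453120       # USED on the "reserve" line

device_total = sum(device_swap.values())
total_avail = device_total + memory_avail
total_used = reserved + memory_used

# Share of all swap space (device plus memory) that is reserved or used.
pct_of_total = 100 * total_used / total_avail
# The same reservation measured against device swap alone, for comparison.
pct_of_device = 100 * reserved / device_total

print(f"device swap : {device_total} KB")
print(f"total avail : {total_avail} KB")
print(f"total used  : {total_used} KB ({pct_of_total:.1f}% of total)")
print(f"reserved vs device swap only: {pct_of_device:.1f}%")
```

The totals reproduce the "total" line of the posted output, which shows that the memory line is counted toward available swap on top of the two device swap areas.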
nchat504
Advisor

Re: System unreachable

dbc_max_pct=8
dbc_min_pct=5
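For reference, here is what those two percentages work out to on this box, assuming they are applied to the physical memory figure from the dmesg output posted earlier (33552384 KB). This is just the arithmetic, with illustrative variable names.

```python
physical_kb = 33552384   # "Physical:" figure from the dmesg output in this thread
dbc_max_pct = 8
dbc_min_pct = 5

KB_PER_GIB = 1024 * 1024

# Upper and lower bounds of the dynamic buffer cache, as a share of physical memory.
max_cache_gib = physical_kb * dbc_max_pct / 100 / KB_PER_GIB
min_cache_gib = physical_kb * dbc_min_pct / 100 / KB_PER_GIB

print(f"buffer cache range: {min_cache_gib:.2f} GiB to {max_cache_gib:.2f} GiB")
```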

Systems were patched in 2008

We are currently working on a patch plan to update the systems. I will look into installing glance as well.

GOLDAPPS11i B.11.11.0712.475 Applications Patches for HP-UX 11i v1, December 2007
GOLDBASE11i B.11.11.0712.475 Base Patches for HP-UX 11i v1, December 2007
HWEnable11i B.11.11.0612.458 Hardware Enablement Patches for HP-UX 11i v1, December 2006

Michael Steele_2
Honored Contributor

Re: System unreachable

Hi

Question for you: Has your application been upgraded recently with any new rollouts?

I ask because a memory leak could freeze the system.

Let me know.

Anyway, you're not going to know where the problem is until you upgrade your diags and firmware. You're probably many, many versions out of date.
Support Fatherhood - Stop Family Law
nchat504
Advisor

Re: System unreachable


Not sure about the rollouts. I know they did a database refresh a few months back, but this is their app server, so I'm not sure what changes were made on it. I will ask. In the meantime, I will take your suggestion and see what I can do to expedite the patch, firmware, and diag update.

Thanks for your feedback.