Operating System - HP-UX

wapper_1
New Member

strange service guard problem

Hi:

We have two rp4440 servers running HP-UX 11.11 and MC/ServiceGuard 11.14. After four months of normal operation, something strange happened:

On the 1st server, syslog.log shows the following:
...... skip ........
Aug 21 17:46:28 omc1scs1 cmcld: HB connection to 172.168.0.2 not responding, closing
Aug 21 17:46:28 omc1scs1 cmcld: GS connection to 172.168.0.2 not responding, closing
Aug 21 22:14:44 omc1scs1 named[611]: zone oss.nmuni.com/IN: refresh: unexpected rcode (REFUSED) from master 172.168.0.5#8054
Aug 21 23:30:30 omc1scs1 cmcld: HB connection to 172.168.0.2 is responding
Aug 21 23:30:30 omc1scs1 cmcld: GS connection to 172.168.0.2 is responding
Aug 21 23:52:30 omc1scs1 cmcld: HB connection to 172.168.0.2 not responding, closing
Aug 21 23:52:30 omc1scs1 cmcld: GS connection to 172.168.0.2 not responding, closing
Aug 22 01:32:30 omc1scs1 cmcld: HB connection to 172.168.0.2 is responding
Aug 22 01:32:30 omc1scs1 cmcld: GS connection to 172.168.0.2 is responding
Aug 22 01:54:30 omc1scs1 cmcld: HB connection to 172.168.0.2 not responding, closing
Aug 22 01:54:30 omc1scs1 cmcld: GS connection to 172.168.0.2 not responding, closing
Aug 22 02:12:31 omc1scs1 cmcld: GS connection to 172.168.0.2 is responding
Aug 22 02:34:31 omc1scs1 cmcld: GS connection to 172.168.0.2 not responding, closing
Aug 22 02:55:31 omc1scs1 named[611]: zone oss.nmuni.com/IN: refresh: unexpected rcode (REFUSED) from master 172.168.0.5#8054
Aug 22 03:34:31 omc1scs1 cmcld: GS connection to 172.168.0.2 is responding
Aug 22 03:56:31 omc1scs1 cmcld: GS connection to 172.168.0.2 not responding, closing
Aug 22 07:36:33 omc1scs1 named[611]: zone oss.nmuni.com/IN: refresh: unexpected rcode (REFUSED) from master 172.168.0.5#8054
Aug 22 09:06:18 omc1scs1 rlogind[22425]: Login failure (exit(1) from login(1))
Aug 22 10:09:38 omc1scs1 su: + ta root-omc
Aug 22 10:22:33 omc1scs1 cmcld: HB connection to 172.168.0.2 is responding
Aug 22 10:22:33 omc1scs1 cmcld: GS connection to 172.168.0.2 is responding

On the 2nd server, syslog.log shows the following:
....... skip .......
Aug 21 17:48:28 omc1dbsr cmcld: accept returned: No buffer space available
Aug 21 17:47:18 omc1dbsr named[611]: zone oss.nmuni.com/IN: refresh: failure trying master 172.168.0.5#8054: timed out
Aug 21 17:48:28 omc1dbsr above message repeats 3 times
Aug 21 17:48:28 omc1dbsr cmcld: Retrying accept due to a transient problem: No buffer space available.
Aug 21 17:48:28 omc1dbsr cmcld: accept returned: Resource temporarily unavailable
Aug 21 17:48:28 omc1dbsr cmcld: Retrying accept due to a transient problem: Resource temporarily unavailable.
Aug 21 17:48:28 omc1dbsr cmcld: accept failed due to a kernel problem: Resource temporarily unavailable.
Aug 21 17:48:31 omc1dbsr cmcld: accept returned: No buffer space available
Aug 21 17:48:28 omc1dbsr cmcld: Retrying accept due to a transient problem: Resource temporarily unavailable.
Aug 21 17:48:31 omc1dbsr cmcld: Retrying accept due to a transient problem: No buffer space available.
Aug 21 17:48:31 omc1dbsr cmcld: Retrying accept due to a transient problem: Resource temporarily unavailable.
Aug 21 17:48:31 omc1dbsr cmcld: accept failed due to a kernel problem: Resource temporarily unavailable.
Aug 21 17:48:31 omc1dbsr cmcld: Retrying accept due to a transient problem: Resource temporarily unavailable.
Aug 21 17:48:37 omc1dbsr cmcld: accept returned: No buffer space available
Aug 21 17:48:31 omc1dbsr cmcld: accept returned: Resource temporarily unavailable
Aug 21 17:48:37 omc1dbsr above message repeats 5 times

So the problem (Resource temporarily unavailable) recurs periodically: the cluster works for 20-40 minutes and then fails again. What do these log messages mean for my system, and how can I check it? The package did not switch, so there are no errors logged under /etc/cmcluster.

please help!

wapper
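
The "works for a while, then fails again" pattern can be measured rather than estimated. Below is a minimal sketch, assuming the syslog.log lines have been copied to a machine with Python 3; the function name up_intervals and the placeholder year are assumptions of this example (syslog omits the year), and the sample lines are taken from the excerpt above.

```python
import re
from datetime import datetime

# Matches the cmcld connection messages shown in the posted syslog excerpt.
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ cmcld: (HB|GS) connection to (\S+) "
    r"(is responding|not responding)"
)

def up_intervals(lines, year=2000):
    """Return (kind, peer, minutes) for each 'is responding' -> 'not responding' span.

    year is a placeholder because syslog timestamps carry no year;
    a log that crosses New Year is not handled by this sketch.
    """
    last_up = {}   # (kind, peer) -> time the connection last came up
    spans = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue   # ignore named, su, rlogind and other entries
        stamp, kind, peer, state = m.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        key = (kind, peer)
        if state == "is responding":
            last_up[key] = when
        elif key in last_up:
            minutes = (when - last_up.pop(key)).total_seconds() / 60
            spans.append((kind, peer, minutes))
    return spans

sample = [
    "Aug 21 17:46:28 omc1scs1 cmcld: HB connection to 172.168.0.2 not responding, closing",
    "Aug 21 23:30:30 omc1scs1 cmcld: HB connection to 172.168.0.2 is responding",
    "Aug 21 23:52:30 omc1scs1 cmcld: HB connection to 172.168.0.2 not responding, closing",
    "Aug 22 01:32:30 omc1scs1 cmcld: HB connection to 172.168.0.2 is responding",
    "Aug 22 01:54:30 omc1scs1 cmcld: HB connection to 172.168.0.2 not responding, closing",
]

spans = up_intervals(sample)
for kind, peer, minutes in spans:
    print(f"{kind} connection to {peer} stayed up for {minutes:.0f} min")
```

On the lines posted from omc1scs1, each complete up span for both HB and GS works out to 22 minutes. The excerpt is abridged, so this may not hold for the full log.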
Geoff Wild
Honored Contributor

Re: strange service guard problem

Sounds like you are out of system resources - "No buffer space available"...

How much memory is in these systems?

Can you post the output of kmtune?

Rgds...Geoff

Proverbs 3:5,6 Trust in the Lord with all your heart and lean not on your own understanding; in all your ways acknowledge him, and he will make all your paths straight.
RAC_1
Honored Contributor

Re: strange service guard problem

You have three different problems on your system. Two of them seem to depend on each other.

1. Name server - You are running named and it has some problems.
Check whether named is running correctly, and whether its configuration was changed recently.

2. The heartbeat link and the SG link have problems. Name resolution for cluster hosts should be done using /etc/hosts. That is fast, local and very easy to manage.
This problem seems to depend on problem 1.

Make use of the /etc/hosts file for the SG cluster nodes. Check what network problems occurred: netfmt -f /var/adm/nettl.LOGxx

3. No buffer space available.
Many errors can produce this message. Possible checks: swapinfo -mat, glance -m (memory utilization)
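
Point 2 can be sanity-checked with a few lines of script. This is only a sketch, assuming a copy of /etc/hosts is available as text: the helper name missing_from_hosts is hypothetical, the sample hosts content is made up for illustration, and mapping 172.168.0.2 to omc1dbsr is an inference from the posted syslog, not a confirmed fact.

```python
def missing_from_hosts(hosts_text, names):
    """Return the names that have no entry in the given hosts-file text."""
    known = set()
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        known.update(fields[1:])              # field 0 is the address, the rest are names
    return [name for name in names if name not in known]

# Hypothetical hosts content: only the node names and 172.168.0.2 come from the thread.
hosts_text = """\
127.0.0.1    localhost loopback
172.168.0.2  omc1dbsr    # assumed to be the peer address seen in syslog
"""

missing = missing_from_hosts(hosts_text, ["omc1scs1", "omc1dbsr"])
print("Cluster nodes missing from hosts file:", missing)
```

With this sample content the script reports omc1scs1 as missing. Against the real file, an empty list on both nodes would mean the cluster node names can be resolved locally, provided the lookup order puts the hosts file ahead of DNS.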
There is no substitute to HARDWORK
wapper_1
New Member

Re: strange service guard problem

Geoff and RAC:

1. Both servers have 8 GB of memory.
2. The BIND version is 9.2.
3. named does have a problem, that's true! We use namesurf as the primary DNS server, and it runs in one of the SG packages. BIND is installed on each server as a secondary DNS, so all the configuration files under /var/named are generated by named when it starts. But that only happens on the 1st server (omc1scs1). On the 2nd server (omc1dbsr), the transfer fails because nothing is listening on port 8054 (the listening port of the package in which namesurf runs), which is why there are transfer errors in syslog.log on omc1dbsr. I can't access the servers now because I am not physically there and it is night in China ;-), but I will check the other things later. Unfortunately I have no idea why the 2nd server can't contact that port: /etc/named.conf is the same on both servers, and nslookup works fine on omc1dbsr even though the transfer fails...

thanks!

wapper
Gavin Clarke
Trusted Contributor

Re: strange service guard problem

Just in case you haven't searched the forums you might want to look at this link:

http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=653340

It describes very similar symptoms; there was no really obvious answer, but several patches were recommended.

We have a very similar setup to yours and it's fine so far, though our MC is 11.16.
Gavin Clarke
Trusted Contributor

Re: strange service guard problem

I've even looked up the patches mentioned on the patch database:

Patch mentioned    Latest version
PHSS_30028         PHSS_32260
PHNE_29473         PHNE_33395

I hope this helps