Re: Solaris NIS server crash resulted in MCSG Node reboot

Mahesh Babbar · ‎08-30-2004

Hello,

Environment is:

XP512 as Backend storage array
HP-UX 11i two-node cluster as NFS gateways connected via Brocade switches
Solaris 8 as NIS master server

Background and Issue
--------------------

1. The Solaris NIS server went down at 5:00 AM.

2. This resulted in the following messages in /var/adm/syslog/syslog.log on the node on which the packages were running.

Aug 31 07:39:10 cmgtpn1 syslog: svc_getreqset: No transport handle for fd 257
Aug 31 07:39:10 cmgtpn1 syslog: svc_getreqset: No transport handle for fd 258
-
-
-
Aug 31 09:08:41 cmgtpn1 EMS [1923]: ------ EMS Event Notification ------ Value: "error" for Resource: "/cluster/package/package_status/itacpkg1" (Threshold: != " 1")

3. At 9:11 AM, the node cmgtpn1 rebooted with the following messages in the /var/adm/syslog/syslog.log file.

Aug 31 09:11:15 cmgtpn1 cmsrvassistd[14799]: Unable to communicate with ServiceGuard main daemon (cmcld): Network is unreachable
Aug 31 09:12:17 cmgtpn1 cmclconfd[1805]: The ServiceGuard daemon, /usr/lbin/cmcld[1806], died upon receiving signal number 6.

QUESTION: WHY SHOULD THE NODE BE REBOOTED WHEN THE NIS SERVER HAS GONE DOWN?

4. All the packages shifted to the second node at 9:14 AM.

5. The same messages then start to appear on the second node (cmgtpn2). The NIS server is still down.

Aug 31 09:21:29 cmgtpn2 syslog: svc_getreqset: No transport handle for fd 257
Aug 31 09:21:29 cmgtpn2 syslog: svc_getreqset: No transport handle for fd 258
Aug 31 09:21:29 cmgtpn2 syslog: svc_getreqset: No transport handle for fd 259
Aug 31 09:25:00 cmgtpn2 syslog: svc_getreqset: No transport handle for fd 260
-
-
-
Aug 31 09:37:17 cmgtpn2 syslog: svc_getreqset: No transport handle for fd 261
Aug 31 09:37:17 cmgtpn2 syslog: svc_getreqset: No transport handle for fd 262
Aug 31 09:37:17 cmgtpn2 syslog: svc_getreqset: No transport handle for fd 259
Aug 31 09:40:48 cmgtpn2 syslog: svc_getreqset: No transport handle for fd 257
Aug 31 09:51:43 cmgtpn2 cmsrvassistd[1842]: Lost connection with ServiceGuard cluster daemon (cmcld): Connection timed out
Aug 31 09:54:49 cmgtpn2 cmlvmd: Could not read messages from /usr/lbin/cmcld: Connection timed out
Aug 31 09:54:49 cmgtpn2 cmlvmd: CLVMD exiting
Aug 31 09:54:49 cmgtpn2 cmsrvassistd[149]: Unable to communicate with ServiceGuard main daemon (cmcld): Network is unreachable
Aug 31 09:55:51 cmgtpn2 cmclconfd[1800]: The ServiceGuard daemon, /usr/lbin/cmcld[1801], died upon receiving signal number 6.
#

6. At that point, the second node also rebooted, and the packages failed back to the first node.

QUESTION: Why should a cluster node be rebooted if the NIS server crashes?

Regards

Mahesh

melvyn burnard · ‎08-30-2004

How are the two nodes configured to get their services?
SG needs access to specific services, as listed in the /etc/services file:
grep hacl /etc/services
If the nodes are configured to use NIS and NOT their own services file, then when the NIS server died they had communication issues, and therefore you get errors.
Also, the package may rely on the NIS server and may have a FAIL_FAST variable set to yes, which would TOC the node if the package lost a service.
These are the clues:

cmsrvassistd[1842]: Lost connection with ServiceGuard cluster daemon (cmcld): connection timed out
etc
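
The `grep hacl /etc/services` check above can also be scripted when several nodes need comparing. The following is only a minimal sketch in Python: the function name `hacl_entries` is made up for illustration, and it assumes the usual `name port/protocol` layout of the services file.

```python
def hacl_entries(services_text):
    """Return (name, port, proto) for each service whose name starts with 'hacl'."""
    found = []
    for raw in services_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        # A valid entry looks like: <name> <port>/<proto> [aliases...]
        if len(fields) < 2 or "/" not in fields[1]:
            continue
        port, proto = fields[1].split("/", 1)
        if fields[0].startswith("hacl") and port.isdigit():
            found.append((fields[0], int(port), proto))
    return found
```

Feeding it the contents of /etc/services from each node and comparing the two lists would show whether the local files carry the same hacl entries; an empty list suggests the node depends on another source for them.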

My house is the bank's, my money the wife's, But my opinions belong to me, not HP!

Mahesh Babbar · ‎08-30-2004

Hi Melvyn,

The cluster is not using any of the services (which are defined in /etc/services) off NIS.

It's configured as an NIS client, taking the user/group and host databases from the NIS server.

Regards

Mahesh

Duncan Edmonstone · ‎08-30-2004

It may be that the systems are configured to try NIS first, and not to continue if it fails...

Can you post the contents of the file /etc/nsswitch.conf from both nodes?

HTH

Duncan

I am an HPE Employee

melvyn burnard · ‎08-30-2004

You should also check your package logs to see if they have any further details on the symptom.
You certainly have lost some network connectivity, according to the syslog entries.

My house is the bank's, my money the wife's, But my opinions belong to me, not HP!

Mahesh Babbar · ‎08-30-2004

Hi Duncan,

The entries in /etc/nsswitch.conf on both nodes are as follows:

#
# /etc/nsswitch.nis:
#
# @(#)B.11.11_LR
#
# An example file that could be copied over to /etc/nsswitch.conf; it
# uses NIS (YP) in conjunction with files.
#

passwd: files nis
group: files nis
hosts: files [NOTFOUND=continue] nis [NOTFOUND=continue] dns
#hosts: files
#hosts: dns [NOTFOUND=continue] nis [NOTFOUND=continue] files
networks: nis [NOTFOUND=return] files
protocols: nis [NOTFOUND=return] files
rpc: nis [NOTFOUND=return] files
publickey: nis [NOTFOUND=return] files
netgroup: nis [NOTFOUND=return] files
automount: files nis
aliases: files nis
services: files nis

One critical piece of information: even after the NIS server came back, neither node was able to resolve host names. They started resolving host names again only after the NIS clients were stopped and restarted.

There is also a core file in the /var/adm/cmcluster directory. Would sending that help in any way?

Thanks and Regards,

Mahesh

melvyn burnard · ‎08-30-2004

So you have a problem due to the NIS environment.
You need to investigate that and sort it out.
The core file is of no use in this respect.

My house is the bank's, my money the wife's, But my opinions belong to me, not HP!

Steve Lewis · ‎08-31-2004

You shouldn't make both servers reliant on the same, single machine that lies outside of the cluster. Your NIS server has become your SPOF.

I suspect that one problem may be with the other lines in your nsswitch file, which have nis before files such as:
networks nis files
protocols nis files

Change these entries round to put files first, then put a secondary NIS server in place, which isn't reliant on the primary NIS server.
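
To find every entry that needs swapping, something like the sketch below could be run against the file. It is illustrative only: the function name `nis_before_files` is invented here, and it simply treats any database that lists nis ahead of files (or lists nis with no files at all) as risky, ignoring bracketed criteria such as [NOTFOUND=return].

```python
import re

def nis_before_files(nsswitch_text):
    """Return the databases whose source list names 'nis' ahead of 'files'."""
    risky = []
    for raw in nsswitch_text.splitlines():
        # Ignore comments, including commented-out alternative entries.
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        database, sources = line.split(":", 1)
        # Remove criteria like [NOTFOUND=return] so only source names remain.
        names = re.sub(r"\[[^\]]*\]", " ", sources).split()
        if "nis" not in names:
            continue
        if "files" not in names or names.index("nis") < names.index("files"):
            risky.append(database.strip())
    return risky
```

Run against the nsswitch.conf posted earlier in this thread, it would report networks, protocols, rpc, publickey and netgroup.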

Maybe you could create an NIS server package in MC/SG? I am not sure how feasible that is, but it's an idea.

Steve Lewis · ‎08-31-2004

Also I just noticed that your servers use NFS, yet you also have

rpc: nis files

in your nsswitch file. RPC is a fundamental requirement of NFS, yet this entry makes it depend on the NIS server, which is a SPOF.

Change this entry around as well.

Geoff Wild · ‎08-31-2004

One of the first rules in a HA environment - never make your HA servers reliant on an external service...

For everything, it should be files first.

# cat nsswitch.conf
#
# /etc/nsswitch.files:
#
# @(#)B.11.11_LR
#
# An example file that could be copied over to /etc/nsswitch.conf; it
# does not use any name services.
#
passwd: files
group: files
hosts: files [NOTFOUND=CONTINUE] dns
services: files
networks: files
protocols: files
rpc: files
publickey: files
netgroup: files
automount: files
aliases: files

Rgds...Geoff

Proverbs 3:5,6 Trust in the Lord with all your heart and lean not on your own understanding; in all your ways acknowledge him, and he will make all your paths straight.

Mahesh Babbar · ‎08-31-2004

Thanks to all for responding.

Most probably the problem is in the nsswitch.conf file.

I have corrected the entries. However, to test the change I need to break the connection with the NIS server.

My production environment does not allow that test at the moment, so I will keep it on my TODO list for the next shutdown.
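
For that test window, a small timing script might help show whether lookups still stall once NIS is unreachable. This is only a sketch in Python: the helper name `timed_lookup` and the `CHECKS` list are made up for illustration, and the service name `hacl-cfg` and the `root` account are assumptions about what exists on the nodes.

```python
import pwd
import socket
import time

def timed_lookup(lookup, *args):
    """Run one name-service lookup and return (succeeded, seconds_elapsed)."""
    start = time.monotonic()
    try:
        lookup(*args)
        ok = True
    except (OSError, KeyError):
        # socket lookups raise OSError, pwd.getpwnam raises KeyError.
        ok = False
    return ok, time.monotonic() - start

# Each tuple: nsswitch database exercised, lookup function, arguments.
# "hacl-cfg" and "root" are assumed names; cmgtpn1 is the node from this thread.
CHECKS = [
    ("services", socket.getservbyname, ("hacl-cfg", "tcp")),
    ("protocols", socket.getprotobyname, ("tcp",)),
    ("hosts", socket.gethostbyname, ("cmgtpn1",)),
    ("passwd", pwd.getpwnam, ("root",)),
]

def run_checks(checks=CHECKS):
    """Return a list of (database, succeeded, seconds_elapsed) for each check."""
    return [(db,) + timed_lookup(fn, *args) for db, fn, args in checks]
```

Calling `run_checks()` once with NIS up and once with it disconnected, then comparing the elapsed times, would show which databases still wait on the NIS server.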

Best Regards

Mahesh
