TruCluster

Cannot communicate with the CAA daemon

SOLVED
Victor Semaska_3
Esteemed Contributor

Greetings,

We have a five-node cluster running Tru64 V5.1B PK5. I tried to relocate a service to one of our nodes and got the error:

Cannot communicate with the CAA daemon

It appears that the caad daemon on that node has aborted. The server has been up for a couple of months. The Cluster Admin Guide says to run the /usr/sbin/caad command.

This is a heavily used production cluster, so I don't want to cause more problems. Is there anything I should be aware of before running this command? caad is running on the other 4 nodes.

Thanks,
Vic
There are 10 kinds of people, one that understands binary and one that doesn't.
6 REPLIES
Mark Poeschl_2
Honored Contributor

Re: Cannot communicate with the CAA daemon

I think I'd do a little more digging before just restarting the caa daemon. There is supposed to be an esmd process that detects when that daemon is missing and restarts it. Is there an 'esmd' process on your server? To try to find the cause of the caa daemon dying, you might try some things along the lines of:

# find /var/adm/syslog.dated -follow -name daemon.log -exec grep -i caad {} \;

- OR -

# evmget -f "[NAME sys.unix.clu.caa] & [Prio gt 200] & [BEFore YYYY:MMM:DD:HH:MM:SS] & [SINce YYYY:MMM:DD:HH:MM:SS]" | evmshow -D

(Insert time specifiers in the above command as appropriate.)
Ivan Ferreira
Honored Contributor

Re: Cannot communicate with the CAA daemon

I had similar problems recently.

Ensure that the license is loaded.

Ensure that the cluster interconnect interface is up and working.

Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Ivan Ferreira
Honored Contributor

Re: Cannot communicate with the CAA daemon

Also, ensure that you can rlogin/rsh to that host.
Victor Semaska_3
Esteemed Contributor

Re: Cannot communicate with the CAA daemon

This is what I get with find:

# find /var/adm/syslog.dated -follow -name daemon.log -exec grep -i caad {} \;
Aug 5 14:01:36 csubds2 CAAD[1049398]: RTD #0: Action Script /var/cluster/caa/script/clustercron.scr(check) timed out! (timeout=60)
Aug 5 14:01:42 csubds2 CAAD[1049398]: An error was encountered while polling `clustercron`
Aug 6 13:18:16 csubds2 CAAD[1385896]: Couldn't detatch TTY! (No such device or address)

Aug 5 14:01 is when another node in the cluster crashed. The evmget command produced nothing.

The interesting thing is that I ran lsof on esmd and it has over 4,100 files open, similar to this:

esmd 1048751 root 4088u unix 0x26c18fc0 0t0 ->(none)

Sounds like esmd needs to be restarted.
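One way to put a number on that observation is to count the descriptors of type "unix" in the lsof output. This is only a minimal sketch: the function name `count_unix_fds` is made up for illustration, and it assumes lsof prints its default column order (COMMAND PID USER FD TYPE ...), as in the line quoted above.

```shell
#!/bin/sh
# Count descriptors whose TYPE column is "unix" in lsof output read from stdin.
# Assumption: default lsof column order, so TYPE is the 5th field.
# The header line is skipped naturally because its 5th field is "TYPE".
count_unix_fds() {
    awk '$5 == "unix" { n++ } END { print n + 0 }'
}

# Example use on the affected member (PID taken from the lsof line above):
#   lsof -p 1048751 | count_unix_fds
```

Running the same count on a healthy member gives a baseline to compare against.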

The licenses are loaded and the cluster interconnect interface (Memory Channel) is up and working.

I can rlogin/rsh to the host.
Mark Poeschl_2
Honored Contributor
Solution

Re: Cannot communicate with the CAA daemon

My esmd process only has three open files that look like the entry you listed. The man page for 'esmd' makes it sound like you can't run it from the command line; it must be started by its inittab entry. My default entry for esmd in /etc/inittab has a 'respawn' specifier, so you can just kill the process and init will start a new one. (I tried this on a test cluster and it seemed to work fine.)

Failing that, I think you're stuck with rebooting this cluster member.
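Before killing esmd, it may be worth confirming that the inittab entry on the affected member really is set to respawn. The sketch below is an illustration only: the function name `esmd_inittab_action` is hypothetical, it assumes the usual four-field inittab layout (id:runlevels:action:process), and `<esmd-pid>` is a placeholder for the PID reported by ps.

```shell
#!/bin/sh
# Print the action field of any non-comment inittab entry whose command
# mentions esmd.
# Assumption: four colon-separated fields, id:runlevels:action:process.
# Reads /etc/inittab unless another file is given as the first argument.
esmd_inittab_action() {
    awk -F: '$0 !~ /^#/ && $4 ~ /esmd/ { print $3 }' "${1:-/etc/inittab}"
}

# Hypothetical usage on the affected member:
#   [ "$(esmd_inittab_action)" = "respawn" ] && kill <esmd-pid>
```

If the function prints anything other than respawn, or nothing at all, init will not restart esmd after the kill, so stop and check the entry first.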
Victor Semaska_3
Esteemed Contributor

Re: Cannot communicate with the CAA daemon

Killing esmd fixed the problem.