Package doesn't start

pfred
Occasional Visitor

I've been trying to configure SG, but I've run into a problem I can't solve.
When the parameters SERVICE_NAME[0], SERVICE_CMD[0] and SERVICE_RESTART[0] are set in the control script, the package fails to start. If these parameters are commented out, everything works fine.
The service name is identical in the control script and the package configuration file.



package configuration file (pkg1.config):
PACKAGE_NAME pkg1
PACKAGE_TYPE FAILOVER
NODE_NAME sg1
NODE_NAME sg2
AUTO_RUN YES
NODE_FAIL_FAST_ENABLED NO
RUN_SCRIPT /usr/local/cmcluster/conf/pkg1/pkg1.sh
HALT_SCRIPT /usr/local/cmcluster/conf/pkg1/pkg1.sh
RUN_SCRIPT_TIMEOUT NO_TIMEOUT
HALT_SCRIPT_TIMEOUT NO_TIMEOUT
SUCCESSOR_HALT_TIMEOUT NO_TIMEOUT
FAILOVER_POLICY CONFIGURED_NODE
FAILBACK_POLICY MANUAL
PRIORITY NO_PRIORITY
MONITORED_SUBNET 10.1.40.0
SERVICE_NAME pkg1.monitor
SERVICE_FAIL_FAST_ENABLED no
SERVICE_HALT_TIMEOUT 300


control script (pkg1.sh):
sglinux[0]=1 >/dev/null 2>&1
if [ $? -gt 0 ]; then
exec /bin/bash2 -c "$0 $*"
exit 1
fi
. ${SGCONFFILE:=/etc/cmcluster.conf}
PATH=$SGSBIN:/bin:/sbin:/usr/bin:/usr/sbin
GFS="NO"
DATA_REP="none"
VG[0]=vg01
LV[0]=/dev/vg01/lvol00; FS[0]=/srv/iscsi1; FS_TYPE[0]="ext3"; FS_MOUNT_OPT[0]="-o rw"
FS_UMOUNT_OPT[0]=""; FS_FSCK_OPT[0]=""
FS_UMOUNT_COUNT=1
FS_MOUNT_RETRY_COUNT=0
CONCURRENT_FSCK_OPERATIONS=1
CONCURRENT_MOUNT_AND_UMOUNT_OPERATIONS=1
IP[0]=10.1.40.223
SUBNET[0]=10.1.40.0
PR_TYPE_WERO="--prout-type=5"
ABORT_KEY="--param-sark"
SG_PERSIST_RDKEYS="sg_persist -k"
SG_PERSIST_RDRESV="sg_persist -r"
SG_PERSIST_REG="sg_persist --out -G --param-sark"
SG_PERSIST_REG_IGN="sg_persist --out -I --param-sark"
SG_PERSIST_UNREG="sg_persist --out -G --param-rk"
SG_PERSIST_RESV="sg_persist --out -R --param-rk"
SG_PERSIST_PREEMPT="sg_persist --out -A --param-rk"
SG_PERSIST_CLEAR="sg_persist --out -C --param-rk"
SERVICE_NAME[0]=pkg1.monitor
SERVICE_CMD[0]="/usr/local/cmcluster/conf/pkg1/pkg1.mon"
SERVICE_RESTART[0]="-r 2"


pkg1.mon (the content is trivial and probably not important at the moment; the script runs correctly, and a new line is appended to the czas.log file during cmrunpkg):
#!/bin/sh
/bin/echo `/bin/date` >> /tmp/czas.log


The following lines appear in the log files when I run cmrunpkg with the SERVICE_xxx[0] lines in pkg1.sh uncommented:

/var/log/messages

May 11 12:55:51 sg1 xinetd[3016]: EXIT: hacl-cfgudp status=0 pid=7192 duration=17(sec)
May 11 12:55:55 sg1 xinetd[3016]: START: hacl-cfgudp pid=7212 from=127.0.0.1
May 11 12:55:55 sg1 cmrunpkg: cmrunpkg -v pkg1
May 11 12:55:55 sg1 xinetd[3016]: START: hacl-cfg pid=7217 from=127.0.0.1
May 11 12:55:55 sg1 xinetd[7217]: USERID: hacl-cfg UNIX :root
May 11 12:55:55 sg1 cmrunpkg: Request from root on node sg1 to start package pkg1
May 11 12:55:55 sg1 cmcld[17505]: Request from root on node sg1 to start package pkg1
May 11 12:55:55 sg1 cmcld[17505]: Request from node sg1 to start package pkg1 on node sg1.
May 11 12:55:55 sg1 cmcld[17505]: Executing '/usr/local/cmcluster/conf/pkg1/pkg1.sh start' for package pkg1, as service PKG*44033.
May 11 12:55:55 sg1 cmserviced[17514]: Request to perform run service PKG*44033
May 11 12:55:55 sg1 xinetd[3016]: START: hacl-cfg pid=7235 from=127.0.0.1
May 11 12:55:55 sg1 xinetd[7235]: USERID: hacl-cfg UNIX :root
May 11 12:55:55 sg1 xinetd[3016]: EXIT: hacl-cfg status=0 pid=7235 duration=0(sec)
May 11 12:55:55 sg1 xinetd[3016]: START: hacl-cfg pid=7250 from=127.0.0.1
May 11 12:55:55 sg1 xinetd[7250]: USERID: hacl-cfg UNIX :root
May 11 12:55:55 sg1 xinetd[3016]: EXIT: hacl-cfg status=0 pid=7250 duration=0(sec)
May 11 12:55:56 sg1 kernel: kjournald starting. Commit interval 5 seconds
May 11 12:55:56 sg1 kernel: EXT3 FS on dm-2, internal journal
May 11 12:55:56 sg1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
May 11 12:55:56 sg1 cmmodnet: cmmodnet -a -i 10.1.40.223 10.1.40.0
May 11 12:55:56 sg1 avahi-daemon[2835]: Registering new address record for 10.1.40.223 on eth0.
May 11 12:55:56 sg1 avahi-daemon[2835]: Withdrawing address record for 10.1.40.223 on eth0.
May 11 12:55:56 sg1 avahi-daemon[2835]: Registering new address record for 10.1.40.223 on eth0.
May 11 12:55:56 sg1 cmrunserv: cmrunserv -r 2 pkg1.monitor >> /usr/local/cmcluster/conf/pkg1/pkg1.sh.log 2>&1 /usr/local/cmcluster/conf/pkg1/pkg1.mon
May 11 12:55:56 sg1 cmserviced[17514]: Request to perform run service pkg1.monitor
May 11 12:55:56 sg1 cmserviced[17514]: Service PKG*44033 terminated due to an exit(0).
May 11 12:55:56 sg1 cmcld[17505]: Started package pkg1 on node sg1.
May 11 12:55:56 sg1 xinetd[3016]: EXIT: hacl-cfg status=0 pid=7217 duration=1(sec)
May 11 12:55:56 sg1 cmserviced[17514]: Service pkg1.monitor terminated due to an exit(0).
May 11 12:55:56 sg1 cmserviced[17514]: Automatically restarted service pkg1.monitor for the 1st time after failure.
May 11 12:55:56 sg1 cmserviced[17514]: Service pkg1.monitor terminated due to an exit(0).
May 11 12:55:56 sg1 cmserviced[17514]: Automatically restarted service pkg1.monitor for the 2nd time after failure.
May 11 12:55:56 sg1 cmserviced[17514]: Service pkg1.monitor terminated due to an exit(0).
May 11 12:55:56 sg1 cmserviced[17514]: Request to perform run service PKG*44033
May 11 12:55:56 sg1 cmcld[17505]: Service pkg1.monitor in package pkg1 has gone down.
May 11 12:55:56 sg1 cmcld[17505]: Disabled node sg1 from running package pkg1.
May 11 12:55:56 sg1 cmcld[17505]: Failing package pkg1 on node sg1 due to service failure.
May 11 12:55:56 sg1 cmcld[17505]: Request from node sg1 to fail package pkg1 on node sg1.
May 11 12:55:56 sg1 cmcld[17505]: Executing '/usr/local/cmcluster/conf/pkg1/pkg1.sh stop' for package pkg1, as service PKG*44033.
May 11 12:55:56 sg1 cmhaltserv: cmhaltserv pkg1.monitor
May 11 12:55:56 sg1 cmserviced[17514]: Request to perform halt service pkg1.monitor
May 11 12:55:56 sg1 cmmodnet: cmmodnet -r -i 10.1.40.223 10.1.40.0
May 11 12:55:56 sg1 avahi-daemon[2835]: Withdrawing address record for 10.1.40.223 on eth0.
May 11 12:55:56 sg1 xinetd[3016]: START: hacl-cfg pid=7391 from=127.0.0.1
May 11 12:55:56 sg1 xinetd[7391]: USERID: hacl-cfg UNIX :root
May 11 12:55:56 sg1 xinetd[3016]: EXIT: hacl-cfg status=0 pid=7391 duration=0(sec)
May 11 12:55:56 sg1 xinetd[3016]: START: hacl-cfg pid=7401 from=127.0.0.1
May 11 12:55:56 sg1 xinetd[7401]: USERID: hacl-cfg UNIX :root
May 11 12:55:56 sg1 xinetd[3016]: EXIT: hacl-cfg status=0 pid=7401 duration=0(sec)
May 11 12:55:56 sg1 cmserviced[17514]: Service PKG*44033 terminated due to an exit(0).
May 11 12:55:56 sg1 cmcld[17505]: Halted package pkg1 on node sg1.
May 11 12:56:11 sg1 xinetd[3016]: EXIT: hacl-cfgudp status=0 pid=7212 duration=16(sec)
May 11 12:56:18 sg1 xinetd[3016]: START: hacl-cfgudp pid=7425 from=127.0.0.1
May 11 12:56:18 sg1 xinetd[3016]: START: hacl-cfg pid=7430 from=127.0.0.1
May 11 12:56:18 sg1 xinetd[7430]: USERID: hacl-cfg UNIX :root
May 11 12:56:18 sg1 xinetd[3016]: EXIT: hacl-cfg status=0 pid=7430 duration=0(sec)
May 11 12:56:33 sg1 xinetd[3016]: EXIT: hacl-cfgudp status=0 pid=7425 duration=15(sec)


pkg1.sh.log

###### Node "sg1.localdomain": Starting package at Mon May 11 12:55:55 CEST 2009 ######
WARNING: /dev/sdb1 does not support Persistent Reservations
WARNING: Persistent Reservations disabled for package
WARNING: In some configurations this may result in data
WARNING: corruption under certain conditions. Please
WARNING: check the documentation for more details.
PR in: command not supported
Clearing existing PR key
Attempting to addtag to vg vg01...
addtag was successful on vg vg01.
May 11 12:55:56 - Node "sg1.localdomain": Activating volume group vg01 .
May 11 12:55:56 - Node "sg1.localdomain": Checking filesystems:
/dev/vg01/lvol00
e2fsck 1.39 (29-May-2006)
/dev/vg01/lvol00: clean, 11/128000 files, 8444/256000 blocks
May 11 12:55:56 - Node "sg1.localdomain": Mounting /dev/vg01/lvol00 at /srv/iscsi1
May 11 12:55:56 - Node "sg1.localdomain": Adding IP address 10.1.40.223 to subnet 10.1.40.0
May 11 12:55:56 - Node "sg1.localdomain": Starting service pkg1.monitor using
"/usr/local/cmcluster/conf/pkg1/pkg1.mon"
###### Node "sg1.localdomain": Package start completed at Mon May 11 12:55:56 CEST 2009 ######

####### Node "sg1.localdomain": Halting package at Mon May 11 12:55:56 CEST 2009 #######
May 11 12:55:56 - Node "sg1.localdomain": Halting service pkg1.monitor
cmhaltserv: Service name pkg1.monitor is not running.
May 11 12:55:56 - Node "sg1.localdomain": Remove IP address 10.1.40.223 from subnet 10.1.40.0
May 11 12:55:56 - Node "sg1.localdomain": Unmounting filesystem on /srv/iscsi1
May 11 12:55:56 - Node "sg1.localdomain": Deactivating volume group vg01
Attempting to deltag to vg vg01...
deltag was successful on vg vg01.
PR in: command not supported
###### Node "sg1.localdomain": Package halt completed at Mon May 11 12:55:56 CEST 2009 ######



My configuration:
Red Hat Enterprise Linux 5.3, SG A.11.19.00 (demo version)


What am I doing wrong? Does anybody have any suggestions?


Regards
3 REPLIES
Jozef_Novak
Respected Contributor

Re: Package doesn't start

Hello,

From what I can see in the log files you posted:

- you are trying to run and monitor a service that consists of a single script

- the script is run by Serviceguard during package startup, performs the commands and exits in a normal way (exit 0)

- Serviceguard interprets this exit as a service failure and tries to restart the service. The script is run again and again exits with return code 0

- after restarting the service twice (so the script has run three times in total, as the log shows), Serviceguard assumes that the service is unable to run on the current node and halts the package due to a service failure

I believe this works as expected. In my opinion, the service to be monitored should be a constantly running process.

J.
Matti_Kurkela
Honored Contributor

Re: Package doesn't start

Your "service" script (/usr/local/cmcluster/conf/pkg1/pkg1.mon) is exiting. It should keep running forever when everything is OK, and exit only if something is wrong. When the service script exits, ServiceGuard assumes something may be wrong. As configured, it attempts to restart the service script exactly 2 times.

If the purpose of the service script is to monitor the state of the real application, the service script should be essentially an infinite loop running the necessary tests and then sleeping for a while before re-testing.

If the actual application is slow to start up, you might want to add an extra sleep in the service script just before entering the testing loop, so that the application has plenty of time to start up before the testing loop begins.
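
As a rough illustration of the advice above, here is a minimal sketch of what a long-running pkg1.mon could look like. It is only a sketch under assumptions: the variable names (STARTUP_DELAY, CHECK_INTERVAL, MOUNT_POINT, LOG_FILE), the "run" argument and the mount-point check are placeholders invented for this example, not Serviceguard parameters, and the real test should probe the actual application.

```shell
#!/bin/sh
# Sketch of a long-running pkg1.mon. All variable names below are
# illustrative assumptions, not Serviceguard parameters.

STARTUP_DELAY=${STARTUP_DELAY:-30}     # seconds to wait before the first check
CHECK_INTERVAL=${CHECK_INTERVAL:-10}   # seconds between checks
MOUNT_POINT=${MOUNT_POINT:-/srv/iscsi1}
LOG_FILE=${LOG_FILE:-/tmp/czas.log}

# Placeholder health check: passes while the package filesystem is mounted.
# Replace it with a test of the real application.
check_app() {
    grep -q " $MOUNT_POINT " /proc/mounts
}

# Keep looping while the check passes; return 1 as soon as it fails.
monitor() {
    sleep "$STARTUP_DELAY"
    while check_app; do
        sleep "$CHECK_INTERVAL"
    done
    echo "$(date): check failed, monitor exiting" >> "$LOG_FILE"
    return 1
}

# Enter the loop only when called with the (hypothetical) "run" argument,
# so the functions can be loaded and tried out without blocking.
if [ "${1:-}" = "run" ]; then
    monitor
    exit 1
fi
```

With this layout, SERVICE_CMD[0] would presumably be set to "/usr/local/cmcluster/conf/pkg1/pkg1.mon run". The script then exits only when the check fails, which is the one case in which Serviceguard should count a service failure.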

The "AUTO_RUN YES" setting is effective only when starting the entire cluster (usually with the cmruncl command). If you are starting the package with cmrunpkg, you must then "arm" the fail-over mechanism by running "cmmodpkg -e ". When you halt the package manually using the cmhaltpkg command, the AUTO_RUN setting is automatically disabled, disarming the fail-over.
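
For a manual start, the two steps could be wrapped roughly like this. The helper name start_and_arm is made up for this example; cmrunpkg -v is the invocation shown in the posted log, and cmmodpkg -e is the enable command mentioned above, here given the package name pkg1 from the posted configuration.

```shell
# start_and_arm PACKAGE (hypothetical helper):
# start the package on the local node, then re-enable package switching
# so that fail-over is armed again.
start_and_arm() {
    pkg=$1
    cmrunpkg -v "$pkg" && cmmodpkg -e "$pkg"
}

# Example: start_and_arm pkg1
```

The && means switching is enabled only if the package actually started.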

MK
pfred
Occasional Visitor

Re: Package doesn't start

Thank you for your help and thorough explanations.
I'm a beginner with SG, so I couldn't solve the problem myself.
Fortunately, I understand it now.

Regards
pfred