Reboot after panic: SafetyTimer expired

 
DeafFrog
Valued Contributor

Dear Gurus,

Here is the output from cmfmtfr frdump.cmcld.9.
I have checked the system logs, and there is no sign of a resource crunch that would explain the safety timer expiring.

=======
May 23 11:53:08:1:CLM:09: Frequent Action - HB from node apple
May 23 11:53:08:1:CLM:09: Frequent Action - Updated safety time to 79361781
May 23 11:53:08:1:CLM:09: Frequent Action - HB to node apple, 64199
May 23 11:53:09:1:CLM:09: Frequent Action - HB from node apple
May 23 11:53:09:1:CLM:09: Frequent Action - Updated safety time to 79361882
May 23 11:53:09:1:CLM:09: Frequent Action - HB to node apple, 64200
May 23 11:53:10:1:CLM:09: Frequent Action - HB from node apple
May 23 11:53:10:1:CLM:09: Frequent Action - Updated safety time to 79361983
May 23 11:53:10:1:CLM:09: Frequent Action - HB to node apple, 64201
May 23 11:53:11:1:CLM:09: Frequent Action - HB from node apple
May 23 11:53:11:1:CLM:09: Frequent Action - Updated safety time to 79362084
May 23 11:53:11:1:CLM:09: Frequent Action - HB to node apple, 64202
May 23 11:53:14:4:LOC:07: Event - Handling reading from local connection on fd 20.
May 23 11:53:14:2:STA:06: Action - sdb request 7 on port 8
May 23 11:53:14:2:STA:06: Action - srv_sdb_set_status_private: Set status 1 0/1/2/0 for client=4538 (size 124)
May 23 11:53:14:2:SDB:06: Action - st_set_status: Set status entry for 0/1/2/0 of type 1 on node_id 1.
May 23 11:53:14:2:SDB:06: Action - st_set_status: valuesize = 124, historysize = 0.
May 23 11:53:14:4:LOC:07: Event - Handling reading from local connection on fd 20.
May 23 11:53:14:2:STA:06: Action - sdb request 7 on port 8
May 23 11:53:14:2:STA:06: Action - srv_sdb_set_status_private: Set status 1 0/4/1/0/6/0 for client=4538 (size 124)
May 23 11:53:14:2:SDB:06: Action - st_set_status: Set status entry for 0/4/1/0/6/0 of type 1 on node_id 1.
May 23 11:53:14:2:SDB:06: Action - st_set_status: valuesize = 124, historysize = 0.
May 23 11:53:14:4:LOC:07: Event - Handling reading from local connection on fd 20.
May 23 11:53:14:2:STA:06: Action - sdb request 7 on port 8
May 23 11:53:14:2:STA:06: Action - srv_sdb_set_status_private: Set status 4 10.0.4.0 for client=4538 (size 76)
May 23 11:53:14:2:SDB:06: Action - st_set_status: Set status entry for 10.0.4.0 of type 4 on node_id 1.
May 23 11:53:14:2:SDB:06: Action - st_set_status: valuesize = 76, historysize = 0.
May 23 11:53:14:2:PKG:06: Action - pm_subnet_callback: Check for subnet 10.0.4.0 with status of 1
May 23 11:53:14:2:PKG:06: Action - pm_subnet_check: Check for package 36866 with subnet status of 1
May 23 11:53:14:3:PKG:06: Action - pm_subnet_check: subnet down, notify not eligible
May 23 11:53:14:2:PKG:06: Action - pm_subnet_check: Check for package 26883 with subnet status of 1
May 23 11:53:14:3:PKG:06: Action - pm_subnet_check: subnet down, notify not eligible
May 23 11:53:14:2:PKG:06: Action - pm_subnet_check: Check for package 57089 with subnet status of 1
May 23 11:53:14:3:PKG:06: Action - pm_subnet_check: subnet down, notify not eligible
May 23 11:53:14:4:LOC:07: Event - Handling reading from local connection on fd 20.
May 23 11:53:14:2:STA:06: Action - sdb request 7 on port 8
May 23 11:53:14:2:STA:06: Action - srv_sdb_set_status_private: Set status 4 10.0.4.0 for client=4538 (size 76)
May 23 11:53:14:2:SDB:06: Action - st_set_status: Set status entry for 10.0.4.0 of type 4 on node_id 1.
May 23 11:53:14:2:SDB:06: Action - st_set_status: valuesize = 76, historysize = 0.
May 23 11:53:14:2:SDB:06: Action - Status entry value for 10.0.4.0 of type 4 on node_id 1 did not change.
May 23 11:53:14:4:LOC:07: Event - Handling reading from local connection on fd 20.
May 23 11:53:14:2:STA:06: Action - sdb request 7 on port 8
May 23 11:53:14:2:STA:06: Action - srv_sdb_set_status_private: Set status 1 0/1/2/0 for client=4538 (size 124)
May 23 11:53:14:2:SDB:06: Action - st_set_status: Set status entry for 0/1/2/0 of type 1 on node_id 1.
May 23 11:53:14:2:SDB:06: Action - st_set_status: valuesize = 124, historysize = 0.
May 23 11:53:14:2:SDB:06: Action - Status entry value for 0/1/2/0 of type 1 on node_id 1 did not change.
May 23 11:53:14:4:LOC:07: Event - Handling reading from local connection on fd 20.
May 23 11:53:14:2:STA:06: Action - sdb request 7 on port 8
May 23 11:53:14:2:STA:06: Action - srv_sdb_set_status_private: Set status 1 0/4/1/0/6/0 for client=4538 (size 124)
May 23 11:53:14:2:SDB:06: Action - st_set_status: Set status entry for 0/4/1/0/6/0 of type 1 on node_id 1.
May 23 11:53:14:2:SDB:06: Action - st_set_status: valuesize = 124, historysize = 0.
May 23 11:53:17:0:CLM:08: Event - Timed out node apple. It may have failed.
May 23 11:53:17:2:CLM:08: Action - apple: eligible=1, received=0
May 23 11:53:17:2:CLM:08: Action - apple: no_st_update, ready_st=1, first_hb_recvd=1
May 23 11:53:17:2:CLM:08: External error - Detected failure event for node apple
May 23 11:53:17:1:UTL:08: Action - Automatic dump request is issued.
May 23 11:53:17:2:UTL:08: Action - fr_auto_dump_cmcld() queued an event
May 23 11:53:17:1:CLM:08: Action - Removing node apple from the running cluster
May 23 11:53:17:2:CLM:08: Action - Disconnecting HB comm to node apple
May 23 11:53:17:3:REM:08: Action - Config status for node 2 in service 2 is DISCONNECT
May 23 11:53:17:3:REM:08: Action - Inbound to 10.0.4.3 at node 2 in service 2 is CLOSING
May 23 11:53:17:3:REM:08: Action - Outbound to 10.0.4.3 at node 2 in service 2 is CLOSING
May 23 11:53:17:3:REM:08: Action - Connection to 10.0.4.3 at node 2 in service 2 is CLOSING
May 23 11:53:17:3:REM:08: Action - Node 2 in service 2 is CLOSING
May 23 11:53:17:2:UTL:06: Action - fr_dump_cmcld_event_handler() get request
May 23 11:53:17:2:CLM:08: Action - Decrementing old votes to 1
May 23 11:53:17:2:CLM:08: Action - Changed apple's cm_state from RUNNING to RECONFIG
May 23 11:53:17:2:CLM:08: Action - Removing node apple from heartbeat list
May 23 11:53:17:2:CLM:08: Action - Cancelled HB timers for node 2
May 23 11:53:17:1:CLM:08: Action - Got exactly 50% of the votes: 1 out of 2 last active nodes.
May 23 11:53:17:0:CLM:08: Action - Attempting to adjust cluster membership
May 23 11:53:17:0:CLM:08: Action - Beginning standard partial election
May 23 11:53:17:1:CLM:08: Action - Safety time set for 68.96 seconds from now
May 23 11:53:17:2:CLM:08: Action - Changed inbapplive1's cm_state from 5 to RECONFIG
May 23 11:53:17:2:SDB:08: Action - st_set_status: Set status entry for CM_STATUS_NAME of type 3 on node_id 1.
May 23 11:53:17:2:SDB:08: Action - st_set_status: valuesize = 104, historysize = 0.
May 23 11:53:17:2:UTL:08: Action - Starting sync timer of 63.960000 seconds
May 23 11:53:17:2:CLM:08: Action - Entering CANDIDATE state
May 23 11:53:17:1:CLM:08: Action - Did not receive all votes: 1 out of 2
May 23 11:53:17:1:CLM:08: Action - All votes (100%) are required at this point.
May 23 11:53:17:2:CLM:08: Action - Starting election timer of 7000000 usec
May 23 11:53:17:1:UTL:06: Action - Automatic dump is scheduled after 1000000 usec.
May 23 11:53:17:2:STA:06: Action - srv_sdb_process_callbacks: Received callbk for type(3), name(CM_STATUS_NAME), priority(25).
May 23 11:53:17:2:STA:06: Action - srv_sdb_process_sync_callbacks: No client is registered for 3 CM_STATUS_NAME at 25.
May 23 11:53:17:2:STA:06: Action - sdb async callbacks for 3 at 25
May 23 11:53:17:3:STA:06: Action - sdb client 4538 port 8
May 23 11:53:17:3:REM:09: Action - Inbound to 10.0.4.3 at node 2 in service 2 is NONE
May 23 11:53:17:3:REM:09: Action - Outbound to 10.0.4.3 at node 2 in service 2 is NONE
May 23 11:53:17:3:REM:09: Action - Connection to 10.0.4.3 at node 2 in service 2 is NONE
May 23 11:53:17:3:REM:09: Action - Busy conns is 0 for node 2 in service 2
May 23 11:53:17:3:REM:09: Action - Node 2 in service 2 is DROPPING
May 23 11:53:17:3:REM:09: Action - Node 2 in service 2 is DROPPED
May 23 11:53:18:2:UTL:08: Action - FR: start dump
May 23 11:53:18:3:UTL:08: Action - Automatic dump is not scheduled now.

Regards ,
Rahul

FrogIsDeaf
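
For anyone reading the trace above, one quick check is to measure how long the node went without a heartbeat before the "Timed out node" event. The sketch below is only an illustration and assumes the line layout shown in the dump (timestamp, level, module, then the message); the function name `heartbeat_gaps` and the placeholder year are made up for the example and are not part of any Serviceguard tool.

```python
import re
from datetime import datetime

# Assumed line shapes, copied from the dump in this thread:
#   May 23 11:53:11:1:CLM:09: Frequent Action - HB from node apple
#   May 23 11:53:17:0:CLM:08: Event - Timed out node apple. It may have failed.
HB_RE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d):\d+:CLM:\d+: .*HB from node (\S+)")
TIMEOUT_RE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d):\d+:CLM:\d+: Event - Timed out node (\S+?)\.")


def parse_ts(text):
    # The log carries no year, so an arbitrary placeholder year is prepended.
    # Only differences between timestamps are used.
    return datetime.strptime("2000 " + text, "%Y %b %d %H:%M:%S")


def heartbeat_gaps(lines):
    """Return (node, seconds since last received HB) for each timeout event."""
    last_hb = {}
    gaps = []
    for line in lines:
        m = HB_RE.match(line)
        if m:
            last_hb[m.group(2)] = parse_ts(m.group(1))
            continue
        m = TIMEOUT_RE.match(line)
        if m and m.group(2) in last_hb:
            delta = parse_ts(m.group(1)) - last_hb[m.group(2)]
            gaps.append((m.group(2), delta.total_seconds()))
    return gaps


sample = [
    "May 23 11:53:11:1:CLM:09: Frequent Action - HB from node apple",
    "May 23 11:53:11:1:CLM:09: Frequent Action - HB to node apple, 64202",
    "May 23 11:53:17:0:CLM:08: Event - Timed out node apple. It may have failed.",
]
gaps = heartbeat_gaps(sample)
print(gaps)
```

On the lines pasted above, the last "HB from node apple" is at 11:53:11 and the timeout event is at 11:53:17, a gap of 6 seconds, which matches the 6.00 second node timeout shown in the cmviewconf output later in this thread.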
Steven E. Protter
Exalted Contributor

Re: Reboot after panic: SafetyTimer expired

Shalom Rahul,

Was the heartbeat network interrupted? This looks like one cluster node went TOC to avoid data corruption.



SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
TTr
Honored Contributor

Re: Reboot after panic: SafetyTimer expired

You need to provide more details here. What are you asking, or what happened? Did one of the nodes go down on schedule or unexpectedly, or did the HB communication fail? There could be a lot of conditions that can lead to it. Check here: http://forums11.itrc.hp.com/service/forums/questionanswer.do?threadId=927812
DeafFrog
Valued Contributor

Re: Reboot after panic: SafetyTimer expired

V. Shalom Steven ,

Many thanks for such a quick reply. I just had a talk with one of our networking folks;
they deny any activity at their end. Also, the other cluster, though in a different VLAN, doesn't show similar logs.

Regards ,
FrogIsDeaf
Michael Steele_2
Honored Contributor

Re: Reboot after panic: SafetyTimer expired

Hi

'cmscancl -n node -o /tmp/outfile' will perform linkloop tests on all NICs for all nodes. Note that MCSG is MAC address driven, not IP address driven.

'cmreadlog /var/opt/cmon/cmond.d', and,
'cmreadlog /var/opt/sgmgr/#######mgr.log' if you have a later version of MCSG. These commands and logs were not available in earlier versions.

Also refer to 'cmgetconf' to get the current cluster configuration from within the kernel (as opposed to using the cluster.ascii file); 'cmviewconf' is also useful in troubleshooting.
Support Fatherhood - Stop Family Law
DeafFrog
Valued Contributor

Re: Reboot after panic: SafetyTimer expired

Hello TTr,
I want to know whether the cluster configuration, network, etc. could have caused this.
One of the nodes went down unexpectedly.
As far as I can make out, the HBAs are OK.

Hello Michael ,

linkloop between the nodes is OK.
I have the latest version of SG, but there is no /var/opt/cmon/cmond.d,
and /var/opt/sgmgr/hpsmh/log is zero length.
The cmviewconf output is attached.

Regards,
FrogIsDeaf
DeafFrog
Valued Contributor

Re: Reboot after panic: SafetyTimer expired

I am not able to attach .......



Cluster information:

cluster name: mwcluster1
version: 0
flags: 12 (single cluster lock)
heartbeat interval: 1.00 (seconds)
node timeout: 6.00 (seconds)
heartbeat connection timeout: 0.00 (seconds)
io timeout extension: 0.00 (seconds)
auto start timeout: 600.00 (seconds)
network polling interval: 2.00 (seconds)
network failure detection: INOUT
first lock vg name: /dev/vg02
second lock vg name: (not configured)
qs host: (not configured)

Cluster Node information:

Node ID 2:
Node name: orange
first lock pv name: /dev/dsk/c10t0d1
first lock disk interface type: fcd_vbus

Network ID 1:
ppa: 0
old_ppa: 0
mac addr: 0x001cc4fca46d
hardware path: 0/1/2/0
network interface name: lan0

IPv4 Information:
subnet: 10.0.4.0
subnet mask: 255.255.255.0
ip address: 10.0.4.3

route id: 0

IPv6 Information:

flags: 5 (Heartbeat Network)
bridged net ID: 1

Network ID 3:
ppa: 2
old_ppa: 0
mac addr: 0x00215a9d6008
hardware path: 0/2/1/0/6/0
network interface name: lan2

IPv4 Information:
subnet: 67.0.0.0
subnet mask: 255.0.0.0
ip address: 67.2.3.2

route id: 0

IPv6 Information:

flags: 4 (Non-Heartbeat Network)
bridged net ID: 2

Network ID 2:
ppa: 3
old_ppa: 0
mac addr: 0x00215a9d6014
hardware path: 0/4/1/0/6/0
network interface name: lan3

IPv4 Information:
subnet: 0.0.0.0
subnet mask: 0.0.0.0
ip address: 0.0.0.0

route id: 0

IPv6 Information:

flags: 2 (Non-Heartbeat Network)
bridged net ID: 1

Node ID 1:
Node name: apple
first lock pv name: /dev/dsk/c14t0d1
first lock disk interface type: fcd_vbus

Network ID 1:
ppa: 0
old_ppa: 0
mac addr: 0x001cc4fca4ad
hardware path: 0/1/2/0
network interface name: lan0

IPv4 Information:
subnet: 10.0.4.0
subnet mask: 255.255.255.0
ip address: 10.0.4.2

route id: 0

IPv6 Information:

flags: 5 (Heartbeat Network)
bridged net ID: 1

Network ID 3:
ppa: 2
old_ppa: 0
mac addr: 0x00215a9d6017
hardware path: 0/2/1/0/6/0
network interface name: lan2

IPv4 Information:
subnet: 67.0.0.0
subnet mask: 255.0.0.0
ip address: 67.2.3.1

route id: 0

IPv6 Information:

flags: 4 (Non-Heartbeat Network)
bridged net ID: 2

Network ID 2:
ppa: 3
old_ppa: 0
mac addr: 0x00215a9d6019
hardware path: 0/4/1/0/6/0
network interface name: lan3

IPv4 Information:
subnet: 0.0.0.0
subnet mask: 0.0.0.0
ip address: 0.0.0.0

route id: 0

IPv6 Information:

flags: 2 (Non-Heartbeat Network)
bridged net ID: 1

Cluster Access Policy Information: (Not Defined)

Package information:

maximum configured packages: 150

package ID 36866:
package name: csisdbpkg1
package global flags: 5
(Package Switch Enabled)
(Package Local Switch Enabled)
(Configured Node Failover)
(Manual Failback)
package priority: (No Priority)
package run script: /etc/cmcluster/csisdbpkg1/control.sh
package run timeout: (No Timeout)
package halt script: /etc/cmcluster/csisdbpkg1/control.sh
package halt timeout: (No Timeout)
package successor halt timeout: (No Timeout)
package primary node: orange
package alternate node: apple
package subnet: 10.0.4.0

package services: (Not Defined)

package dependencies: (Not Defined)

package access policies: (Not Defined)

package ID 26883:
package name: mwtranspkg1
package global flags: 5
(Package Switch Enabled)
(Package Local Switch Enabled)
(Configured Node Failover)
(Manual Failback)
package priority: (No Priority)
package run script: /etc/cmcluster/mwtranspkg1/control.sh
package run timeout: (No Timeout)
package halt script: /etc/cmcluster/mwtranspkg1/control.sh
package halt timeout: (No Timeout)
package successor halt timeout: (No Timeout)
package primary node: orange
package alternate node: apple
package subnet: 10.0.4.0

package services:
service ID: 1
service name: ssys1
service halt timeout: 300 (seconds)
service fail fast: Disabled

package dependencies: (Not Defined)

package access policies: (Not Defined)

package ID 57089:
package name: mwpackage1
package global flags: 5
(Package Switch Enabled)
(Package Local Switch Enabled)
(Configured Node Failover)
(Manual Failback)
package priority: (No Priority)
package run script: /etc/cmcluster/mwdbpkg1/control.sh
package run timeout: (No Timeout)
package halt script: /etc/cmcluster/mwdbpkg1/control.sh
package halt timeout: (No Timeout)
package successor halt timeout: (No Timeout)
package primary node: orange
package alternate node: apple
package subnet: 10.0.4.0

package services: (Not Defined)

package dependencies: (Not Defined)

package access policies: (Not Defined)

Regards,
FrogIsDeaf
TTr
Honored Contributor

Re: Reboot after panic: SafetyTimer expired

How is the HB connected: direct/cross-over or LAN switch? And if it is a LAN switch, is it managed or isolated? You should configure the production LAN 67.2.3.2 as HB as well and use the third LAN interface too.
The messages above point towards a LAN disconnect for some reason. Did you check /var/adm/syslog/syslog.log for any LAN failures? Are the nodes fairly well patched with OS and SG specific patches?
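
To see at a glance how many heartbeat networks each node has, the cmviewconf text posted earlier can be scanned for interfaces flagged "(Heartbeat Network)". This is a rough sketch that assumes the flattened layout as pasted in this thread ("Node name:", "network interface name:", then a "flags:" line per interface); the function name `heartbeat_interfaces` and the trimmed sample are hypothetical and only for illustration.

```python
def heartbeat_interfaces(cmviewconf_text):
    """Map each node name to the interfaces flagged as heartbeat networks."""
    result = {}
    node = None
    iface = None
    for raw in cmviewconf_text.splitlines():
        line = raw.strip()
        if line.startswith("Node name:"):
            node = line.split(":", 1)[1].strip()
            result[node] = []
        elif line.startswith("network interface name:"):
            iface = line.split(":", 1)[1].strip()
        elif line.startswith("flags:") and node is not None and iface is not None:
            # "(Non-Heartbeat Network)" does not contain this exact string,
            # so only true heartbeat interfaces are recorded.
            if "(Heartbeat Network)" in line:
                result[node].append(iface)
            iface = None  # wait for the next interface block
    return result


# Trimmed excerpt of the output posted in this thread.
SAMPLE = """\
flags: 12 (single cluster lock)
Node name: orange
network interface name: lan0
flags: 5 (Heartbeat Network)
network interface name: lan2
flags: 4 (Non-Heartbeat Network)
network interface name: lan3
flags: 2 (Non-Heartbeat Network)
Node name: apple
network interface name: lan0
flags: 5 (Heartbeat Network)
network interface name: lan2
flags: 4 (Non-Heartbeat Network)
"""

hb = heartbeat_interfaces(SAMPLE)
for name, ifaces in hb.items():
    print(name, ifaces)
```

With the configuration posted above, each node shows only lan0 as a heartbeat network, so a single interruption on the 10.0.4.0 subnet is enough to stop all heartbeat traffic between the two nodes.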