05-24-2010 05:10 AM
Reboot after panic: SafetyTimer expired
Here is the output from cmfmtfr frdump.cmcld.9.
I have checked the system logs and found no resource shortage that would explain the safety timer expiring.
=======
May 23 11:53:08:1:CLM:09: Frequent Action - HB from node apple
May 23 11:53:08:1:CLM:09: Frequent Action - Updated safety time to 79361781
May 23 11:53:08:1:CLM:09: Frequent Action - HB to node apple, 64199
May 23 11:53:09:1:CLM:09: Frequent Action - HB from node apple
May 23 11:53:09:1:CLM:09: Frequent Action - Updated safety time to 79361882
May 23 11:53:09:1:CLM:09: Frequent Action - HB to node apple, 64200
May 23 11:53:10:1:CLM:09: Frequent Action - HB from node apple
May 23 11:53:10:1:CLM:09: Frequent Action - Updated safety time to 79361983
May 23 11:53:10:1:CLM:09: Frequent Action - HB to node apple, 64201
May 23 11:53:11:1:CLM:09: Frequent Action - HB from node apple
May 23 11:53:11:1:CLM:09: Frequent Action - Updated safety time to 79362084
May 23 11:53:11:1:CLM:09: Frequent Action - HB to node apple, 64202
May 23 11:53:14:4:LOC:07: Event - Handling reading from local connection on fd 20.
May 23 11:53:14:2:STA:06: Action - sdb request 7 on port 8
May 23 11:53:14:2:STA:06: Action - srv_sdb_set_status_private: Set status 1 0/1/2/0 for client=4538 (size 124)
May 23 11:53:14:2:SDB:06: Action - st_set_status: Set status entry for 0/1/2/0 of type 1 on node_id 1.
May 23 11:53:14:2:SDB:06: Action - st_set_status: valuesize = 124, historysize = 0.
May 23 11:53:14:4:LOC:07: Event - Handling reading from local connection on fd 20.
May 23 11:53:14:2:STA:06: Action - sdb request 7 on port 8
May 23 11:53:14:2:STA:06: Action - srv_sdb_set_status_private: Set status 1 0/4/1/0/6/0 for client=4538 (size 124)
May 23 11:53:14:2:SDB:06: Action - st_set_status: Set status entry for 0/4/1/0/6/0 of type 1 on node_id 1.
May 23 11:53:14:2:SDB:06: Action - st_set_status: valuesize = 124, historysize = 0.
May 23 11:53:14:4:LOC:07: Event - Handling reading from local connection on fd 20.
May 23 11:53:14:2:STA:06: Action - sdb request 7 on port 8
May 23 11:53:14:2:STA:06: Action - srv_sdb_set_status_private: Set status 4 10.0.4.0 for client=4538 (size 76)
May 23 11:53:14:2:SDB:06: Action - st_set_status: Set status entry for 10.0.4.0 of type 4 on node_id 1.
May 23 11:53:14:2:SDB:06: Action - st_set_status: valuesize = 76, historysize = 0.
May 23 11:53:14:2:PKG:06: Action - pm_subnet_callback: Check for subnet 10.0.4.0 with status of 1
May 23 11:53:14:2:PKG:06: Action - pm_subnet_check: Check for package 36866 with subnet status of 1
May 23 11:53:14:3:PKG:06: Action - pm_subnet_check: subnet down, notify not eligible
May 23 11:53:14:2:PKG:06: Action - pm_subnet_check: Check for package 26883 with subnet status of 1
May 23 11:53:14:3:PKG:06: Action - pm_subnet_check: subnet down, notify not eligible
May 23 11:53:14:2:PKG:06: Action - pm_subnet_check: Check for package 57089 with subnet status of 1
May 23 11:53:14:3:PKG:06: Action - pm_subnet_check: subnet down, notify not eligible
May 23 11:53:14:4:LOC:07: Event - Handling reading from local connection on fd 20.
May 23 11:53:14:2:STA:06: Action - sdb request 7 on port 8
May 23 11:53:14:2:STA:06: Action - srv_sdb_set_status_private: Set status 4 10.0.4.0 for client=4538 (size 76)
May 23 11:53:14:2:SDB:06: Action - st_set_status: Set status entry for 10.0.4.0 of type 4 on node_id 1.
May 23 11:53:14:2:SDB:06: Action - st_set_status: valuesize = 76, historysize = 0.
May 23 11:53:14:2:SDB:06: Action - Status entry value for 10.0.4.0 of type 4 on node_id 1 did not change.
May 23 11:53:14:4:LOC:07: Event - Handling reading from local connection on fd 20.
May 23 11:53:14:2:STA:06: Action - sdb request 7 on port 8
May 23 11:53:14:2:STA:06: Action - srv_sdb_set_status_private: Set status 1 0/1/2/0 for client=4538 (size 124)
May 23 11:53:14:2:SDB:06: Action - st_set_status: Set status entry for 0/1/2/0 of type 1 on node_id 1.
May 23 11:53:14:2:SDB:06: Action - st_set_status: valuesize = 124, historysize = 0.
May 23 11:53:14:2:SDB:06: Action - Status entry value for 0/1/2/0 of type 1 on node_id 1 did not change.
May 23 11:53:14:4:LOC:07: Event - Handling reading from local connection on fd 20.
May 23 11:53:14:2:STA:06: Action - sdb request 7 on port 8
May 23 11:53:14:2:STA:06: Action - srv_sdb_set_status_private: Set status 1 0/4/1/0/6/0 for client=4538 (size 124)
May 23 11:53:14:2:SDB:06: Action - st_set_status: Set status entry for 0/4/1/0/6/0 of type 1 on node_id 1.
May 23 11:53:14:2:SDB:06: Action - st_set_status: valuesize = 124, historysize = 0.
May 23 11:53:17:0:CLM:08: Event - Timed out node apple. It may have failed.
May 23 11:53:17:2:CLM:08: Action - apple: eligible=1, received=0
May 23 11:53:17:2:CLM:08: Action - apple: no_st_update, ready_st=1, first_hb_recvd=1
May 23 11:53:17:2:CLM:08: External error - Detected failure event for node apple
May 23 11:53:17:1:UTL:08: Action - Automatic dump request is issued.
May 23 11:53:17:2:UTL:08: Action - fr_auto_dump_cmcld() queued an event
May 23 11:53:17:1:CLM:08: Action - Removing node apple from the running cluster
May 23 11:53:17:2:CLM:08: Action - Disconnecting HB comm to node apple
May 23 11:53:17:3:REM:08: Action - Config status for node 2 in service 2 is DISCONNECT
May 23 11:53:17:3:REM:08: Action - Inbound to 10.0.4.3 at node 2 in service 2 is CLOSING
May 23 11:53:17:3:REM:08: Action - Outbound to 10.0.4.3 at node 2 in service 2 is CLOSING
May 23 11:53:17:3:REM:08: Action - Connection to 10.0.4.3 at node 2 in service 2 is CLOSING
May 23 11:53:17:3:REM:08: Action - Node 2 in service 2 is CLOSING
May 23 11:53:17:2:UTL:06: Action - fr_dump_cmcld_event_handler() get request
May 23 11:53:17:2:CLM:08: Action - Decrementing old votes to 1
May 23 11:53:17:2:CLM:08: Action - Changed apple's cm_state from RUNNING to RECONFIG
May 23 11:53:17:2:CLM:08: Action - Removing node apple from heartbeat list
May 23 11:53:17:2:CLM:08: Action - Cancelled HB timers for node 2
May 23 11:53:17:1:CLM:08: Action - Got exactly 50% of the votes: 1 out of 2 last active nodes.
May 23 11:53:17:0:CLM:08: Action - Attempting to adjust cluster membership
May 23 11:53:17:0:CLM:08: Action - Beginning standard partial election
May 23 11:53:17:1:CLM:08: Action - Safety time set for 68.96 seconds from now
May 23 11:53:17:2:CLM:08: Action - Changed inbapplive1's cm_state from 5 to RECONFIG
May 23 11:53:17:2:SDB:08: Action - st_set_status: Set status entry for CM_STATUS_NAME of type 3 on node_id 1.
May 23 11:53:17:2:SDB:08: Action - st_set_status: valuesize = 104, historysize = 0.
May 23 11:53:17:2:UTL:08: Action - Starting sync timer of 63.960000 seconds
May 23 11:53:17:2:CLM:08: Action - Entering CANDIDATE state
May 23 11:53:17:1:CLM:08: Action - Did not receive all votes: 1 out of 2
May 23 11:53:17:1:CLM:08: Action - All votes (100%) are required at this point.
May 23 11:53:17:2:CLM:08: Action - Starting election timer of 7000000 usec
May 23 11:53:17:1:UTL:06: Action - Automatic dump is scheduled after 1000000 usec.
May 23 11:53:17:2:STA:06: Action - srv_sdb_process_callbacks: Received callbk for type(3), name(CM_STATUS_NAME), priority(25).
May 23 11:53:17:2:STA:06: Action - srv_sdb_process_sync_callbacks: No client is registered for 3 CM_STATUS_NAME at 25.
May 23 11:53:17:2:STA:06: Action - sdb async callbacks for 3
May 23 11:53:17:3:STA:06: Action - sdb client 4538 port 8
May 23 11:53:17:3:REM:09: Action - Inbound to 10.0.4.3 at node 2 in service 2 is NONE
May 23 11:53:17:3:REM:09: Action - Outbound to 10.0.4.3 at node 2 in service 2 is NONE
May 23 11:53:17:3:REM:09: Action - Connection to 10.0.4.3 at node 2 in service 2 is NONE
May 23 11:53:17:3:REM:09: Action - Busy conns is 0 for node 2 in service 2
May 23 11:53:17:3:REM:09: Action - Node 2 in service 2 is DROPPING
May 23 11:53:17:3:REM:09: Action - Node 2 in service 2 is DROPPED
May 23 11:53:18:2:UTL:08: Action - FR: start dump
May 23 11:53:18:3:UTL:08: Action - Automatic dump is not scheduled now.
Regards,
Rahul
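To make the timing in the log above easier to see, here is a minimal Python sketch that measures the gap between the last heartbeat received from a node and the "Timed out node" event. The line layout is inferred only from the excerpt in this post, the year 2010 is assumed from the post date (the log lines carry no year), and the helper names parse and heartbeat_gap are hypothetical, not part of any Serviceguard tool.

```python
import re
from datetime import datetime

# A few lines copied from the cmfmtfr output in this post.
SAMPLE = """\
May 23 11:53:10:1:CLM:09: Frequent Action - HB from node apple
May 23 11:53:11:1:CLM:09: Frequent Action - HB from node apple
May 23 11:53:11:1:CLM:09: Frequent Action - HB to node apple, 64202
May 23 11:53:14:2:PKG:06: Action - pm_subnet_callback: Check for subnet 10.0.4.0 with status of 1
May 23 11:53:17:0:CLM:08: Event - Timed out node apple. It may have failed.
"""

# Assumed layout: "<Mon> <day> <HH:MM:SS>:<level>:<module>:<id>: <message>"
LINE_RE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d):\d+:\w+:\d+: (.*)$")


def parse(text, year=2010):
    """Return (timestamp, message) pairs; the year is an assumption."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip anything that does not look like a log line
        stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
        events.append((stamp, m.group(2)))
    return events


def heartbeat_gap(events, node):
    """Seconds between the last HB received from node and its timeout event."""
    last_hb = None
    for stamp, msg in events:
        if msg.endswith(f"HB from node {node}"):
            last_hb = stamp
        elif f"Timed out node {node}" in msg and last_hb is not None:
            return (stamp - last_hb).total_seconds()
    return None


print(heartbeat_gap(parse(SAMPLE), "apple"))  # 6.0
```

On these lines the gap is 6 seconds: the last heartbeat from apple is logged at 11:53:11 and the timeout at 11:53:17, which matches the 6.00-second node timeout in the cmviewconf output posted later in this thread. That only shows heartbeats stopped arriving; it does not say why.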
05-24-2010 05:25 AM
Re: Reboot after panic: SafetyTimer expired
Was the heartbeat network interrupted? This looks like one cluster node went TOC to avoid data corruption.
SEP
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
05-24-2010 05:34 AM
Re: Reboot after panic: SafetyTimer expired
05-24-2010 05:43 AM
Re: Reboot after panic: SafetyTimer expired
Many thanks for such a quick reply. I just talked with one of our networking folks,
and they deny any activity at their end. Also, another cluster, though in a different VLAN, doesn't show similar logs.
Regards,
05-24-2010 06:05 AM
Re: Reboot after panic: SafetyTimer expired
'cmscancl -n node -o /tmp/outfile' will perform linkloop tests on all NICs for all nodes. Note that MCSG is MAC-address driven, not IP-address driven.
'cmreadlog /var/opt/cmon/cmond.d', and,
'cmreadlog /var/opt/sgmgr/#######mgr.log' if you have a later version of MCSG. These commands and logs were not available in earlier versions.
Also refer to 'cmgetconf' to get the current cluster configuration from within the kernel (as opposed to using the cluster.ascii file). 'cmviewconf' is also useful in troubleshooting.
05-24-2010 06:40 AM
Re: Reboot after panic: SafetyTimer expired
I want to know whether the cluster configuration, network, etc. could have caused this.
One of the nodes went down unexpectedly.
As far as I can make out, the HBAs are OK.
Hello Michael,
linkloop between the nodes is OK.
I have the latest version of SG, but there is no /var/opt/cmon/cmond.d
and /var/opt/sgmgr/hpsmh/log is zero bytes.
The cmviewconf output is attached.
Regards,
05-24-2010 06:49 AM
Re: Reboot after panic: SafetyTimer expired
Cluster information:
cluster name: mwcluster1
version: 0
flags: 12 (single cluster lock)
heartbeat interval: 1.00 (seconds)
node timeout: 6.00 (seconds)
heartbeat connection timeout: 0.00 (seconds)
io timeout extension: 0.00 (seconds)
auto start timeout: 600.00 (seconds)
network polling interval: 2.00 (seconds)
network failure detection: INOUT
first lock vg name: /dev/vg02
second lock vg name: (not configured)
qs host: (not configured)
Cluster Node information:
Node ID 2:
Node name: orange
first lock pv name: /dev/dsk/c10t0d1
first lock disk interface type: fcd_vbus
Network ID 1:
ppa: 0
old_ppa: 0
mac addr: 0x001cc4fca46d
hardware path: 0/1/2/0
network interface name: lan0
IPv4 Information:
subnet: 10.0.4.0
subnet mask: 255.255.255.0
ip address: 10.0.4.3
route id: 0
IPv6 Information:
flags: 5 (Heartbeat Network)
bridged net ID: 1
Network ID 3:
ppa: 2
old_ppa: 0
mac addr: 0x00215a9d6008
hardware path: 0/2/1/0/6/0
network interface name: lan2
IPv4 Information:
subnet: 67.0.0.0
subnet mask: 255.0.0.0
ip address: 67.2.3.2
route id: 0
IPv6 Information:
flags: 4 (Non-Heartbeat Network)
bridged net ID: 2
Network ID 2:
ppa: 3
old_ppa: 0
mac addr: 0x00215a9d6014
hardware path: 0/4/1/0/6/0
network interface name: lan3
IPv4 Information:
subnet: 0.0.0.0
subnet mask: 0.0.0.0
ip address: 0.0.0.0
route id: 0
IPv6 Information:
flags: 2 (Non-Heartbeat Network)
bridged net ID: 1
Node ID 1:
Node name: apple
first lock pv name: /dev/dsk/c14t0d1
first lock disk interface type: fcd_vbus
Network ID 1:
ppa: 0
old_ppa: 0
mac addr: 0x001cc4fca4ad
hardware path: 0/1/2/0
network interface name: lan0
IPv4 Information:
subnet: 10.0.4.0
subnet mask: 255.255.255.0
ip address: 10.0.4.2
route id: 0
IPv6 Information:
flags: 5 (Heartbeat Network)
bridged net ID: 1
Network ID 3:
ppa: 2
old_ppa: 0
mac addr: 0x00215a9d6017
hardware path: 0/2/1/0/6/0
network interface name: lan2
IPv4 Information:
subnet: 67.0.0.0
subnet mask: 255.0.0.0
ip address: 67.2.3.1
route id: 0
IPv6 Information:
flags: 4 (Non-Heartbeat Network)
bridged net ID: 2
Network ID 2:
ppa: 3
old_ppa: 0
mac addr: 0x00215a9d6019
hardware path: 0/4/1/0/6/0
network interface name: lan3
IPv4 Information:
subnet: 0.0.0.0
subnet mask: 0.0.0.0
ip address: 0.0.0.0
route id: 0
IPv6 Information:
flags: 2 (Non-Heartbeat Network)
bridged net ID: 1
Cluster Access Policy Information: (Not Defined)
Package information:
maximum configured packages: 150
package ID 36866:
package name: csisdbpkg1
package global flags: 5
(Package Switch Enabled)
(Package Local Switch Enabled)
(Configured Node Failover)
(Manual Failback)
package priority: (No Priority)
package run script: /etc/cmcluster/csisdbpkg1/control.sh
package run timeout: (No Timeout)
package halt script: /etc/cmcluster/csisdbpkg1/control.sh
package halt timeout: (No Timeout)
package successor halt timeout: (No Timeout)
package primary node: orange
package alternate node: apple
package subnet: 10.0.4.0
package services: (Not Defined)
package dependencies: (Not Defined)
package access policies: (Not Defined)
package ID 26883:
package name: mwtranspkg1
package global flags: 5
(Package Switch Enabled)
(Package Local Switch Enabled)
(Configured Node Failover)
(Manual Failback)
package priority: (No Priority)
package run script: /etc/cmcluster/mwtranspkg1/control.sh
package run timeout: (No Timeout)
package halt script: /etc/cmcluster/mwtranspkg1/control.sh
package halt timeout: (No Timeout)
package successor halt timeout: (No Timeout)
package primary node: orange
package alternate node: apple
package subnet: 10.0.4.0
package services:
service ID: 1
service name: ssys1
service halt timeout: 300 (seconds)
service fail fast: Disabled
package dependencies: (Not Defined)
package access policies: (Not Defined)
package ID 57089:
package name: mwpackage1
package global flags: 5
(Package Switch Enabled)
(Package Local Switch Enabled)
(Configured Node Failover)
(Manual Failback)
package priority: (No Priority)
package run script: /etc/cmcluster/mwdbpkg1/control.sh
package run timeout: (No Timeout)
package halt script: /etc/cmcluster/mwdbpkg1/control.sh
package halt timeout: (No Timeout)
package successor halt timeout: (No Timeout)
package primary node: orange
package alternate node: apple
package subnet: 10.0.4.0
package services: (Not Defined)
package dependencies: (Not Defined)
package access policies: (Not Defined)
Regards,
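For anyone reading the cmviewconf output above, here is a minimal Python sketch that lists which interfaces are flagged as heartbeat networks on each node. The sample text is trimmed to the relevant lines from this post, the "key: value" layout is inferred from that output alone, and heartbeat_interfaces is a hypothetical helper name, not a Serviceguard command.

```python
from collections import defaultdict

# Relevant lines trimmed from the cmviewconf output in this post.
SAMPLE = """\
flags: 12 (single cluster lock)
Node name: orange
network interface name: lan0
subnet: 10.0.4.0
subnet mask: 255.255.255.0
flags: 5 (Heartbeat Network)
network interface name: lan2
subnet: 67.0.0.0
flags: 4 (Non-Heartbeat Network)
network interface name: lan3
subnet: 0.0.0.0
flags: 2 (Non-Heartbeat Network)
Node name: apple
network interface name: lan0
subnet: 10.0.4.0
flags: 5 (Heartbeat Network)
network interface name: lan2
subnet: 67.0.0.0
flags: 4 (Non-Heartbeat Network)
network interface name: lan3
subnet: 0.0.0.0
flags: 2 (Non-Heartbeat Network)
"""


def heartbeat_interfaces(text):
    """Map node name -> list of (interface, subnet) flagged as heartbeat."""
    result = defaultdict(list)
    node = iface = subnet = None
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # not a "key: value" line
        key, value = key.strip(), value.strip()
        if key == "Node name":
            node, iface, subnet = value, None, None
        elif key == "network interface name":
            iface, subnet = value, None
        elif key == "subnet":
            subnet = value
        elif key == "flags" and node and iface:
            # Cluster-level flags are skipped because no interface is open yet.
            if "(Heartbeat Network)" in value:
                result[node].append((iface, subnet))
            iface = None
    return dict(result)


for name, nets in heartbeat_interfaces(SAMPLE).items():
    print(name, nets)
```

If the output is read correctly, lan0 on subnet 10.0.4.0 is the only interface flagged as a heartbeat network on both orange and apple, so an interruption on that one link would be enough to stop heartbeats between the nodes.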
05-24-2010 10:07 AM
Re: Reboot after panic: SafetyTimer expired
The messages above point towards a LAN disconnect for some reason. Did you check /var/adm/syslog/syslog.log for any LAN failures? Are the nodes fairly well patched, with both OS and SG-specific patches?