Operating System - HP-UX
Joaquin_13
Advisor

Unexpected Server Shutdown..

We have experienced an unexpected shutdown on one of our servers. This server belongs to a ServiceGuard cluster. I don't know whether the cluster heartbeat is the main cause of the problem. Does anyone have an idea about the errors below? This is the syslog.log output from just before the server halted.


Jun 8 08:47:31 batch cmcld[1414]: Warning: cmcld process was unable to run for the last 9 seconds
Jun 8 08:47:31 batch cmcld[1414]: Communication to node online has been interrupted
Jun 8 08:47:31 batch cmcld[1414]: Node online may have died
Jun 8 08:47:31 batch cmcld[1414]: Attempting to form a new cluster
Jun 8 08:47:34 batch cmcld[1414]: 2 nodes have formed a new cluster, sequence #222
Jun 8 08:47:34 batch cmcld[1414]: The new active cluster membership is: online(id=1), batch(id=2)
Jun 8 08:47:43 batch cmcld[1414]: Warning: cmcld process was unable to run for the last 4 seconds
Jun 8 08:47:43 batch cmcld[1414]: Communication to node online has been interrupted
Jun 8 08:47:43 batch cmcld[1414]: Node online may have died
Jun 8 08:47:43 batch cmcld[1414]: Attempting to form a new cluster
Jun 8 08:47:46 batch cmcld[1414]: 2 nodes have formed a new cluster, sequence #223
Jun 8 08:47:46 batch cmcld[1414]: The new active cluster membership is: online(id=1), batch(id=2)
Jun 8 08:49:06 batch vmunix:
Jun 8 08:49:06 batch vmunix: SCSI: Request Timeout -- lbolt: 234378731, dev: 1f001100
Jun 8 08:49:06 batch vmunix: lbp->state: 4020
Jun 8 08:49:06 batch vmunix: lbp->offset: 2f8
Jun 8 08:49:06 batch vmunix: lbp->uPhysScript: 480000
Jun 8 08:49:06 batch vmunix: From most recent interrupt:
Jun 8 08:49:06 batch vmunix: ISTAT: 2a, SIST0: 90, SIST1: 00, DSTAT: 80, DSPS: 01e80000
Jun 8 08:49:06 batch vmunix: NCR chip register access history (most recent last): 134862816 accesses
Jun 8 08:49:06 batch vmunix: 50, ISTAT<-20
Jun 8 08:49:06 batch vmunix: 1694, ISTAT<-20
Jun 8 08:49:06 batch vmunix: 914, ISTAT<-20
Jun 8 08:49:06 batch vmunix: 896, ISTAT<-20
Jun 8 08:49:06 batch vmunix: 0, ISTAT<-20
Jun 8 08:49:06 batch vmunix: 6901, ISTAT<-20
Jun 8 08:49:06 batch vmunix: 764, ISTAT<-20
Jun 8 08:49:06 batch vmunix: 885, ISTAT<-20
Jun 8 08:49:06 batch vmunix: 890, ISTAT<-20
Jun 8 08:49:06 batch vmunix: 896, ISTAT<-20
Jun 8 08:49:06 batch vmunix: 0, ISTAT<-20
Jun 8 08:49:06 batch vmunix: 0, ISTAT<-20
Jun 8 08:49:06 batch vmunix: 7145, ISTAT<-20
Jun 8 08:49:06 batch vmunix: 2700, ISTAT<-20
Jun 8 08:49:06 batch vmunix: 58, ISTAT<-20
Jun 8 08:49:06 batch vmunix: 0, ISTAT<-20
Jun 8 08:49:06 batch vmunix: lsp: 285a4e00
Jun 8 08:49:06 batch vmunix: bp->b_dev: 1f001100
Jun 8 08:49:06 batch vmunix: scb->io_id: d3c725
Jun 8 08:49:06 batch vmunix: scb->cdb: 28 00 00 72 52 90 00 00 10 00
Jun 8 08:49:06 batch vmunix: lbolt_at_timeout: 234378531, lbolt_at_start: 234375531
Jun 8 08:49:06 batch vmunix: lsp->state: 205
Jun 8 08:49:06 batch vmunix: lbp->owner: 26238280
Jun 8 08:49:06 batch vmunix: bp->b_dev: 1f001500
Jun 8 08:49:06 batch vmunix: scb->io_id: d3c726
Jun 8 08:49:06 batch vmunix: scb->cdb: 28 00 00 72 52 90 00 00 10 00
Jun 8 08:49:06 batch vmunix: lbolt_at_timeout: 234375531, lbolt_at_start: 234375531
Jun 8 08:49:06 batch vmunix: lsp->state: 10d
Jun 8 08:49:06 batch vmunix:
Jun 8 08:49:06 batch vmunix: SCSI: Request Timeout -- lbolt: 234378731, dev: 1f001500
Jun 8 08:49:06 batch vmunix: lbp->state: 4020
Jun 8 08:49:06 batch vmunix: lbp->offset: ffffffff
Jun 8 08:49:06 batch vmunix: lbp->uPhysScript: 480000
Jun 8 08:49:06 batch vmunix: From most recent interrupt:
Jun 8 08:49:06 batch vmunix: ISTAT: 22, SIST0: 04, SIST1: 00, DSTAT: 80, DSPS: 00000006
Jun 8 08:49:06 batch vmunix: NCR chip register access history (most recent last): 10 accesses
Jun 8 08:49:06 batch vmunix: 6702, ISTAT: 22
Jun 8 08:49:06 batch vmunix: 4, SIST0: 04
Jun 8 08:49:06 batch vmunix: 2, SIST1: 00
Jun 8 08:49:06 batch vmunix: 2, DSTAT: 80
Jun 8 08:49:06 batch vmunix: 1, DSPS: 00000006
Jun 8 08:49:06 batch vmunix: 1, SCRATCHA: ff000840
Jun 8 08:49:06 batch vmunix: 3, DSP: 00480058
Jun 8 08:49:06 batch vmunix: 2, SCRATCHA1<-00
Jun 8 08:49:06 batch vmunix: 2, CTEST3<-04
Jun 8 08:49:06 batch vmunix: 0, STEST3<-82
Jun 8 08:49:06 batch vmunix: lsp: 26238280
Jun 8 08:49:06 batch vmunix: bp->b_dev: 1f001500
Jun 8 08:49:06 batch vmunix: scb->io_id: d3c726
Jun 8 08:49:06 batch vmunix: scb->cdb: 28 00 00 72 52 90 00 00 10 00
Jun 8 08:49:06 batch vmunix: lbolt_at_timeout: 234375531, lbolt_at_start: 234375531
Jun 8 08:49:06 batch vmunix: lsp->state: 10d
Jun 8 08:49:06 batch vmunix: lbp->owner: 26238280
Jun 8 08:49:06 batch vmunix: scratch_lsp: 0
Jun 8 08:49:06 batch vmunix: Pre-DSP script dump [16b97030]:
Jun 8 08:49:06 batch vmunix: 78344000 0000000a 78350800 00000000
Jun 8 08:49:06 batch vmunix: 0e000004 004803c0 80000000 00000000
Jun 8 08:49:06 batch vmunix: Script dump [16b97050]:
Jun 8 08:49:06 batch vmunix: 9f0b0000 00000006 0a000000 004803c8
Jun 8 08:49:06 batch vmunix: 721a0000 00000000 c0000004 0048035c
Jun 8 08:49:06 batch vmunix:
Jun 8 08:49:06 batch vmunix: SCSI: Abort Tag -- lbolt: 234378731, dev: 1f001500, io_id: d3c726
Jun 8 08:49:06 batch vmunix: LVM: vg[2]: pvnum=13 (dev_t=0x1f001500) is POWERFAILED
Jun 8 08:49:06 batch vmunix:
Jun 8 08:49:06 batch vmunix: SCSI: Abort Tag -- lbolt: 234378731, dev: 1f001100, io_id: d3c725
Jun 8 08:49:06 batch vmunix: LVM: vg[2]: pvnum=9 (dev_t=0x1f001100) is POWERFAILED
Jun 8 08:49:06 batch vmunix: DIAGNOSTIC SYSTEM WARNING:
Jun 8 08:49:06 batch vmunix: The diagnostic logging facility has started receiving excessive
Jun 8 08:49:06 batch vmunix: errors from the I/O subsystem. I/O error entries will be lost
Jun 8 08:49:06 batch vmunix: until the cause of the excessive I/O logging is corrected.
Jun 8 08:49:06 batch vmunix: If the DEMLOG daemon is not active, use the DIAGSYSTEM command
Jun 8 08:49:06 batch vmunix: in SYSDIAG to start it.
Jun 8 08:49:06 batch vmunix: If the DEMLOG daemon is active, use the LOGTOOL utility in SYSDIAG
Jun 8 08:49:06 batch vmunix: to determine which I/O subsystem is logging excessive errors.
Jun 8 08:49:06 batch vmunix: LVM: PV 9 has been returned to vg[2].
Jun 8 08:49:06 batch vmunix: LVM: PV 13 has been returned to vg[2].
Jun 8 08:49:25 batch cmcld[1414]: Warning: cmcld process was unable to run for the last 4 seconds
Jun 8 08:49:25 batch cmcld[1414]: Communication to node online has been interrupted
Jun 8 08:49:25 batch cmcld[1414]: Node online may have died
Jun 8 08:49:25 batch cmcld[1414]: Attempting to form a new cluster
Jun 8 08:49:28 batch vmunix: DIAGNOSTIC SYSTEM WARNING:
Jun 8 08:49:28 batch vmunix: The diagnostic logging facility is no longer receiving excessive
Jun 8 08:49:28 batch vmunix: errors from the I/O subsystem. 29 I/O error entries were lost.
Jun 8 08:49:28 batch cmcld[1414]: 2 nodes have formed a new cluster, sequence #224
Jun 8 08:49:28 batch cmcld[1414]: The new active cluster membership is: online(id=1), batch(id=2)
Patrick Wallek
Honored Contributor
Solution

Re: Unexpected Server Shutdown..

Here are the 3 messages that I see as key:

Jun 8 08:49:25 batch cmcld[1414]: Warning: cmcld process was unable to run for the last 4 seconds
Jun 8 08:49:25 batch cmcld[1414]: Communication to node online has been interrupted
Jun 8 08:49:25 batch cmcld[1414]: Node online may have died

The cmcld daemon could not run for 4 seconds, which is quite a long time. Since it couldn't run, it likely could not communicate with the other node in the cluster. In a two-node cluster, one machine will then TOC so that a valid cluster stays up. Which machine TOCs depends on which machine grabs the lock disk first (if you're using a lock disk).
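If you want to confirm that the halted node actually TOC'd rather than losing power, a TOC normally leaves a crash dump behind. Something like the following should show it (the crash.0 directory name is just an example; use whatever savecrash created on your system):

# Check whether the node wrote a crash dump when it went down
ls /var/adm/crash

# The INDEX file records the panic string; for a ServiceGuard-initiated
# TOC it typically mentions the safety timer
grep panic /var/adm/crash/crash.0/INDEX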

Also of concern is that you have a LOT of disk errors. These may or may not be related. It appears that at least 2 PVs were temporarily inaccessible. I would investigate those errors as well; some starting points are below.
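For example (a sketch; the device file and volume group name are placeholders -- map dev 0x1f001100 and 0x1f001500 to the correct c#t#d# device files on your system):

# List all SCSI disks with hardware paths and their current state
ioscan -fnC disk

# Query a suspect disk directly
diskinfo /dev/rdsk/c1t1d0

# Check the state of the physical volumes in the affected volume group
vgdisplay -v /dev/vg02 | grep -e "PV Name" -e "PV Status"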
Steven E. Protter
Exalted Contributor

Re: Unexpected Server Shutdown..

A disk the cluster needs has died.

I believe this is the source of most of the problems.

Solve the disk I/O problems, restore any lost data, and try again.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Devesh Pant_1
Esteemed Contributor

Re: Unexpected Server Shutdown..

I would like some more details on how the cluster is configured. Do you have a lock disk or a quorum server?

Re-formation of the cluster occurs under the following conditions (not a complete list):

1. An SPU or network failure was detected on an active node.
2. An inactive node wants to join the cluster. The cluster manager daemon has been started on that node.
3. A node has been added to or deleted from the cluster configuration.
4. The system administrator halted a node.
5. A node halts because of a package failure.
6. A node halts because of a service failure.
7. Heavy network traffic prohibited the heartbeat signal from being received by the cluster.
8. The heartbeat network failed, and another network is not configured to carry heartbeat.

You might want to check the uptime on both servers to see whether one of them actually rebooted, or whether this is just a heartbeat issue. It looks like the daemon becomes unreachable, but the cluster then re-forms with the same two nodes, online and batch.
Another area to look into is the cluster configuration file, specifically the values of the timeout parameters. The package failover scripts may also be worth verifying.
Use the commands cmviewconf -v and cmviewcl -v, and look into /etc/cmcluster/package/packagename.log; see the example below.
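For instance, to pull just the timing values out of the configuration (a sketch; packagename is a placeholder, as above):

# Timing parameters from the binary cluster configuration
cmviewconf -v | grep -i -e interval -e timeout

# Current cluster, node, and package status
cmviewcl -v

# Package control log (placeholder path from above)
more /etc/cmcluster/package/packagename.log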

thanks
Devesh

Joaquin_13
Advisor

Re: Unexpected Server Shutdown..

I have this log from the other node that the prior server is clustered with.

Jun 9 05:16:23 online cmcld[192]: Timed out node batch. It may have failed.
Jun 9 05:16:23 online cmcld[192]: Attempting to adjust cluster membership
Jun 9 05:16:26 online cmcld[192]: Obtaining Cluster Lock
Jun 9 05:16:27 online vmunix:
Jun 9 05:16:27 online vmunix: SCSI: Reset requested from above -- lbolt: 742672142, bus: 0
Jun 9 05:16:27 online cmcld[192]: Turning off safety time protection since the cluster
Jun 9 05:16:27 online cmcld[192]: now consists of a single node. If ServiceGuard
Jun 9 05:16:27 online cmcld[192]: fails, this node will not automatically halt
Jun 9 05:16:39 online vmunix:
Jun 9 05:16:39 online vmunix: SCSI: Resetting SCSI -- lbolt: 742672242, bus: 0
Jun 9 05:16:39 online vmunix:
Jun 9 05:16:39 online vmunix: SCSI: Reset detected -- lbolt: 742672242, bus: 0
Jun 9 05:16:39 online cmcld[192]: Did not receive vote from node batch in time
Jun 9 05:16:51 online cmcld[192]: 1 nodes have formed a new cluster, sequence #226
Jun 9 05:16:51 online cmcld[192]: The new active cluster membership is: online(id=1)


I remember HP saying that we should try to adjust the server's TOC timing (the recommended value is 8 seconds). They already did a diagnostic check on this and did not detect any hardware error. I'm just wondering what causes this error; I think it may be the LAN connection plus the network card (NIC). Below is the configuration of our cluster:

Cluster information:

cluster name: BSCS_RTX_CLUSTER
version: 0
flags: 12 (single cluster lock)
heartbeat interval: 1.00 (seconds)
node timeout: 2.00 (seconds)
heartbeat connection timeout: 4.00 (seconds)
auto start timeout: 600.00 (seconds)
network polling interval: 2.00 (seconds)
first lock vg name: /dev/vgbscs
second lock vg name: (not configured).
.....
If such a configuration change is needed, do you know where I can configure these settings? Is it just a config file, or is an application needed? And is there any impact on our system, DB, and applications if we miscalculate the configuration?

thanks & regards
Devesh Pant_1
Esteemed Contributor

Re: Unexpected Server Shutdown..

As I mentioned in my earlier post, use the cmviewconf command to view the configuration.
These values can be found in the cluster ASCII file, /etc/cmcluster/clustername

To make changes, first shut down the cluster with:
cmhaltcl -f

After making your changes, run:
cmcheckconf -k -v -C /etc/cmcluster/clustername

Verify that there are no errors, and then apply the configuration:
cmapplyconf -k -v -C /etc/cmcluster/clustername

Bring up the cluster after that:
cmruncl

Please also check the status of the cluster with the commands cmviewcl and cmviewcl -v. The whole cycle is sketched below.
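As a sketch of the whole cycle (assuming the cluster ASCII file is /etc/cmcluster/clustername as above; note that on ServiceGuard releases of this vintage the timing parameters in the ASCII file are expressed in microseconds, so 8 seconds would be 8000000 -- verify against the comments in your own file before applying):

cmhaltcl -f                                        # halt the cluster on all nodes

# Edit the timing parameters, for example:
#   HEARTBEAT_INTERVAL  1000000    # 1 second
#   NODE_TIMEOUT        8000000    # 8 seconds (the value HP suggested)
vi /etc/cmcluster/clustername

cmcheckconf -k -v -C /etc/cmcluster/clustername    # verify; expect no errors
cmapplyconf -k -v -C /etc/cmcluster/clustername    # distribute the new config
cmruncl                                            # start the cluster
cmviewcl -v                                        # confirm both nodes rejoin

If the timeouts are set too low for your hardware and network, the symptom is exactly what you saw: spurious cluster re-formations and, in the worst case, a TOC. Setting them too high just means slower failure detection, so erring slightly high is the safer direction.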