Time synchronization in a cluster

Karl-Heinz Kwarda · ‎03-19-2009

Hi,

I have the following problem in my cluster (built from two Alpha servers, node A and node B):
- NTP is enabled on both nodes
- both nodes should get the time from our timeserver.
- node B gets the time, node A does not
I now have a time difference of about 5 minutes between node A and node B.

What did I do wrong, or what can I do to get rid of the problem?
Any hint is greatly appreciated.

Regards,
Karl-Heinz

Oswald Knoppers_1 · ‎03-19-2009

You can do the following:

$ @sys$manager:tcpip$define_commands

and then

$ ntpq -pn

and

$ ntptrace -n

Then check for any differences.

Oswald

marsh_1 · ‎03-19-2009

hi,

are there any differences in the config files in sys$specific:[tcpip$ntp] ?

Karl-Heinz Kwarda · ‎03-19-2009

Oswald and Mike,

the TCPIP$NTP.CONF files are identical on both systems (see attachment).

The results of the commands mentioned are:
$ NODE_B>ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*10.20.96.171    .hopf.           1 u  669 1024  377   10.697   -2.318   0.618

$ NODE_B>ntptrace -n
127.0.0.1: stratum 2, offset 0.000000, synch distance 0.03075
10.20.96.171: stratum 1, offset -0.000395, synch distance 0.00224, refid 'hopf'

$ NODE_A>ntpq -pn
No association ID's returned

$ NODE_A>ntptrace -n
127.0.0.1: stratum 16, offset -0.000488, synch distance 1.19597
0.0.0.0: *Not Synchronized*

Regards, Karl-Heinz
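
For readers less familiar with the `ntpq -pn` columns: the leading `*` marks the peer currently selected for synchronization, `reach` is an 8-bit register printed in octal (377 means the last eight polls were all answered), and `offset` is in milliseconds. Stratum 16, as reported by node A, means the local clock is unsynchronized. The following is a minimal Python sketch of reading one such peer line. It is not part of the original posts; the helper name `describe_peer` is hypothetical, and it assumes numeric IPv4 addresses as produced by the `-n` switch.

```python
# Characters ntpq may put in front of the remote address (the "tally code").
TALLY_CHARS = "*+-#xo."


def describe_peer(line):
    """Interpret one peer line of `ntpq -pn` output (hypothetical helper).

    Expected column order:
    remote refid st t when poll reach delay offset jitter
    """
    fields = line.split()
    remote = fields[0]
    tally = ""
    # Assumption: with -n the address starts with a digit, so any leading
    # character from TALLY_CHARS is the tally code and not part of the name.
    if remote[0] in TALLY_CHARS:
        tally, remote = remote[0], remote[1:]
    reach = int(fields[6], 8)  # reach is displayed in octal
    return {
        "remote": remote,
        "selected": tally == "*",  # '*' = current synchronization source
        "stratum": int(fields[2]),
        "polls_answered": bin(reach).count("1"),  # out of the last 8 polls
        "offset_ms": float(fields[8]),
    }


if __name__ == "__main__":
    print(describe_peer(
        "*10.20.96.171 .hopf. 1 u 669 1024 377 10.697 -2.318 0.618"))
```

Node A printed "No association ID's returned" instead of a peer line, so there is nothing to parse there: its daemon had no configured associations at all, which is a different symptom from a peer that is configured but unreachable.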

Oswald Knoppers_1 · ‎03-19-2009

So node A apparently cannot contact 10.20.96.171. Can you ping this address from node A? Or do a:

$ traceroute -n 10.20.96.171

Oswald

marsh_1 · ‎03-19-2009

hi,

any firewalls in the way ?

Karl-Heinz Kwarda · ‎03-19-2009

Hi,

no firewalls, traceroute works :
$ NODE_A>traceroute -n 10.20.96.171
traceroute to 10.20.96.171 (10.20.96.171): 1-30 hops, 38 byte packets
1 193.26.202.65 2.93 ms 1.95 ms 1.95 ms
2 193.26.200.1 9.76 ms 9.76 ms 9.76 ms
3 193.26.203.10 10.7 ms 9.76 ms 10.7 ms
4 10.1.200.250 10.7 ms 11.7 ms 10.7 ms
5 10.20.96.171 11.7 ms 10.7 ms 12.6 ms

Regards,
Karl-Heinz

marsh_1 · ‎03-19-2009

Can node A resolve the host name mentioned in the conf file, and is there anything in the log?

Karl-Heinz Kwarda · ‎03-19-2009

This is the output of a ping command:
$ NODE_A>tcpip ping timenet.eur.ad.sag
PING timenet.eur.ad.sag (10.20.96.171): 56 data bytes
64 bytes from 10.20.96.171: icmp_seq=0 ttl=124 time=17 ms
64 bytes from 10.20.96.171: icmp_seq=1 ttl=124 time=14 ms
64 bytes from 10.20.96.171: icmp_seq=2 ttl=124 time=33 ms
64 bytes from 10.20.96.171: icmp_seq=3 ttl=124 time=12 ms

----timenet.eur.ad.sag PING Statistics----
4 packets transmitted, 4 packets received, 0% packet loss
round-trip (ms) min/avg/max = 12/19/33 ms

Looks fine, which log file do you mean ?
Karl-Heinz

Oswald Knoppers_1 · ‎03-19-2009

sys$specific:[tcpip$ntp]tcpip$ntp_run.log is the logfile.

Oswald

Karl-Heinz Kwarda · ‎03-19-2009

Oswald,

this is strange: I restarted the NTP service at 12:09 h but didn't get a new entry in the log file. This is what the log looks like:
$ NODE_A>type TCPIP$NTP_RUN.LOG;1562
17 Mar 10:53:00 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
17 Mar 11:53:05 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
17 Mar 12:53:09 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
17 Mar 13:53:14 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
17 Mar 14:53:19 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
17 Mar 15:53:23 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
17 Mar 16:53:27 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
17 Mar 17:53:32 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
17 Mar 18:53:36 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
17 Mar 19:53:41 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
17 Mar 20:53:46 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
17 Mar 21:53:51 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
17 Mar 22:53:57 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
17 Mar 23:54:02 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
18 Mar 00:54:07 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
18 Mar 01:54:11 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
18 Mar 02:54:15 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
18 Mar 03:54:18 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
18 Mar 04:54:22 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
18 Mar 05:54:26 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
18 Mar 06:54:31 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
18 Mar 07:54:34 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488
18 Mar 08:54:39 ntp[539747492]: offset: 0.000000 sec freq: 20.544 ppm poll: 16 sec error: 0.000488

The directory SYS$SPECIFIC:[TCPIP$NTP] contains the following files:
$ NODE_A>dir/date

Directory SYS$SPECIFIC:[TCPIP$NTP]

LOGIN.COM;1               16-NOV-2004 09:21:59.39
NTP33e6c0.;1              19-MAR-2009 12:09:04.82
NTP34fe84.;1              18-MAR-2009 13:00:51.59
TCPIP$NTP.CONF;4          18-MAR-2009 13:00:06.76
TCPIP$NTP.DRIFT;11681     19-MAR-2009 12:02:36.56
TCPIP$NTP.TEMPLATE;1      16-NOV-2004 09:21:59.43
TCPIP$NTP_RUN.LOG;1562    17-MAR-2009 10:53:00.54
TCPIP$NTP_RUN.LOG;1561    16-MAR-2009 10:51:09.33
TCPIP$NTP_RUN.LOG;1560    15-MAR-2009 10:49:16.25
TCPIP$NTP_RUN.LOG;1559    14-MAR-2009 10:47:30.91
TCPIP$NTP_RUN.LOG;1558    13-MAR-2009 10:45:34.38

Total of 11 files.

Karl-Heinz

Hoff · ‎03-19-2009

Run a few Google searches for /"not synchronized" ntp/ and such; there are various potential causes and, given that the ntp client is based on the UDel (University of Delaware) code, most of them are common across platforms. Work through some of these. That search will be faster than ITRC.

There is also a section within the ntp documentation (for the TCP/IP Services product and I presume similar sections exist in the documentation of other IP stacks) on when the box will accept or reject arriving ntp time. That's fairly general.

Ensure the ntp on the box is patched to current.

Confirm that there are no other time-related services active on the troublesome node; that DTSS isn't active. (Depending on your OpenVMS VAX or OpenVMS Alpha or OpenVMS I64 release, there is a bootstrap-time logical name around that shuts off DTSS before it starts.)

I'm mildly surprised nobody has suggested working with and testing with "ntpdate -q" here, as well.

Folks have mentioned firewalls but (given the numbers of subnets I see there) it's also easily feasible for the managed switches and VLANs likely in use here to be helpfully and silently dropping some of your traffic into the bit bucket.

Never trust your connectivity when you have a managed LAN around. (Which is one of the reasons it's best to test with the protocol itself. ping is good for the serious router and switch configuration mistakes, but then you need to move up the stack.)

Get your network folks involved, and see if they've done something to block the ntp UDP traffic here.
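
The point about testing with the protocol itself, rather than with ping, can be illustrated with a small probe that sends a single SNTP client request over UDP port 123 and waits for an answer. This is only a sketch in Python, not the `ntpdate -q` tool mentioned above and not anything posted in the thread; the function names are hypothetical, and a reply only proves that NTP traffic passes from the machine where the script runs, which may not be node A.

```python
import socket
import struct

NTP_PORT = 123
# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_UNIX_DELTA = 2208988800


def build_request():
    """Return a 48-byte SNTP request: LI=0, version 3, mode 3 (client)."""
    return b"\x1b" + 47 * b"\0"


def parse_reply(packet):
    """Return (stratum, server transmit time in Unix seconds)."""
    if len(packet) < 48:
        raise ValueError("short NTP packet")
    stratum = packet[1]
    # The transmit timestamp occupies bytes 40..47: 32-bit seconds
    # followed by a 32-bit binary fraction, both big-endian.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return stratum, seconds - NTP_UNIX_DELTA + fraction / 2**32


def probe(server, timeout=5.0):
    """Send one request; return parse_reply() output or None on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_request(), (server, NTP_PORT))
            packet, _ = sock.recvfrom(512)
        except socket.timeout:
            return None  # nothing came back: UDP/123 may be filtered
    return parse_reply(packet)
```

Calling `probe("10.20.96.171")` returns a `(stratum, time)` pair if the server answers and `None` if nothing arrives within the timeout. A `None` result while ping and traceroute succeed would support the suspicion that something on the path is dropping the ntp UDP traffic.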

Jim_McKinney · ‎03-19-2009

You might consider having your A-node fetch time from the B-node.

When using NTP in a cluster, I generally have only one node poll the external time source and have any other cluster members poll that node. Additionally, I configure the externally facing cluster node to also serve as a local master clock (usually at some high stratum value such as 8 or so). By doing this, if the link to the external clock fails, the node that is the local master will continue to serve up time from its own system clock even though it has lost connection with its external time source. The benefit of this is that all clocks in the cluster will remain in sync even if they are not necessarily in sync with a reliable clock. Since most cluster members run common applications, I find that having the clocks on all members in sync is usually more important than them being exactly correct. Your application may dictate otherwise. (I imagine that HP's TCPIP supports the local master concept via a directive such as "local_master 8" or somesuch in the TCPIP$NTP.CONF that you previously referenced - I'm a long time MultiNet user and don't know how HP's stack functions in this area).

I realize that this does not address the issue of why your Node-A is not synching to your external time source - just presenting you a different perspective on time management in a cluster.

Oswald Knoppers_1 · ‎03-19-2009

Are you sure the configuration files are the same? Your attachment (a couple of replies before) only shows the configuration file of node B.

Oswald
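
One way to settle the question of whether two configuration files really match is to compare only their effective directives, ignoring comments, blank lines and spacing. The sketch below does this in Python; it is not from the thread, the function names are hypothetical, and it assumes both TCPIP$NTP.CONF files have been copied somewhere Python can read them as text. On OpenVMS itself the DCL DIFFERENCES command compares two files directly.

```python
import difflib


def effective_lines(text):
    """Strip comments and blank lines, leaving only the directives."""
    result = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop anything after '#'
        if line:
            result.append(" ".join(line.split()))  # normalize spacing
    return result


def compare_configs(text_a, text_b):
    """Return unified-diff lines; an empty list means no effective difference."""
    return list(difflib.unified_diff(
        effective_lines(text_a),
        effective_lines(text_b),
        fromfile="NODE_A",
        tofile="NODE_B",
        lineterm="",
    ))
```

An empty result from `compare_configs` means the two files configure the daemon identically, so the cause has to be looked for elsewhere (network path, stale process, log file, and so on).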

Karl-Heinz Kwarda · ‎03-19-2009

Jim,

this sounds good to me. I don't insist on both nodes getting the time from an external timeserver. It is OK for me if only one system gets the time.
What has to be done so that NODE_B becomes the timeserver for NODE_A?
Can you provide an example of how TCPIP$NTP.CONF must look on both nodes?

Karl-Heinz

The Brit · ‎03-19-2009

Jim's solution works very well. This is how we do it at my site. One cluster node acts as the primary time server for the cluster and gets its time from an external time server. The remaining cluster nodes use this "primary" node as their time server.

On the "Primary" node (DAFFY), the *.conf looks like

# Your NTP configuration file should always include the following
# driftfile entry. The driftfile is the name of the file that stores
# the clock drift (also known as frequency error) of the system clock.

driftfile SYS$SPECIFIC:[TCPIP$NTP]TCPIP$NTP.DRIFT

#
# Get the time from USNO Washington DC. (Primary)
#
#

server ntp2.usno.navy.mil # US Naval Observatory, Washington, DC.
server time-a.nist.gov # NIST, Gaithersburg, Maryland
server time-b.nist.gov # NIST, Gaithersburg, Maryland

#
# Configure DAFFY as a Backup Time Server.
#

peer 10.xxx.110.xxx # DAFFY

#Server 127.127.1.0
#fudge 127.127.1.0 stratum 6

#

The other nodes' *.conf files just contain

#
# Get the time from DAFFY. (Primary)
#
#

server 10.xxx.110.xxx # DAFFY

#

Dave

Jim_McKinney · ‎03-19-2009

Dave -

Is uncommenting the following two NTP.CONF directives in your example the mechanism that is used with this stack to implement a local_master?

#Server 127.127.1.0
#fudge 127.127.1.0 stratum 6

Karl-Heinz Kwarda · ‎03-19-2009

Dave,

great job! That works!!
Thanks a lot!

Karl-Heinz

Karl-Heinz Kwarda · ‎03-19-2009

closing

The Brit · ‎03-25-2009

Jim,
I believe you are correct, and you can adjust the point at which the Local time server takes over by adjusting the stratum number. (Higher number = lower priority)

Dave.
