Re: ntp problem

donna hofmeister · ‎01-16-2009

i've got an 11.11 server that refuses to sync (or perhaps stay sync'd is a better way to say it) with its time servers.

these time servers are used by many other servers, so i don't think the problem is with them.

if i run ntpdate -d [time server], i can see that there is two-way communication occurring. at the end of ntpdate, i can see a message that the clock is being adjusted.

the kicker is if i run ntpq -p none of the time servers listed show up with anything in column 1 (no *, + or -). also each server's dispersion value (number in the last column) is huge -- in the 15-16k range.

so what gives? i can talk to the time servers (tracert and ping are joyous too, btw) but ntp just won't behave.

Mark McDonald_2 · ‎01-16-2009

Hi

What is in your ntp.conf?
Does the drift file exist?

Jeeshan · ‎01-16-2009

did you start the xntpd daemon?
did you make changes in /etc/rc.config.d/netdaemons file?
check that the ntp port (UDP 123) is accessible.

a warrior never quits

Matti_Kurkela · ‎01-17-2009

NTP tries to get the local clock set to within +/- 128 milliseconds of the remote clock(s). If the difference goes beyond that, xntpd logs a "synchronization lost" message in the syslog.

To get within that +/- 128 ms target, it needs to measure the travel time of its query packets and correct for it. The average round-trip delay is indicated in the "delay" column of the ntpq -p output.
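For reference, the correction described above is the standard NTP calculation from four timestamps per query. A minimal Python sketch follows; the function name and the sample timestamps are made up for illustration and do not come from this thread.

```python
def ntp_offset_delay(t1, t2, t3, t4):
    """Return (offset, delay) in seconds for one NTP exchange.

    t1: client transmit time (client clock)
    t2: server receive time  (server clock)
    t3: server transmit time (server clock)
    t4: client receive time  (client clock)
    """
    # Round-trip time on the wire, excluding the server's processing time.
    delay = (t4 - t1) - (t3 - t2)
    # Estimated amount the server clock is ahead of the client clock,
    # assuming the outbound and return paths take equally long.
    offset = ((t2 - t1) + (t3 - t4)) / 2.0
    return offset, delay


# Hypothetical exchange: client is 0.130 s behind the server,
# 0.013 s network transit each way, 0.001 s server processing.
offset, delay = ntp_offset_delay(t1=100.000, t2=100.143, t3=100.144, t4=100.027)
print(f"offset={offset:.3f} s delay={delay:.3f} s")
```

The offset estimate is only as good as the assumption that both directions take the same time, which is why widely varying round-trip times hurt so much.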

However, if the round-trip delay varies widely between one query and the next, the average delay estimate won't be of much help. The "disp" column indicates how much the round-trip time seems to vary. If it's anything over 100 (milliseconds) or so, staying in sync with a remote NTP server is mostly the result of some good luck. With a dispersion of 15-16k, it's not going to happen.

Your pings may get through 100%, but how are the round-trip times? Let the ping command run for a while (e.g. 100 pings or so), and then examine the round-trip times.
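One way to examine the round-trip times is to summarize their spread rather than just the average. The sketch below is only an illustration in Python: the sample values are invented, and you would have to collect the real numbers from your own ping output by hand or with a parser of your choice.

```python
import statistics


def summarize_rtts(rtts_ms):
    """Summarize a list of ping round-trip times given in milliseconds."""
    return {
        "min": min(rtts_ms),
        "avg": statistics.mean(rtts_ms),
        "max": max(rtts_ms),
        # Population standard deviation: a rough measure of jitter.
        "stdev": statistics.pstdev(rtts_ms),
    }


# Hypothetical samples: mostly sub-millisecond, with one large outlier.
samples = [0.3, 0.4, 0.3, 12.0, 0.4]
summary = summarize_rtts(samples)
print(summary)
```

A max or standard deviation far above the minimum would point toward congestion; consistently small and tight values would point away from the network and toward the local clock.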

I think there are at least two possible causes:

1.) Your network may be congested.
The solution: find the congested switch/segment/router and get more bandwidth where it's needed.

2.) There might be a hardware/software problem that is messing up your local real-time clock. What's your server model? Are you running vPars? How up to date are you with Quality Packs?

(Back when we set up our first vPars on rp8400s, we had some clock problems. I seem to recall the root cause was hardware-related, but there may have been some kernel/vPar patches related to that too - maybe interim fixes or work-arounds?)

MK

BUPA IS · ‎01-18-2009

Hello,
as others have suggested, it may be a local hardware clock issue or it may be network congestion/routing.

Please post your ntp.conf file and the output of ntpq -p
also
xntpdc -c loopinfo
xntpdc -c sysinfo
what is the offset when you run ntpdate ?

Mike

Help is out there always!!!!!

donna hofmeister · ‎01-19-2009

This system is an npar in a SD64B complex.

afaik, it's up-to-date on patches. of most importance, the latest auto_parms and ntp patches are installed.

here's an ntpq -p:

     remote           refid      st t when poll reach   delay   offset    disp
==============================================================================
 ns1.ffdc.xxx.co frfdca60ntp01.p  2 u   22   64    7     0.38  310.772 3953.75
 ns2.ffdc.xxx.co frfdca60ntp01.p  2 u   13   64   17     0.27  322.417 1983.26
 ns3.ffdc.xxx.co sndgca64ntp01.p  2 u   60   64    7     0.27  264.304 3953.77
/root> date
Sat Jan 17 12:24:54 PST 2009

a couple of comments about the above
. xxx in the node name is a cover-up
. this snapshot came from having left xntpd run for about 20 hours. the disp numbers are clearly down from the 15-16k range but still no joy with achieving a sync
. just for giggles output from 'date' is also shown. the date and time are accurate (according to when this was grabbed)

here's an example of ntpdate output. (this was grabbed at a different day/time than the above!):

ntpdate -d ns1.ffdc.xxx.com
transmit(150.234.210.5)
receive(150.234.210.5)
transmit(150.234.210.5)
receive(150.234.210.5)
transmit(150.234.210.5)
receive(150.234.210.5)
transmit(150.234.210.5)
receive(150.234.210.5)
transmit(150.234.210.5)
server 150.234.210.5, port 123
stratum 2, precision -20, leap 00, trust 000
refid [150.234.134.30], delay 0.02623, dispersion 0.00053
transmitted 4, in filter 4
reference time: cd0f8e48.588e368f Wed, Jan 7 2009 12:25:44.345
originate timestamp: cd0f9133.f41e364b Wed, Jan 7 2009 12:38:11.953
transmit timestamp: cd0f9133.d182b000 Wed, Jan 7 2009 12:38:11.818
filter delay: 0.02623 0.02779 0.02721 0.02863
0.00000 0.00000 0.00000 0.00000
filter offset: 0.132466 0.133242 0.132907 0.133669
0.000000 0.000000 0.000000 0.000000
delay 0.02623, dispersion 0.00053
offset 0.132466

7 Jan 12:38:11 ntpdate[14093]: adjust time server 150.234.210.5 offset 0.132466 sec

we're trying to get the network people involved to look into any network congestion issues. but fwiw, network statistics are showing zero errors and tiny (<10) retransmit values.

i'll get the xntpdc output to you as soon as i can.

thanks!

donna hofmeister · ‎01-19-2009

ok...

*every* npar on this 'dome is exhibiting the same ntp problem (oh my! i just found this out!)

here's some more output:

/root> xntpdc -c loopinfo
offset: 0.439578 s
frequency: -0.688 ppm
poll adjust: 6
watchdog timer: 278 s
/root> xntpdc -c sysinfo
system peer: 0.0.0.0
system peer mode: unspec
leap indicator: 11
stratum: 16
precision: -17
root distance: 0.00475 s
root dispersion: 1.00179 s
reference ID: [150.234.210.5]
reference time: cd1f317f.5e12b000 Mon, Jan 19 2009 9:06:07.367
system flags: monitor pll stats
frequency: 64.000 ppm
stability: 63.430 ppm
broadcastdelay: 0.003906 s
authdelay: 0.000122 s

BUPA IS · ‎01-19-2009

Donna,
The reach output of 7 or 17 (reach is an octal bitmask of the last eight polls) indicates that only the last 3 or 4 polls were successful or the clock was recently reset. This can be caused by an argument between ntp and the local clock.

Your ntpq -p does not list it, but can you confirm that there is no local clock entry in your ntp.conf file, or remove these two lines if you have them?
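To make the reach values easier to read: the column is an 8-bit shift register printed in octal, with the most recent poll in the lowest bit. A small Python sketch of the decoding (the function names are mine, not part of any NTP tool):

```python
def decode_reach(reach_octal):
    """Decode an ntpq 'reach' value into eight booleans, oldest poll first."""
    value = int(reach_octal, 8)      # the column is printed in octal
    bits = format(value, "08b")      # 8-bit shift register as a bit string
    return [bit == "1" for bit in bits]


def recent_successes(reach_octal):
    """Count consecutive successful polls, ending with the most recent one."""
    count = 0
    for ok in reversed(decode_reach(reach_octal)):
        if not ok:
            break
        count += 1
    return count


for reach in ("7", "17", "37", "377"):
    print(reach, recent_successes(reach))
```

A server that has answered every one of the last eight polls shows 377.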

server 127.127.1.0 # local clock
fudge 127.127.1.0 stratum 10

Can you grep syslog.log for ntp to see if there are any reset or other messages relating to time being altered?

Help is out there always!!!!!

donna hofmeister · ‎01-19-2009

here's the (slightly obfuscated) ntp.conf file:

server ntp1.ffdc.xxx.com
server ntp2.ffdc.xxx.com
server ntp3.ffdc.xxx.com

restrict default notrust nomodify noserve
restrict 150.234.210.5 mask 255.255.255.255 nomodify
restrict 150.234.210.205 mask 255.255.255.255 nomodify
restrict 150.234.210.18 mask 255.255.255.255 nomodify
restrict 127.0.0.1 mask 255.255.255.255
enable monitor

driftfile /etc/ntp.drift

statsdir /var/tmp/ntp

(this file is used by many many systems)

is there a known issue with npars on 64B 'domes? does a particular cpu production run cause this kind of problem?

donna hofmeister · ‎01-19-2009

here's a recent ntpq.

:/root> ntpq -p
     remote           refid      st t when poll reach   delay   offset    disp
==============================================================================
 ns1.ffdc.xxx.co frfdca60ntp01.p  2 u   64   64   17     0.34  314.056 1983.26
 ns2.ffdc.xxx.co frfdca60ntp01.p  2 u   55   64   37     0.35  325.288 1002.99
 ns3.ffdc.xxx.co sndgca64ntp01.p  2 u   38   64   37     0.41  346.199 1002.91

the disp number is s-l-o-w-l-y going down (compare this to the earlier one shown above)

rick jones · ‎01-19-2009

those are some surprisingly large offsets if indeed ntpdate is run to set the clock off one of the servers at boot time before xntpd is started. is NTPDATE_SERVER set in /etc/rc.config.d/netdaemons? (I may have the filename wrong there)

there is no rest for the wicked yet the virtuous have no pillows

donna hofmeister · ‎02-23-2009

sorry for not getting back on this until now...

updating the story a bit...so i've found out that there is a mix of 11.11 and 11.23 vpars in one sd64b. i've attached a netstat -s from one of the vpars. while i know it's just a snapshot in time, it still has a story to tell, i think.

the current values in nddconf are:

TRANSPORT_NAME[0]=tcp
NDD_NAME[0]=tcp_conn_request_max
NDD_VALUE[0]=15000

TRANSPORT_NAME[1]=tcp
NDD_NAME[1]=tcp_fin_wait_2_timeout
NDD_VALUE[1]=120000

tcp_conn_request_max was just added at my request (i don't know why [0] was selected). tcp_fin_wait_2_timeout is apparently a value that is "always set" and no one seems to know why.

given the little bit of feedback that i have, adding max connection requests hasn't helped.

characteristics that continue to be seen are:
. ntpq won't sync (still)
. "slow" network response time
. nettl traces are showing retransmits, dup acks, tcp out-of-order (truly unpleasant stuff)

rick jones · ‎02-23-2009

tcp_conn_request_max merely sets the overall limit on the maximum size of a TCP listen queue - the actual limit being the lesser of tcp_conn_request_max and what the application passes in on its listen() calls.

The other one is a kludge timeout to deal with FIN_WAIT_2 connections.

The 0 and 1 are the array indices, which in this context start at zero. The nddconf file will later be "sourced" to create shell arrays of ndd settings to be applied...

As NTP makes no use of TCP, neither tunable should have any bearing on NTP problems :)

As for the netstat stats, while the TCP stats do indeed look a little troubling, and imply there may be almost a 1% packet loss rate, which can be doubleplusungood for TCP performance (depending on the nature of the traffic), a 1% packet loss rate for NTP shouldn't be all that bad.

That the UDP stats for bad header and socket overflow are the same suggests there is a bug in the UDP stats.

For future reference, running pairs of netstat snapshots through beforeafter:

ftp://ftp.cup.hp.com/dist/networking/tools/

can be helpful - particularly if the snapshots are over an "interesting" interval.

*Is* there indeed something configured for NTPDATE_SERVER in the /etc/rc.config.d/netdaemons file? (might have some syntax errors there) My understanding is that an ntpdate -d command will not _actually_ adjust system time:

-d Enable the debugging mode, in which ntpdate will go through all the steps, but not adjust the local clock. Information useful for general debugging will also be printed.

Picking your most reliable NTP server to put in the NTPDATE_SERVER entry is goodness - it will make sure that the system time is "close" to that of an ntp server at startup.

there is no rest for the wicked yet the virtuous have no pillows

donna hofmeister · ‎02-23-2009

regarding nddconf and [0] and [1] -- thanks. i was thinking they referred to lan cards (in a netconf-kinda way).

is "doubleplusungood" trademarked? that can even be used in polite society :-)

i'll dig into the state of the network patches. in my old mpe days, i could ask what the current set of network patches were. is there a simple answer to the same question in hp-ux-speak?

yes, i know about before-n-after :-) and if i had more than one netstat...i would have run it.

regarding ntp -- i don't think there is an ntp problem. rather it's a symptom.

rick jones · ‎02-23-2009

MPE days eh - well, at least I'm not suggesting "netcontrol start" :) (I used to do CPE and development on MPE/XL from about 1.2 to 2.2/3.0ish)

Yes, doubleplusungood can be used in polite society - one plusbeneficial side-effect of Newspeak is making it impossible to form an impolite thought in addition to a seditious one :)

1% packet loss rates (guesstimated by the ratio of data segments retransmitted to data segments sent) shouldn't be all that bad for NTP. It will make reachability scores a little worse, but NTP really ought to be able to handle that. Did you have some other root cause in mind when looking at the netstat statistics?

there is no rest for the wicked yet the virtuous have no pillows

BUPA IS · ‎02-24-2009

Donna,
Between your first post and this one we have fixed a similar problem with another manufacturer's system by applying a firmware fix. Basically the hardware clock was not being maintained correctly and was arguing with NTP.
You mentioned that all the partitions on the system suffered from the same problem.

I found this patch for the superdome firmware
Patch Name: PF_CSFW0006
In addition,
- Real time clock showing inaccurate time.

Patch Description: HP Superdome Utility Firmware 7.34 and PDC 36.8
release notes here

http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=id&prodTypeId=15351&prodSeriesId=322811&swItem=PF-CSFW0009&prodNameId=322813&swEnvOID=7&swLang=13&taskId=135&mode=4&idx=0

I hope this might be of some use.

Mike

Help is out there always!!!!!

donna hofmeister · ‎02-24-2009

Was the link correct? It seems to be pointing to a rather old release:
Version: 36.8 (8 May 2006)

BUPA IS · ‎02-24-2009

Donna,
That was the first link I came across where I found a reference to a real time clock error.

I suppose I should have asked what firmware level you are on and which model of superdome you have installed before rushing into print.

Meanwhile I have had another look and found this one from Nov 2007

PDC_FW 043.006.000 contains the following fixes:
...
Time drift for HP-UX operating systems up to 48 minutes per month. This has been fixed.

http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=15351&prodSeriesId=322811&swItem=ux-55398-1&mode=4&idx=2
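To put the "48 minutes per month" figure from those release notes in NTP terms, it can be converted to parts per million. This is only a back-of-the-envelope sketch in Python; the 30-day month and the function name are my assumptions, and nothing in this thread confirms that these partitions drifted at that rate.

```python
SECONDS_PER_DAY = 86400


def drift_ppm(drift_seconds, interval_seconds):
    """Express a clock drift over an interval as parts per million."""
    return drift_seconds / interval_seconds * 1e6


# 48 minutes of drift over an assumed 30-day month.
ppm = drift_ppm(48 * 60, 30 * SECONDS_PER_DAY)
print(f"{ppm:.0f} ppm")
```

For comparison, ntpd is commonly documented as correcting frequency errors of up to about 500 ppm; whether xntpd on HP-UX has exactly the same limit is an assumption here. A clock drifting at the rate computed above would be well outside that range, so the daemon could not hold sync no matter how good the network was.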

Mind you I may still be completely wrong

Mike

Help is out there always!!!!!

donna hofmeister · ‎02-24-2009

mike -- thanks for both links. we're looking to see if the 2nd/newer one is applicable.

rick -- ntp apparently is just the tip of the iceberg. I've asked them to apply the latest transport and dependent patches since the 1st rule in network problem solving is *patch* :-)
(ok...so i'm late in applying the rule)

stay tuned...

rick jones · ‎02-24-2009

being up on latest patches is generally goodness - however, 1% TCP retransmission rate is not, in and of itself, cause to patch.

there is no rest for the wicked yet the virtuous have no pillows

donna hofmeister · ‎04-20-2009

at long last, one more update...

bupa-mike hit the nail on the head! it was indeed firmware. with the latest 'recipe' applied all the impacted partitions are joyfully in-sync.

thank you much! - d

ps, rick don't worry, you're still da man. while most things are network problems, occasionally one or two problems aren't :-)
