
TCPIP name resolver not failing over to second dns

Clark Powell
Frequent Advisor

TCPIP name resolver not failing over to second dns

On TCPIP Services V5.6 - ECO 3 I have two name servers defined:
ALPHAD> tcpip sho name
BIND Resolver Parameters
Local domain: INTERNAL.VMMC.ORG
System
State: Started, Enabled
Transport: UDP
Domain: INTERNAL.VMMC.ORG
Retry: 4
Timeout: 4
Servers: NS4.VMMC.ORG, NS3.VMMC.ORG
Path: INTERNAL.VMMC.ORG, VMMC.ORG, VMAD.VMMC.ORG, RESTRICTED.VMMC.ORG

But when NS4 was being rebooted the name resolver took 20 seconds to fail over to NS3. For example, starting nslookup took this much time to get a prompt:
ALPHAD> nslookup
*** Can't find server name for address 10.25.0.50: No response from server
Default Server: ns3.vmmc.org
Address: 10.57.0.50

>

If I have two name servers defined, I should have no problem when one of them is down. Does anyone have a suggestion as to why this is not happening?

thanks
Clark Powell
PS NS1 and NS2 are defined in the local host table.
6 REPLIES
Clark Powell
Frequent Advisor

Re: TCPIP name resolver not failing over to second dns

I should also point out that we have been using two new name servers since midnight Wednesday. We did not have this problem with the old pair of name servers, even though one of the two would be shut down for maintenance from time to time.
Graham Burley
Frequent Advisor

Re: TCPIP name resolver not failing over to second dns

Did normal name resolution take a similar time to return results? e.g. telnet name, ping name, etc.

I don't think it's correct to assume that nslookup behaviour is the same as the name resolver's. As I understand it, nslookup is a DNS query tool that talks directly to name servers, rather than calling on the name resolver to return results (which might be cached).
labadie_1
Honored Contributor

Re: TCPIP name resolver not failing over to second dns

Maybe seeing the dialog would help:

$ define tcpip$bind_res_options "debug"
and then, a command, such as
$ ucx sh host xxx
or
$ telnet xxx
The Brit
Honored Contributor

Re: TCPIP name resolver not failing over to second dns

Hi Clark,
I suspect that the problem is that the name server requests name resolution from the listed DNS machines using a "Round Robin" approach.

i.e. the resolution request is sent to the first server on the list. It will then wait for that request to time out before sending the request to the second server on the list.

I don't know if that timeout interval is adjustable; however, I know that it can seriously slow down the name resolution process (within the Name Service).

I agree with the previous respondent that I would not consider a request via nslookup as being equivalent to using the Name Service.

If this is likely to be a (semi-) permanent situation, I would suggest rearranging the order of the DNS servers in the name service server list.
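
To make the ordering argument concrete, here is a minimal Python sketch of the behaviour described above. It is only a model under stated assumptions: servers are tried strictly in list order, and each unresponsive server costs one full timeout before the next one is tried. The function name `first_responder` is hypothetical, and the real BIND resolver may well behave differently.

```python
def first_responder(servers, down, timeout):
    """Return (server, seconds_waited) for the first server that answers.

    Assumed model, not the actual resolver: servers are tried in list
    order, and every server that is down costs one full timeout.
    """
    waited = 0
    for server in servers:
        if server not in down:
            return server, waited
        waited += timeout
    # Nobody answered; report the total time spent waiting.
    return None, waited


servers = ["NS4.VMMC.ORG", "NS3.VMMC.ORG"]
down = {"NS4.VMMC.ORG"}

# Current order: the dead server is asked first.
print(first_responder(servers, down, timeout=4))
# Reordered list: the live server is asked first.
print(first_responder(list(reversed(servers)), down, timeout=4))
```

In this model, putting the live server first removes the wait entirely, which is the reason for suggesting the reorder.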

Dave.
Hoff
Honored Contributor
Solution

Re: TCPIP name resolver not failing over to second dns

For controlling the timeouts:

TCPIP> SET CONFIG NAME_SERVICE /RETRY=n /TIMEOUT=n

If you have hosts you hit often or that are critical hosts and you have flaky DNS, you can add entries in the local hosts file (TCPIP> SET HOST) to skip the need for the DNS resolver lookup for critical hosts.

The DNS Server uses round robin (and variously with a random starting point), but I don't know off-hand what the BIND resolver uses. (And caching would distort that, regardless.)

I'm running small and dedicated server boxes for (among other things) local DNS resolution. (And yes, in preference to using OpenVMS for DNS.) This as a result of occasionally flaky upstream DNS.
Clark Powell
Frequent Advisor

Re: TCPIP name resolver not failing over to second dns

Thanks, all.
I will lower the timeout from 4 to 3, since we have local name servers, and the retry to 2. The name service was still working while we rebooted one of our two name servers, but our applications apparently found the 16-second delay (i.e. the defaults: retry=4, timeout=4) intolerable. I think the default numbers were probably designed for machines on the internet, while machines on a private network do not need all that extra time and all those tries.
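
For the arithmetic behind those numbers, here is a small Python sketch. It assumes the simple model used in this post, where the worst-case wait on a dead server is retry multiplied by timeout; the function name `worst_case_delay` is hypothetical, and the actual resolver may schedule its retries differently.

```python
def worst_case_delay(retry, timeout):
    """Worst-case seconds spent on an unresponsive server.

    Assumption: the delay is simply retry * timeout, as in this post.
    """
    return retry * timeout


print(worst_case_delay(retry=4, timeout=4))  # defaults
print(worst_case_delay(retry=2, timeout=3))  # planned settings
```

Under that assumption the defaults give 16 seconds and the planned settings give 6 seconds.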