Faulty DNS and resolv/nsswitch.conf settings?

SOLVED
Doug O'Leary
Honored Contributor

Hi;

 

I have a client whose DNS is occasionally flakey.  For whatever reason, DNS services on the first nameserver stop, which causes all sorts of network slow-downs as the timeouts mount for each DNS call before the resolver moves on to the second nameserver.

 

One of the suggestions that I've heard is to put "options timeout:1" in the /etc/resolv.conf file; however, according to the resolv.conf man page, that isn't a valid option.

 

Another suggestion was to put files first in the nsswitch.conf file; however, that's already specified *and* all the hosts in this particular environment are defined in /etc/hosts on each node.  

 

# gsh prod grep hosts /etc/nsswitch.conf
dzprdap1: hosts: files [NOTFOUND=continue] dns [NOTFOUND=return]
dzprdap2: hosts: files [NOTFOUND=continue] dns [NOTFOUND=return]
dzprdap3: hosts: files [NOTFOUND=continue] dns [NOTFOUND=return]
dzprddb1: hosts: files [NOTFOUND=continue] dns [NOTFOUND=return]
dzprddb2: hosts: files [NOTFOUND=continue] dns [NOTFOUND=return]

 

For whatever reason, though, when the first nameserver is down, sudo commands, for instance, take 35 seconds to complete. Once the DNS service is restored, sudo access completes in a quarter of a second.

 

An ITRC post suggested modifying the nsswitch.conf file; however, nothing specific seems to work and, conceptually, it shouldn't.  For instance:

 

hosts: files [NOTFOUND=continue] dns [NOTFOUND=return UNAVAIL=continue TRYAGAIN=continue]

 

The continues in the line above are supposed to continue to the next source in nsswitch.conf - not the next nameserver in resolv.conf, correct?  Since there are no other sources in the list, that shouldn't do anything for me.

 

Is there any supported method of getting DNS to failover to the second nameserver in the list if the first is unavailable and/or does anyone know why name resolution doesn't seem to hit hosts first regardless of what's in nsswitch.conf?

 

Thanks for any hints/tips/suggestions.

 

Doug O'Leary
------
Senior UNIX Admin
O'Leary Computers Inc
linkedin: http://www.linkedin.com/dkoleary
Resume: http://www.olearycomputers.com/resume.html
4 REPLIES
Matti_Kurkela
Honored Contributor
Solution

Re: Faulty DNS and resolv/nsswitch.conf settings?

> One of the suggestions that I've heard is to put "options timeout:1" in the /etc/resolv.conf file; however, according to the resolv.conf man page, that isn't a valid option.

 

It may be valid for some other Unix, but not for HP-UX. However, if you've read the resolv.conf man page, I suppose you noticed the retrans and retry keywords?


The default values are 5000 ms and 4 retries, working out to a total of 20 seconds. So the delay of 35 seconds may imply that the client end is first experiencing a DNS lookup delay before actually initiating the connection, then the server is looking up the client hostname of the incoming connection and experiencing another DNS lookup delay before accepting the connection.
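
As a rough check on the arithmetic above, the worst-case wait on one dead nameserver can be sketched as retrans times retry. This is a simplified model that assumes every attempt waits the full retrans interval with no backoff; the function name and the "tuned" values are illustrative only, not recommendations.

```python
def worst_case_wait_seconds(retrans_ms: int, retry: int) -> float:
    """Upper bound on time spent on one unresponsive nameserver.

    Assumes each attempt waits the full retrans interval and that the
    interval does not grow between attempts (a simplification).
    """
    return retrans_ms * retry / 1000.0


# Defaults quoted above: 5000 ms and 4 retries.
default_wait = worst_case_wait_seconds(5000, 4)

# Hypothetical tuned values, for comparison only.
tuned_wait = worst_case_wait_seconds(1000, 2)

print(f"default: {default_wait:.0f} s, tuned: {tuned_wait:.0f} s")
```

Under this model, two back-to-back lookups against a dead first nameserver would cost up to twice the single-lookup figure, which is the kind of stacking described above.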

 

When working with a flakey DNS, if you are using tcpwrappers, you should make sure your tcpwrapper configuration uses only IP-based rules if possible. The source IP address of the incoming connection is known by definition; to know the hostname, the tcpwrapper needs to perform a reverse hostname lookup.

 

> does anyone know why name resolution doesn't seem to hit hosts first regardless of what's in nsswitch.conf?

 

You're looking at the hosts line in /etc/nsswitch.conf. In HP-UX 11.23 and newer, many/most basic tools are IPv6 aware and will use the ipnodes line instead, even for IPv4.

 

In modern HP-UX, the hosts line is for the classic IPv4-only API (the gethostent(3N) family of functions); the ipnodes line controls the new unified IPv4/IPv6 hostname resolution API (the getaddrinfo(3N) family of functions). Only if the ipnodes lookup fails *AND* the caller has specified it wants only IPv4 addresses (either true IPv4 or IPv4 mapped in IPv6) will the new API fall back to the hosts line.

 

(See man 3N getaddrinfo and read carefully the paragraph titled "Name Service Switch-based operation".)

 

If the ipnodes line does not exist, the system uses a built-in default... which is "first dns, then hosts" for ipnodes. I've been burned by this in the past, and I don't think I'm the only one...
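
A quick way to spot this across many nodes is to check whether both lines are explicitly present. The sketch below is a minimal parser, not a full implementation of the nsswitch.conf syntax; the function names are made up for illustration and the sample text is the hosts line from the original question.

```python
def switch_sources(nsswitch_text: str) -> dict:
    """Map each database name (e.g. 'hosts') to its raw source string."""
    entries = {}
    for line in nsswitch_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        database, sources = line.split(":", 1)
        entries[database.strip()] = sources.strip()
    return entries


def missing_databases(nsswitch_text: str, wanted=("hosts", "ipnodes")) -> list:
    """Return the wanted databases that have no explicit line."""
    entries = switch_sources(nsswitch_text)
    return [name for name in wanted if name not in entries]


sample = "hosts: files [NOTFOUND=continue] dns [NOTFOUND=return]\n"
print(missing_databases(sample))  # the sample has no ipnodes line
```

To run it against a real node, read /etc/nsswitch.conf with open() and pass the contents to missing_databases().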

MK
Bill Hassell
Honored Contributor

Re: Faulty DNS and resolv/nsswitch.conf settings?

I try to avoid DNS at all costs for servers. They just don't need a lot of unique addresses so the hosts file won't need hundreds of lines. And most server situations have stable addresses, so the benefit of a central name server simply isn't worth the severe impact of a flakey DNS system. A 24x7 database becomes a mess when DNS goes awry.

 

I put this line on every server, any version from 10.20 to current:

 

hosts:        files [NOTFOUND=continue UNAVAIL=continue] dns
ipnodes:    files [NOTFOUND=continue UNAVAIL=continue] dns

(yeah, I know old systems don't have IPv6, but the ipnodes line is just ignored in those cases)

 

Also, a lot of commercial backup programs (Data Protector included) do something incredibly stupid during backups by querying the resolver for *every* file being backed up. 50 million files, 50 million useless queries. This can clobber a nameserver and generate a lot of meaningless traffic. Why the backup programs don't create a cache of the few machines that are clients at the start of a backup is beyond my comprehension.



Bill Hassell, sysadmin
Matti_Kurkela
Honored Contributor

Re: Faulty DNS and resolv/nsswitch.conf settings?

> Also, a lot of commercial backup programs (Data Protector included) do something incredibly stupid during backups by querying the resolver for *every* file being backed up. 50 million files, 50 million useless queries. This can clobber a nameserver and generate a lot of meaningless traffic. Why the backup programs don't create a cache of the few machines that are clients at the start of a backup is beyond my comprehension.

 

This sounds like a good case for implementing a cache-only DNS server locally on the backup server. Set it to respond to queries from 127.0.0.1 only and you can be assured the extra load caused by the DNS server will be pretty negligible with modern hardware. I do this for backup servers, monitoring servers and other things that need to know about a lot of hosts all over a large enterprise network.

 

Also, for extra DNS reliability for backups, you might upgrade the cache-only DNS server to a stealth-slave for your internal zones: ask your DNS admin to enable zone transfer and update notifications from one of the regular DNS servers to the backup server, add slave zone declarations for the internal zones in the BIND configuration file, but *don't* add NS records to the zones as you would with a normal slave DNS server.

 

With this configuration, your backup server will be guaranteed to always have a complete, up-to-date copy of your internal zones available locally, even if all the other DNS servers go dark. Since the backup server is not listed in the zones' NS records, nobody else will send DNS queries to the backup server.

 

Remember that if the connection between the backup server with the stealth-slave DNS server and the regular DNS server fails, your backups will keep "working just fine" for the duration of the DNS data expiration time (configurable in the DNS SOA record for the zone, often a week or two), and then suddenly fail when the slave's zone data expires. So set up monitoring to alert someone if the internal DNS zone transfers start failing.
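
The expiry countdown can be sketched as below. This is only the arithmetic, under the assumption that you already have the time of the last successful zone transfer from somewhere (how you obtain it is left open here); the function names and the 50% warning threshold are hypothetical.

```python
import time
from typing import Optional


def seconds_until_expiry(last_transfer_epoch: float, soa_expire: int,
                         now: Optional[float] = None) -> float:
    """Seconds left before the slave's copy of the zone expires."""
    if now is None:
        now = time.time()
    return (last_transfer_epoch + soa_expire) - now


def should_alert(last_transfer_epoch: float, soa_expire: int,
                 warn_fraction: float = 0.5,
                 now: Optional[float] = None) -> bool:
    """True once more than warn_fraction of the expire window has passed."""
    remaining = seconds_until_expiry(last_transfer_epoch, soa_expire, now)
    return remaining < soa_expire * (1 - warn_fraction)


two_weeks = 14 * 24 * 3600  # example SOA expire value, in seconds

# One day after the last good transfer: no alert yet.
print(should_alert(0, two_weeks, now=24 * 3600))

# Eight days after the last good transfer: past the halfway mark.
print(should_alert(0, two_weeks, now=8 * 24 * 3600))
```

Alerting well before the expire time leaves room to fix the transfer path while the backups are still resolving names normally.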

MK
Doug O'Leary
Honored Contributor

Re: Faulty DNS and resolv/nsswitch.conf settings?

Hey;

 

>>You're looking at the hosts line in /etc/nsswitch.conf. In HP-UX 11.23 and newer, many/most basic tools are IPv6 aware and will use the ipnodes line instead, even for IPv4.

 

That, I was unaware of.  Thank you very much for the info.  Getting that updated will ensure that, the next time this happens, it won't be as much of an impact... I told them that the right answer was to stop messing with their DNS servers; however, that didn't go over very well... :)

 

Thanks again for the info.

 

Doug O'Leary

