[unisog] DNS over TCP should we block

Jim Young SYSJHY at langate.gsu.edu
Fri Jan 7 19:09:10 GMT 2005


This has been a most useful and informative thread.  I 
want to apologize in advance for my somewhat delayed 
(and what some will perceive as a MUCH too lengthy) 
reply.   But I do hope that some of you will find this
information very useful.

>>> cgaylord at vt.edu 2005-01-05 15:59:17 >>>
> Leigh Heyman wrote:
> [snip]
> Bottom line: if you accept UDP53 from a host, you need to accept
> TCP53 from it as well.
> Golden Rule of Firewall Configuration: Try not to be stupid.

I couldn't agree more with Leigh's statements!  Here's one link 
to the relevant section of RFC 1123 that talks about TCP 
and UDP support...


Although the RFC doesn't make support for TCP53
queries a MUST, it does make it a *SHOULD*!

For those who continue to believe support for TCP53
is NOT necessary, read on...

Here's a link to a message thread that ultimately 
put us on track to determining the real cause of a 
long-running intermittent problem with one of our 
DNS servers...


Note: In the message cited above, the author stated 
UDP53 replies of 'greater than 1024' when I believe he 
meant to say 'greater than 512', but it really doesn't 
change the general problem.

With our specific name server problem, one of our caching 
servers (a v8 BIND-based named variant) occasionally
reported that it had run out of file handles.   The particular 
log messages contain the text "Too many open files":

> Nov 29 14:55:21 myhost named[24345]: [ID 295310 daemon.warning]
db_load could not open: db.cache: Too many open files

We determined that in MOST of these instances named 
silently recovered from this "Too many open files" problem.  
(What damage was actually being done is not known.)   

But occasionally the "Too many open files"  problem 
happened when named attempted to access a particularly 
critical file.  In these rare cases, named was rendered 
useless for any further inbound (end user) queries.  

In hindsight it appears that we have experienced this 
critical problem perhaps two or three times 
over the last year or so.  Unfortunately we probably
blamed the named problem on other causes.  

But on 2004-12-29 at 14:55 we had another critical named 
incident.  After a few days of chasing some false leads 
we FINALLY felt that we could place the blame directly on 
the exhaustion of named's file handles.

After this most recent incident we set up a simple scripted 
process to monitor named's file handle count, e.g.:

    ls /proc/$NAMED_PID/fd | wc -w

This script normally reports that the named process has 
15 open files, but it will periodically report 17 open 
files when our expected zone transfers occur.  The 
ulimit on our named process allows for a maximum of 
1024 file handles.   We have noticed that several times 
a day (between 6 and 12) this count increases 
by 20 to 30 handles at a time (for example, 
from 15 to 35, 75, or 90).  After 3 minutes or so the 
count generally settles back down to 15.  Once or 
twice a day the count jumps much higher, to perhaps 
250 or more.  We sometimes go for several days 
without any high-count events (i.e. greater than 250).
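The one-liner above can be wrapped in a small watcher.  Here's a 
minimal sketch (the 250 threshold reflects our own event sizes, 
and the function simply counts entries in a /proc fd directory -- 
names like check_named are mine, not from any tool):

```python
import os

def open_handle_count(fd_dir):
    """Count entries in a /proc/<pid>/fd directory -- the same number
    that `ls /proc/$NAMED_PID/fd | wc -w` reports."""
    return len(os.listdir(fd_dir))

def check_named(named_pid, threshold=250):
    """Return named's current handle count, warning when it looks abnormal
    (our baseline was 15, with a ulimit of 1024)."""
    count = open_handle_count("/proc/%d/fd" % named_pid)
    if count > threshold:
        print("WARNING: named pid %d has %d open handles" % (named_pid, count))
    return count
```

Run from cron every minute or so, this is enough to catch the 
slow climb toward the ulimit before named starts failing.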

By running "lsof" against the named process during one of 
these high file handle count events we determined that our 
named had a bunch of file handles stuck in the "SYN_SENT" 
state (remember, on un*x systems TCP sessions are 
counted as file handles).
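On Linux the same count can be pulled without lsof by reading 
/proc/net/tcp, where the state column value 02 means SYN_SENT.  
A minimal sketch (the field layout is the standard /proc/net/tcp 
format; the function name is mine):

```python
def count_syn_sent(proc_tcp_lines):
    """Count TCP sockets in SYN_SENT (state 0x02) from /proc/net/tcp
    lines -- the same sockets lsof shows as SYN_SENT."""
    count = 0
    for line in proc_tcp_lines[1:]:     # first line is the column header
        fields = line.split()
        if len(fields) > 3 and fields[3] == "02":
            count += 1
    return count

if __name__ == "__main__":
    try:
        with open("/proc/net/tcp") as f:
            print(count_syn_sent(f.readlines()))
    except OSError:
        pass  # /proc is Linux-only
```

A sudden jump in this number, with the remote port 53, is exactly 
the signature of the black-holed TCP53 retries described below.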

It turns out that our name server was repeatedly 
attempting to setup TCP53 session(s) to some authoritative
third party name servers.  Our first question was why 
would we need to open a TCP session at all?   Answer: An 
earlier UDP53 MX request had resulted in a truncated 
UDP53 MX response.  Consequently our named would 
initiate a TCP53 request to get the desired (long) MX record.  
It appears that our named and/or the underlying tcp/ip 
stack attempts to retry establishing the TCP53 session 
multiple times.
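For the curious, that fallback decision is driven by a single 
header bit.  A minimal sketch of the check against raw wire-format 
bytes (this is my own illustration, not any particular resolver's 
code):

```python
def is_truncated(dns_response):
    """True if the TC (truncation) bit is set in a raw DNS response.

    The flags live in bytes 2-3 of the 12-byte DNS header; TC is bit
    0x0200 of the 16-bit flags field, i.e. bit 0x02 of byte 2.  A
    resolver that sees TC=1 on a UDP answer is expected to retry the
    same query over TCP port 53 -- which is why blocking TCP53 breaks
    long (e.g. MX) answers.
    """
    return len(dns_response) >= 4 and bool(dns_response[2] & 0x02)
```

When a reply for a long MX record set won't fit in the traditional 
512-byte UDP payload, the server sets this bit, and the TCP53 retry 
follows automatically.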

Here is where the story gets interesting (well to me at least)...

If the remote site has a firewall or acl that REJECTS our
TCP53 SYN requests, then we'll (usually) see ICMP error 
messages in response to the initial TCP53 SYNs.  Each 
of these ICMP error replies appears to cause one of the 
orphaned SYN_SENT file handles to close.  

But if the remote site simply "black-holes" our TCP53 request, 
then the SYN_SENT state persists for a much longer period 
of time.  We'll see the list of orphaned file handles quickly 
grow when a site silently discards our TCP53 requests.
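The difference between REJECT and black-hole behavior is easy to 
demonstrate with a plain TCP connect standing in for named's TCP53 
attempt (a sketch of mine, not named's actual retry logic):

```python
import socket
import time

def probe(host, port, timeout=3.0):
    """Attempt a TCP connect and report how it ended.

    A REJECTing firewall (or a closed port) answers the SYN at once,
    so the handle is released immediately; a black-holing firewall
    never answers, so the socket sits in SYN_SENT until the timeout --
    which is what piled up named's file handles.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    start = time.time()
    try:
        s.connect((host, port))
        return "connected", time.time() - start
    except socket.timeout:
        return "black-holed (timeout)", time.time() - start
    except OSError:
        return "rejected", time.time() - start
    finally:
        s.close()
```

Against a REJECTing site the call returns in milliseconds; against 
a black-holing site it holds the handle for the full timeout, and 
named retries on top of that.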

There is a multiplier effect here that could be as high as 
1 to 20.  It appears that these events are
usually triggered when a mail-server attempts to resolve 
some host.   (Adding insult to injury, this lookup is generally 
for creating a "return-to-sender" message.)  Our mail-server 
will make the UDP53 MX request to our name server.  If and 
when the desired MX info is NOT cached, our name server
will attempt to fetch the desired information.  So one end
user request for a (long) MX record could result in perhaps 
20 or more unresolved TCP53 sessions (i.e. SYN_SENT) on 
our name server.

By watching our named servers's DNS and ICMP traffic 
we've determined that it only takes a few dozen inbound 
UDP53 MX requests to put our named in jeopardy.

Because the site below was apparently "black-holing" 
our TCP53 requests, on December 2nd we reluctantly 
put in place acls on our border routers to REJECT 
outbound TCP53 requests to the two name servers 
for the domain outstandinginternet.com.

By putting the acls on our router, we at least got some 
ICMP error responses sent back to our named(s) that 
helped clear out the orphaned SYN_SENT handles.
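For illustration, an IOS-style acl along these lines would do it 
(the addresses and interface name here are invented, not the real 
servers):

```
! Hypothetical IOS-style example -- addresses and interface invented.
! On IOS a "deny" returns an ICMP administratively-prohibited message
! by default (unless "no ip unreachables" is configured), and it is
! that ICMP reply that lets named clear its SYN_SENT handles quickly.
access-list 110 deny   tcp any host 192.0.2.10 eq 53
access-list 110 deny   tcp any host 192.0.2.11 eq 53
access-list 110 permit ip any any
!
interface Serial0/0
 ip access-group 110 out
```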

But this is not the only remote site that has caused 
us this problem.
We now have some processes in place to alert us when 
named's file handle count gets too high.   Although we 
haven't actually run out of file handles since December 2nd, 
we have seen the file handle count get as high as 700.  

Until yesterday (January 6th) most of our problems were 
caused by the domain crazyconsumerdeals.net.

But we have NOT (yet) blocked this particular site because 
the majority of our outbound TCP53 requests are replied 
to with an inbound ICMP error response.  But occasionally 
the site fails to reliably respond to our TCP53 requests 
with ICMP error replies (perhaps when their router gets 
too overloaded).   When this happens our file handle 
count starts to go up.

Until yesterday virtually ALL of these incidents could be 
traced to truncated "MX" record replies.  But then yesterday 
morning, between 10:36:46 and 10:39:39, our named's 
file handle count rose from 15 to a high of 332.  By 
10:46:32 it was back to 15.  

Reviewing our logs showed that this event was caused by 
orphaned SYN_SENT requests to name servers defined
in the domain cisco.com:

    adc-rch1-c2-1-w.cisco.com
    adc-rch1-c2-2-w.cisco.com
    adc-rtp1-c2-1-w.cisco.com
    adc-rtp1-c2-2-w.cisco.com
    adc-sin1-c2-1-w.cisco.com
    adc-sjc1-c2-1-w.cisco.com

All of the cisco.com servers listed above appeared to  
silently discard (black-hole) our named's TCP53 requests.  
This whole event could be traced to one single PC that 
attempted the same DNS query several times.   In this 
case the event was initiated by the following "SRV" 
record request:


Although I'm no expert on the matter, this looks very much 
like a request for some type of MS Active Directory type 
of resource object.   Unfortunately I now expect things 
will get worse before they get better. :-(

I'm hopeful that we'll soon have our named(s) running on
a v9 BIND variant, which will at least give us the 
option of being able to advertise support for larger 
UDP DNS responses (i.e. to use the EDNS0 
"OPT udpsize=4096" feature mentioned in an earlier 
reply to this thread).
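For reference, in BIND 9 that advertisement is a one-line option.  
A sketch of the named.conf fragment (option name per BIND 9; the 
4096 value is the one mentioned in the thread):

```
// BIND 9 named.conf sketch -- check your version's documentation.
options {
    // Advertise a larger EDNS0 receive buffer so long answers can
    // still arrive over UDP, reducing TC-bit fallbacks to TCP53.
    edns-udp-size 4096;
};
```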

I truly hope some of you find this information useful.
Perhaps some will even reconsider their decision to block
DNS over TCP.

Best regards,

Jim Young
Senior Network Engineer
University Computing & Communications Services
Georgia State University
Atlanta, GA 
