[unisog] DNS over TCP should we block
SYSJHY at langate.gsu.edu
Fri Jan 7 19:09:10 GMT 2005
This has been a most useful and informative thread. I
want to apologize in advance for my somewhat delayed
(and what some will perceive as a MUCH too lengthy)
reply. But I do hope that some of you will find this
information very useful.
>>> cgaylord at vt.edu 2005-01-05 15:59:17 >>>
> Leigh Heyman wrote:
> Bottom line: if you accept UDP53 from a host, you need to accept
> TCP53 from that host as well.
> Golden Rule of Firewall Configuration: Try not to be stupid.
I couldn't agree more with Leigh's statements! Here's one link
to the relevant section of rfc1123 that talks about TCP
and UDP support...
Although the RFC doesn't make support for TCP53
queries a MUST, it does make it a *SHOULD*!
For those who continue to believe support for TCP53
is NOT necessary read on...
Here's a link to a message thread that ultimately
put us on track to determining the real cause of a
long-running intermittent problem with one of our name servers.
Note: In the message cited above, the author stated
UDP53 replies of 'greater than 1024' when I believe he
meant to say 'greater than 512' but it really doesn't
change the general problem.
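The truncation mechanics are worth spelling out: when a UDP response would exceed the classic 512-byte payload limit, the server sets the TC (truncation) bit in the DNS header, and the resolver is expected to retry the query over TCP. A minimal sketch of the TC-bit check (the header bytes below are fabricated for illustration, per RFC 1035 section 4.1.1):

```python
import struct

def is_truncated(dns_message: bytes) -> bool:
    """Return True if the TC (truncation) bit is set in a DNS header.

    The DNS header is 12 bytes; the 16-bit flags field occupies
    bytes 2-3, and TC is bit 0x0200 (RFC 1035 section 4.1.1).
    """
    (flags,) = struct.unpack("!H", dns_message[2:4])
    return bool(flags & 0x0200)

# Fabricated 12-byte headers, for illustration only:
# QR=1 (response) with TC=1 -> flags 0x8200
truncated_hdr = struct.pack("!HHHHHH", 0x1234, 0x8200, 1, 0, 0, 0)
# QR=1 with TC=0 -> flags 0x8000
complete_hdr = struct.pack("!HHHHHH", 0x1234, 0x8000, 1, 0, 0, 0)

print(is_truncated(truncated_hdr))  # True -> resolver retries over TCP53
print(is_truncated(complete_hdr))   # False -> UDP answer is complete
```

A resolver that sees True here is exactly the one that needs TCP53 open.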
With our specific name server problem, one of our caching
servers (a BIND v8-based named variant) occasionally
reported that it had run out of file handles. The particular
log messages have the text "Too many open files":
> Nov 29 14:55:21 myhost named: [ID 295310 daemon.warning]
db_load could not open: db.cache: Too many open files
We determined that in MOST of these instances named
silently recovered from this "Too many open files" problem.
(What damage was really being done is not known).
But occasionally the "Too many open files" problem
happened when named attempted to access a particularly
critical file. In these rare cases, named was rendered
useless for any further inbound (end user) queries.
In hindsight it appears that we have experienced this
critical problem perhaps two or maybe three times
over the last year or so. Unfortunately we probably
blamed the named problem on other causes.
But on 2004-12-29 at 14:55 we had another critical named
incident. After a few days of chasing some false leads
we FINALLY felt we could place the blame directly on the
exhaustion of named's file handles.
After this most recent incident we set up a simple script
to monitor named's file handle count, e.g.:
ls /proc/$NAMED_PID/fd | wc -w
This script normally reports that the named process has
15 open files, but it will periodically report 17 open
files when our expected zone transfers occur. The
ulimit on our named process allows for a maximum of
1024 file handles. We have noticed that several times
a day (between 6 and 12) this count increases
by 20 to 30 handles at a time (for example,
from 15 to 35, 75, or 90). After 3 minutes or so the
count generally settles back down to 15. Once or
twice a day the count jumps much higher, to perhaps
250 or more. We sometimes go for several days
without any high-count events (i.e. greater than 250).
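A fleshed-out version of that monitoring one-liner might look like the sketch below (the thresholds are the baselines described above; the /proc layout assumes Linux, and the classification labels are my own):

```python
import os

def open_fd_count(pid: int) -> int:
    """Count open file descriptors for a process via /proc (Linux)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

def check_named(pid: int, baseline: int = 15, alarm: int = 250) -> str:
    """Classify the fd count against the baselines described above."""
    n = open_fd_count(pid)
    if n >= alarm:
        return "ALARM"        # the 250+ spikes worth paging someone over
    if n > baseline + 2:      # zone transfers account for ~2 extra handles
        return "elevated"
    return "ok"

# Demonstrate against our own process, since named's PID varies per host.
print(check_named(os.getpid()))
```

Run from cron every minute or so, a script like this gives early warning well before the 1024-handle ulimit is reached.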
By running "lsof" against the named process during one of
these high file handle count events we determined that our
named had a bunch of file handles stuck in the "SYN_SENT"
state (remember, on un*x systems TCP sessions are
counted as file handles).
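Tallying the stuck sockets out of lsof output is easily scripted; a sketch, with sample lines fabricated to mirror what we saw (the addresses are placeholders):

```python
# Fabricated lsof output for illustration -- addresses are placeholders.
SAMPLE_LSOF = """\
named  1234  named  20u  IPv4  TCP 10.0.0.5:34567->192.0.2.10:53 (SYN_SENT)
named  1234  named  21u  IPv4  TCP 10.0.0.5:34568->192.0.2.11:53 (SYN_SENT)
named  1234  named  22u  IPv4  TCP 10.0.0.5:34569->192.0.2.12:53 (ESTABLISHED)
"""

def count_syn_sent(lsof_output: str) -> int:
    """Count sockets stuck mid-handshake (SYN_SENT) in lsof output."""
    return sum("(SYN_SENT)" in line for line in lsof_output.splitlines())

print(count_syn_sent(SAMPLE_LSOF))  # 2
```

In real use you would feed it `lsof -p $NAMED_PID` output instead of the sample text.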
It turns out that our name server was repeatedly
attempting to set up TCP53 session(s) to some authoritative
third-party name servers. Our first question was why
would we need to open a TCP session at all? Answer: an
earlier UDP53 MX request had resulted in a truncated
UDP53 MX response. Consequently our named would
initiate a TCP53 request to get the desired (long) MX record.
It appears that our named and/or the underlying TCP/IP
stack repeatedly retries establishing the TCP53 session.
Here is where the story gets interesting (well to me at least)...
If the remote site has a firewall or acl that REJECTS our
TCP53 SYN requests, then we'll (usually) see ICMP error
messages in response to the initial TCP53 SYNs. Each
of these ICMP error replies appears to cause one of the
orphaned SYN_SENT file handles to close.
But if the remote site simply "black-holes" our TCP53 request,
then the SYN_SENT state persists for a much longer period
of time. We'll see the list of orphaned file handles quickly
grow when a site silently discards our TCP53 requests.
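The REJECT-versus-black-hole difference can be felt with a plain TCP connect. A hedged sketch: only the "refused" case is demonstrated (against a deliberately closed local port), since reproducing a black hole needs an unreachable host; the black-hole case is described in the comments.

```python
import socket

def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify how a TCP connect attempt fares.

    'refused' ~ the REJECT case: an RST or ICMP error comes back
                quickly, so the kernel tears the socket down at once.
    'timeout' ~ the black-hole case: nothing comes back, and the
                socket sits in SYN_SENT until the connect timer expires.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"    # unreachable network, etc.

# Find a port that is almost certainly closed on localhost by binding
# and releasing it, then probe it: the connect is refused immediately.
s = socket.socket()
s.bind(("127.0.0.1", 0))
closed_port = s.getsockname()[1]
s.close()
print(probe("127.0.0.1", closed_port))  # refused
```

The "timeout" branch is what piles up SYN_SENT handles: each attempt holds a file descriptor for the full connect timeout instead of milliseconds.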
There is a multiplier effect here that could be as high as
1 to 20. In most cases it appears that these events are
triggered when a mail server attempts to resolve
some host. (Adding insult to injury, this lookup is generally
for creating a "return-to-sender" message.) Our mail server
will make the UDP53 MX request to our name server. If and
when the desired MX info is NOT cached, our name server
will attempt to fetch the desired information. So one end
user request for a (long) MX record could result in perhaps
20 or more unresolved TCP53 sessions (i.e. SYN_SENT) on
our name server.
By watching our name server's DNS and ICMP traffic
we've determined that it only takes a few dozen inbound
UDP53 MX requests to put our named in jeopardy.
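The "few dozen" figure follows directly from the numbers above. A back-of-the-envelope sketch (the exact figures are the rough ones reported earlier in this message):

```python
ULIMIT = 1024      # named's file handle limit (ulimit)
BASELINE = 15      # handles named holds in steady state
PER_REQUEST = 20   # orphaned SYN_SENT handles per uncached long-MX lookup

# How many bad lookups until named hits "Too many open files"?
requests_to_exhaustion = (ULIMIT - BASELINE) // PER_REQUEST
print(requests_to_exhaustion)  # 50
```

In other words, roughly fifty uncached MX lookups toward a black-holing domain, inside one connect-timeout window, can take the server down.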
Because the site below was apparently "black-holing"
our TCP53 requests, on December 2nd we reluctantly
put in place acls on our border routers to REJECT
outbound TCP53 requests to the two name servers
for the domain outstandinginternet.com:
By putting the acls on our router, we at least got some
ICMP error responses sent back to our named(s) that
helped clear out the orphaned SYN_SENT handles.
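For illustration, the kind of acl we mean looks roughly like this on a Cisco-style border router (the addresses are placeholders, NOT the real servers; on IOS a "deny" normally returns an ICMP administratively-prohibited error unless unreachables have been suppressed on the interface, which is exactly the reply our named needs):

```
! Placeholder addresses -- illustrative only
access-list 110 deny   tcp any host 192.0.2.10 eq domain
access-list 110 deny   tcp any host 192.0.2.11 eq domain
access-list 110 permit ip any any
```
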
But this is not the only remote site that has caused us problems.
We now have some processes in place to alert us when
named's file handle count gets too high. Although we
haven't actually run out of file handles since December 2nd,
we have seen the file handle count get as high as 700.
Until yesterday (January 6th) most of our problems were
caused by the domain crazyconsumerdeals.net:
But we have NOT (yet) blocked this particular site because
the majority of our outbound TCP53 requests are replied
to with an inbound ICMP error response. But occasionally
the site fails to reliably respond to our TCP53 requests
with ICMP error replies (perhaps when their router gets
too overloaded). When this happens our file handle
count starts to go up.
Until yesterday virtually ALL of these incidents could be
traced to truncated "MX" record replies. But then yesterday
morning, between 10:36:46 and 10:39:39, our named's
file handle count rose from 15 to a high of 332. By
10:46:32 it was back to 15.
Reviewing our logs showed that this event was caused by
orphaned SYN_SENT requests to name servers defined
in the domain cisco.com:
All of the cisco.com servers listed above appeared to
silently discard (black-hole) our named's TCP53 requests.
This whole event could be traced to one single PC that
attempted the same DNS query several times. In this
case the event was initiated by the following "SRV" query:
Although I'm no expert on the matter, this looks very much
like a request for some type of MS Active Directory type
of resource object. Unfortunately I now expect things
will get worse before they get better. :-(
I'm hopeful that we'll soon have our named(s) running on
a v9 BIND variant, which will at least then give us the
option of being able to advertise support for larger
UDP DNS responses (i.e. to use the EDNS0
"OPT udpsize=4096" feature mentioned in an earlier
reply to this thread).
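For the curious, on BIND 9 that advertisement is a one-line option; a sketch of the relevant named.conf stanza (4096 is just the value mentioned above):

```
options {
    // Advertise willingness to receive UDP responses up to 4096
    // bytes via the EDNS0 OPT pseudo-record.
    edns-udp-size 4096;
};
```
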
I truly hope some of you find this information useful.
Perhaps some will even reconsider their decision to block
DNS over TCP.
Senior Network Engineer
University Computing & Communications Services
Georgia State University