[Dshield] New email spam

Kenneth Coney superc at visuallink.com
Wed Dec 24 12:06:12 GMT 2003


So what you are saying is email containing HTML or other coding should 
simply be refused.  That would end 90% of the Spam.  Then a simple 
dictionary filter on the Subject line would eliminate any messages with 
padding or coding.  I like it.  Back to .txt only we go.



Subject: RE: [Dshield] New email spam
From: "Coxe, John B." <JOHN.B.COXE at saic.com>
Date: Mon, 22 Dec 2003 08:40:30 -0800
To: "'General DShield Discussion List'" <list at dshield.org>

This serves various purposes.  The most important one to them is that each
message has a unique subject.  So those writing filters for the most
prevalent, by count, subjects entering their MTAs will miss them as their
entire campaign consists of lots of messages, each with a subject count of
one.

If you want to see broken spam programs, note the subjects that come in
literally with "%RND_UC_CHAR[2-8]" or "%RANDOM_WORD".  Pretty easy to filter
those.

The hardest subjects to filter are those utilizing character encoding in the
subject line.  Quoted-printable is easy enough.  However, base64
encoding requires a decoder as part of the filter.  See RFC 2047 for the
encoded-word syntax used in headers and RFC 1345 for character set names.
The most prevalent character set used is ISO-8859-1.  In fact, it accounts
for practically all of the encoded spam.  Funny that the spammers haven't
jumped over to use "latin1", which is exactly the same (just an alias for
iso-8859-1), to bypass folks who put in a general iso-8859-1 filter.  There
is sure to be a lot of growth in this area.  Base64 takes every three
characters and transforms them to four other characters.  The encoded
subject is completely unreadable.  But it displays in MS Outlook and will
render as the decoded form when forwarded.
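
As a rough illustration of the inline decoding step, here is a minimal
sketch using Python's standard email.header module.  The function names
decode_subject and same_charset are made up for this example, and the
second sample subject is invented; only the base64 string comes from this
message.  It also shows why a filter keyed on the literal text
"iso-8859-1" would miss the "latin1" alias.

```python
import codecs
from email.header import decode_header


def decode_subject(raw_subject):
    """Decode any RFC 2047 encoded-words in a Subject header to plain text."""
    parts = []
    for value, charset in decode_header(raw_subject):
        if isinstance(value, bytes):
            try:
                parts.append(value.decode(charset or "ascii", errors="replace"))
            except LookupError:
                # Unknown charset label: latin-1 maps every byte, so it cannot fail.
                parts.append(value.decode("latin-1"))
        else:
            # Headers without encoded-words come back as plain str.
            parts.append(value)
    return "".join(parts)


def same_charset(label_a, label_b):
    """True when two charset labels resolve to the same Python codec."""
    return codecs.lookup(label_a).name == codecs.lookup(label_b).name


b64_subject = "=?iso-8859-1?B?V2FudCBhIEJJR0dFUiBQRU5JUz8=?="
qp_subject = "=?latin1?Q?Cheap_meds=21?="  # invented quoted-printable sample

print(decode_subject(b64_subject))
print(decode_subject(qp_subject))
print(same_charset("latin1", "ISO-8859-1"))
```

Running the subject through something like decode_subject before any
keyword or dictionary filter means the filter sees the same text the
recipient would.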

An example might be a Subject like "V2FudCBhIEJJR0dFUiBQRU5JUz8=", which
decoded has the "P" word in it.  (See, for example,
http://makcoder.sourceforge.net/demo/base64.php to decode this or your own
subjects.)  One can defend against this without an inline decoder to some
extent, by filtering on the encoding for " PE" followed by "NIS" (IFBFTklT)
and "PEN" followed by "IS?" (UEVOSVM/), to take advantage of two of three
offsets.  Even then, it only gets this particular uppercase case and with a
question mark after it or a space before it.  The third offset would be "ENI"
followed by "S?" (RU5JUz8=), which does get this one.  As long as you are
comfortable with taking the chance that no real mail will come with a
subject ending with uppercase "ENIS?", it all is fine.  (I cannot think of
any such words.  However, "PENI" will also match "PENINSULA" -- something to
be careful about.)  Anyway, spammers will mix case, change punctuation, push
a star or dash or space between each pair of letters, like "P*E*N*I*S", put
a grave accent, acute accent, or umlaut over the "E", etc ... all of the
tricks they use in
unencoded subjects.  The bottom line is that the only defense is to run
inline decoding prior to any filtering to be effective against this.
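
The three-offset trick can be sketched in a few lines of Python.  The
helper names b64_fragments and encoded_keyword_hit are invented for this
illustration.  Unlike the hand-built patterns above, this version keeps
only the base64 characters that are fully determined by the keyword
itself, so it does not depend on a neighbouring space or question mark.
It still shares the weakness already described: any change of case,
spacing, or accent produces entirely different fragments.

```python
import base64


def b64_fragments(keyword, charset="iso-8859-1"):
    """Base64 substrings that reveal `keyword` at each of the 3 byte alignments."""
    data = keyword.encode(charset)
    fragments = []
    for offset in range(3):
        # Zero bytes in front shift the keyword to this alignment.
        padded = b"\x00" * offset + data
        encoded = base64.b64encode(padded).decode("ascii")
        # Keep only the base64 characters whose 6 bits lie wholly inside the keyword.
        first = -(-8 * offset // 6)             # ceil(8 * offset / 6)
        last = 8 * (offset + len(data)) // 6    # floor(8 * end / 6)
        fragments.append(encoded[first:last])
    return fragments


def encoded_keyword_hit(b64_text, keyword):
    """True if any alignment of `keyword` shows up in the raw base64 text."""
    return any(fragment in b64_text for fragment in b64_fragments(keyword))


subject = "V2FudCBhIEJJR0dFUiBQRU5JUz8="
print(b64_fragments("PENIS"))                 # ['UEVOSV', 'BFTklT', 'QRU5JU']
print(encoded_keyword_hit(subject, "PENIS"))  # True
print(encoded_keyword_hit(subject, "penis"))  # False: lowercase encodes differently
```

The last line is the point: the undecoded match is case-sensitive and
brittle, which is why decoding first is the more robust approach.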

The same goes for HTML in the message body.  It should be rendered.
Spammers are obfuscating the content by adding nonsense tags, comments, tag
pairs, or font mods on every letter or two of commonly filtered words and
expressions to bypass filters.  Also, hidden text is set to the background
color, or to a color one or two bits off from it, which is visually
equivalent to invisible.  A pseudo-rendering needs to be done to effectively
strip out the hidden content.  However, detecting these techniques in the
mail is a pretty solid indicator that it is spam in the first place.
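
A very rough sketch of such a pseudo-rendering, using Python's standard
html.parser, might look like the following.  Everything here is an
assumption made for illustration: the class name PseudoRenderer, the
tolerance of 8 per color channel, the white default background, and the
restriction to <font color> and <body bgcolor> with #rrggbb values.  Real
mail would also need CSS, named colors, and tables handled.

```python
from html.parser import HTMLParser


def parse_color(value):
    """Parse '#rrggbb' into an (r, g, b) tuple; return None for anything else."""
    value = (value or "").strip().lstrip("#")
    if len(value) != 6:
        return None
    try:
        return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))
    except ValueError:
        return None


class PseudoRenderer(HTMLParser):
    """Collects only the text a reader would plausibly see."""

    def __init__(self, tolerance=8):
        super().__init__(convert_charrefs=True)
        self.tolerance = tolerance
        self.background = (255, 255, 255)  # assumed default page color
        self.font_colors = []              # stack of open <font> colors
        self.visible = []
        self.hidden_chars = 0              # spam signal: near-invisible text
        self.comments = 0                  # spam signal: comments used as filler

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body":
            self.background = parse_color(attrs.get("bgcolor")) or self.background
        elif tag == "font":
            self.font_colors.append(parse_color(attrs.get("color")))

    def handle_endtag(self, tag):
        if tag == "font" and self.font_colors:
            self.font_colors.pop()

    def handle_comment(self, data):
        self.comments += 1

    def _is_hidden(self):
        # Innermost <font> that actually set a color wins.
        color = next((c for c in reversed(self.font_colors) if c is not None), None)
        if color is None:
            return False
        return all(abs(a - b) <= self.tolerance
                   for a, b in zip(color, self.background))

    def handle_data(self, data):
        if self._is_hidden():
            self.hidden_chars += len(data)
        else:
            self.visible.append(data)

    def text(self):
        return "".join(self.visible)


def pseudo_render(html):
    renderer = PseudoRenderer()
    renderer.feed(html)
    renderer.close()
    return renderer


sample = ('<body bgcolor="#ffffff">V<!-- x -->i<b></b>ag'
          '<font color="#fffffe">zebra lamp</font>ra</body>')
result = pseudo_render(sample)
print(result.text(), result.hidden_chars, result.comments)
```

The visible text goes to the normal keyword filter, while the counters for
hidden characters and comments can be scored as spam indicators on their
own.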

It is ugly out there and the spammers are doing anything they can come up
with to ram their spam past defenses.  One thing I am surprised they have
not done (apparently) yet is exploit custom whitelists.  Suppose they simply
autocrawled the target domain's public website, parsed out all of the words
from all of the pages, discarded dictionary words and then used words
appearing at least a few times as an effective corporate lexicon for the
target domain.  Then they could simply insert these randomly in the spam
subjects in hopes that they would escape filtering through gateway
whitelists used to mitigate false positive impacts.  That could easily
become a feature for
inclusion in top spam list databases.  For the lists compiled from trawling
through usenet (etc), such a lexicon could be created using the content of
the post, etc.
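
From the defender's side, the same steps can be turned into a quick audit
of which whitelist terms are guessable from public pages.  This is only a
sketch under stated assumptions: the function names, the sample pages, the
tiny dictionary, and the threshold of three occurrences (standing in for
"at least a few times") are all invented for illustration, and fetching
the pages is left out.

```python
import re
from collections import Counter


def public_lexicon(pages, dictionary, min_count=3):
    """Non-dictionary words seen at least `min_count` times across the pages."""
    counts = Counter(
        word
        for page in pages
        for word in re.findall(r"[a-z][a-z0-9-]+", page.lower())
    )
    return {word for word, n in counts.items()
            if n >= min_count and word not in dictionary}


def exposed_whitelist_terms(whitelist, pages, dictionary, min_count=3):
    """Whitelist entries an outsider could harvest from the public pages."""
    lexicon = public_lexicon(pages, dictionary, min_count)
    return sorted(term for term in whitelist if term.lower() in lexicon)


# Hypothetical page text and word list, for illustration only.
pages = [
    "Acmetron widgets ship with Acmetron support.",
    "Contact Acmetron sales about widgets.",
    "Our widgets are great.",
]
dictionary = {"widgets", "ship", "with", "support", "contact",
              "sales", "about", "our", "are", "great"}
whitelist = ["Acmetron", "Zyqqor"]

print(exposed_whitelist_terms(whitelist, pages, dictionary))  # ['Acmetron']
```

Terms that show up in this list are poor candidates for an unconditional
whitelist bypass; terms used only internally are safer.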




