Coxe, John B. JOHN.B.COXE at saic.com
Wed Dec 24 16:22:33 GMT 2003

I don't want to carry this thread too far, as Johannes did point out that
these tangents are outside the subject area (security) of the list.  But I
will briefly address this.  I was not advocating treating all incoming HTML
email as spam.  Yes, an overwhelming fraction of spam is HTML-encoded.
However, many folks set their default email format (for G-d knows what
reason) to HTML, and a lot of notification messages come in HTML format.
You might be 90% effective with an HTML filter, but the false-positive rate
would be unacceptable.  The point was that the obfuscation spammers employ
within the HTML is itself a spam signature.  For example:

<body bgcolor=x000000 text=xffffff>
<H1>Want a BIGGER P<!394838>&#x0114;<b></b>N<gwb>I</gwb><font color=x000100
size=1>N</font>S<font color=x020000 size=1>ULA</font>?

Something like this can get by a lot of filters.  But the very fact that it
is trying to hide text and break up a key word is a signature that can tag
it as spam, if the filter is smart enough.
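
As a rough illustration of treating the word-splitting itself as the
signature, here is a minimal Python sketch.  It counts runs of tags or
comments wedged directly between two word characters.  The function names
and the threshold of 3 are my own assumptions for illustration, not taken
from any particular filter, and legitimate markup such as "foo<br>bar" will
also match, which is why a threshold is used rather than a single hit.

```python
import html
import re

# One or more adjacent tags/comments sitting directly between two word
# characters, e.g. the "<gwb>" in "N<gwb>I".
_SPLIT_RUN = re.compile(r"(?<=\w)(?:<[^>]*>)+(?=\w)")


def count_word_splits(raw_html):
    """Count tag/comment runs that break up what would render as one word."""
    # Decode character references first so "&#x0114;" counts as a letter.
    # Caveat: this also turns "&lt;" into "<", which can inflate the count.
    text = html.unescape(raw_html)
    return len(_SPLIT_RUN.findall(text))


def looks_obfuscated(raw_html, threshold=3):
    """Hypothetical rule: flag the message once enough words are split."""
    return count_word_splits(raw_html) >= threshold


# The sample quoted above, kept verbatim.
SAMPLE = (
    "<body bgcolor=x000000 text=xffffff>\n"
    "<H1>Want a BIGGER P<!394838>&#x0114;<b></b>N<gwb>I</gwb>"
    "<font color=x000100\nsize=1>N</font>S"
    "<font color=x020000 size=1>ULA</font>?"
)

BENIGN = "<p>Hello <b>world</b>, see you soon.</p>"

if __name__ == "__main__":
    print(count_word_splits(SAMPLE), looks_obfuscated(SAMPLE))
    print(count_word_splits(BENIGN), looks_obfuscated(BENIGN))
```

In a real filter this count would presumably feed a score alongside other
tests rather than decide on its own.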

The same goes for HTML in the message body.  It should be rendered.
Spammers obfuscate the content by adding nonsense tags, comments, tag pairs,
or font changes on every letter or two of commonly filtered words and
expressions to bypass filters.  Hidden text is also given a color equal to
the background color, or one or two bits off from it, so that it is visually
equivalent to the background, i.e. invisible.  A pseudo-rendering needs to be
done to effectively cancel out this added content.  However, detecting the
presence of these techniques in a message is a pretty solid indicator that
it is spam in the first place.
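
To make the pseudo-rendering idea concrete, here is a minimal sketch using
Python's standard html.parser.  It drops tags and comments, tracks the body
background and the current font color, and discards text whose color is
within a small tolerance of the background.  The class name, the tolerance
of 4 per channel, the default colors and the sample string are all
assumptions for illustration; it handles only <body> and <font> attributes
and ignores CSS entirely.

```python
import re
from html.parser import HTMLParser

_HEX_COLOR = re.compile(r"[#xX]?([0-9a-fA-F]{6})")


def parse_color(value):
    """Parse 'x000100', '#000100' or '000100' into (r, g, b), else None."""
    match = _HEX_COLOR.fullmatch(value.strip()) if value else None
    if match is None:
        return None
    n = int(match.group(1), 16)
    return (n >> 16) & 0xFF, (n >> 8) & 0xFF, n & 0xFF


class PseudoRenderer(HTMLParser):
    """Collect the text a reader would see; count the text they would not."""

    def __init__(self, tolerance=4):
        super().__init__(convert_charrefs=True)
        self.tolerance = tolerance
        self.background = (255, 255, 255)  # assumed default page background
        self.default_text = (0, 0, 0)      # assumed default text color
        self.color_stack = []              # one entry per open <font>
        self.visible = []
        self.hidden_chars = 0

    def current_color(self):
        return self.color_stack[-1] if self.color_stack else self.default_text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body":
            self.background = parse_color(attrs.get("bgcolor")) or self.background
            self.default_text = parse_color(attrs.get("text")) or self.default_text
        elif tag == "font":
            # A <font> without a usable color inherits the current one.
            color = parse_color(attrs.get("color")) or self.current_color()
            self.color_stack.append(color)

    def handle_endtag(self, tag):
        if tag == "font" and self.color_stack:
            self.color_stack.pop()

    def handle_data(self, data):
        # Largest per-channel distance between text and background color.
        gap = max(abs(a - b) for a, b in zip(self.current_color(), self.background))
        if gap <= self.tolerance:
            self.hidden_chars += len(data)  # effectively invisible text
        else:
            self.visible.append(data)


def pseudo_render(raw_html, tolerance=4):
    """Return (visible text, number of hidden characters)."""
    parser = PseudoRenderer(tolerance)
    parser.feed(raw_html)
    parser.close()
    return " ".join("".join(parser.visible).split()), parser.hidden_chars


# Made-up sample in the same style as the spam quoted earlier.
DEMO = (
    "<body bgcolor=x000000 text=xffffff>FR<!123>E<b></b>E"
    "<font color=x000100 size=1>q</font> OF"
    "<font color=x020000 size=1>zzz</font>FER"
)

if __name__ == "__main__":
    print(pseudo_render(DEMO))
```

The visible text can then go through the normal keyword tests, while a
non-zero hidden-character count is itself a signal worth scoring.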

