[unisog] SP*M Detection Methods & Processes
syl at alcor.concordia.ca
Fri Sep 22 17:44:43 GMT 2006
On Fri, 22 Sep 2006, Bill Martin wrote:
> Well, for us it is that time of year again to experience the
> "Goldilocks" syndrome with our spam detection process.
I've noticed an increase in mail sent to our mail servers
(primarily an increase in spam) every fall for the past few years.
It's quite interesting, actually, though frustrating.
> I'm sure you are all familiar with it... you know, the "it's too
> much", "it's not enough", and still others saying "it's just
Mostly "it's not enough", until they start to see statistics of how
much is indeed being rejected, then it becomes "it's good, but still
> Our current architecture consists of multiple gateways, running Amavis,
> handing off to SpamAssassin, an anti-virus package and of course our
Here, except for our Faculty of Engineering and Computer Science, which
handles its own mail (in a very similar though not identical manner to
what I'm about to describe), all mail into the university (and all
outbound mail as well) passes through one of our main mail servers,
where spam detection is done as follows:
- rbl-plus.mail-abuse.org is checked from within Sendmail. Matching
messages are rejected outright.
- opm.blitzed.org is checked from within Sendmail. Matching
messages are rejected outright.
- a small amount of blocking is done via TCP_Wrappers or Sendmail's
own access database.
- Sendmail also performs basic sanity checks on envelope sender.
(is the sender qualified with a domain name? does the domain
- MimeDefang is used to provide additional checks, many of which were
written in-house: that the HELO/EHLO argument from a system outside
our network doesn't pretend to be one of our own systems (I'm also
contemplating checking whether any systems on our network are
claiming to be outside of our namespace, as that would likely help
prevent outbound spam, but it may be more prone to false positives),
that the HELO/EHLO argument is properly formatted (FQDN or bracketed
dotted quad), ...
- MimeDefang also checks the message against SpamAssassin, again with
some additional rules that were created in-house, and others that
were added from contributions (via the spamassassin-users mailing
list, for example). We currently reject at 11.4, identify as
"will soon reject" at 11.2 (yes, we've narrowed it down so far
that 0.2 in the SpamAssassin score can make the difference in a
false positive), and add a "score" header (for users to use at
their own discretion) at 5.0.
- SpamAssassin checks various additional RBLs, most of which we
found produced at least some false positives but were still good
enough to add to the probability that a message might be spam.
- The bulk of human effort has been in creating and maintaining a
list of regular expressions used by MimeDefang to try to identify
(by PTR record, so obviously not fool-proof) what are probably
client systems connecting directly to our mail servers rather than
to their own ISPs' mail servers. This is effective, but a _lot_ of
- MimeDefang checks with the mail-delivery hosts prior to accepting
messages, to avoid accepting mail for non-deliverable recipients and
then being stuck trying to send the bounces.
- MimeDefang checks any messages that pass all of the above
against Sophos (via Sophie) for virus detection. Matching messages
are silently dropped (but logged).
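
As a rough illustration of three of the checks listed above (HELO/EHLO
sanity, PTR-based client detection, and the SpamAssassin score tiers),
here is a minimal Python sketch. It is not the in-house filter code,
which is not shown in this message (MimeDefang filters are written in
Perl). LOCAL_DOMAIN, the regular expressions, and every function name
are hypothetical; only the three score thresholds come from the list
above, and treating them as inclusive (>=) is an assumption.

```python
import re

# Hypothetical local domain; the real namespace is not given in the post.
LOCAL_DOMAIN = "example.edu"

# Score thresholds quoted in the list above.
REJECT_AT = 11.4
WARN_AT = 11.2
HEADER_AT = 5.0

# A DNS label, and an FQDN whose last label is alphabetic (assumed rule,
# so that an unbracketed dotted quad is not mistaken for a hostname).
_LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
_FQDN = re.compile(rf"^(?:{_LABEL}\.)+[A-Za-z]{{2,63}}$")
_LITERAL = re.compile(r"^\[(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\]$")

# Hypothetical patterns for PTR names that look like end-user connections.
CLIENT_PTR_PATTERNS = [
    re.compile(r"(?:^|[.-])(?:dsl|dialup|dyn|dhcp|cable|pool|ppp)(?:[.-]|\d)", re.I),
    re.compile(r"\d{1,3}[.-]\d{1,3}[.-]\d{1,3}[.-]\d{1,3}"),
]


def helo_is_well_formed(arg):
    """True if the HELO/EHLO argument is an FQDN or a bracketed dotted quad."""
    literal = _LITERAL.match(arg)
    if literal:
        return all(int(octet) <= 255 for octet in literal.groups())
    return bool(_FQDN.match(arg))


def helo_forges_local_name(arg, client_is_local):
    """True if an outside system claims a name inside our namespace."""
    name = arg.lower().rstrip(".")
    ours = name == LOCAL_DOMAIN or name.endswith("." + LOCAL_DOMAIN)
    return ours and not client_is_local


def looks_like_client_host(ptr_name):
    """Guess from the PTR record whether the peer is an end-user system."""
    return any(p.search(ptr_name) for p in CLIENT_PTR_PATTERNS)


def classify_score(score):
    """Map a SpamAssassin score to the highest tier it reaches."""
    if score >= REJECT_AT:
        return "reject"
    if score >= WARN_AT:
        return "will-soon-reject"
    if score >= HEADER_AT:
        return "add-score-header"
    return "accept"
```

For example, helo_is_well_formed("[192.0.2.1]") is true while
helo_is_well_formed("192.0.2.1") is not, and classify_score(11.3)
lands in the "will-soon-reject" tier. As with the real list, PTR
matching of this kind is a heuristic and will misjudge some hosts.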
> We do have Bayes enabled as well as a number of other lists.
I'm interested to know how you perform training of a Bayesian filter
acting globally. That's a problem we would love to be able to solve
> So, given the architecture, (which from what I see is very close to
> what some companies are doing with their appliances) how does this
> compare to what others are doing?
There are two things that we would like to be able to implement here,
but both present certain technical challenges (some less difficult to
overcome, at least in theory, than others):
Bayesian classification: it has occurred to us that we very likely
would need to have per-recipient data for this to be truly effective.
That leads us to a user-training problem which is likely the most
difficult challenge to overcome.
Selective greylisting: note "selective". There are some sources of
mail we have that we (our user community) would prefer to not have
delayed, even if we're talking about a single queue-delay per
sender/recipient pair in a given period of time. This is likely not
that difficult for us to implement, but it's of course trivial for
spammers to circumvent. Still, we believe that at this time, it would
prove to be very effective, and this will likely happen in the
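
To make the "selective" part concrete, here is a minimal Python sketch
of greylisting keyed on the sender/recipient pair, with an exemption
list for sources that should never be delayed. Nothing like this is
deployed in the setup described above; the class name, the
exempt_domains set, and the min_delay and window values are all
hypothetical choices made for illustration.

```python
import time


class SelectiveGreylist:
    """Greylist sender/recipient pairs, skipping exempt sender domains."""

    def __init__(self, exempt_domains=(), min_delay=300, window=86400):
        # Sender domains whose mail is never delayed (hypothetical list).
        self.exempt_domains = {d.lower() for d in exempt_domains}
        # Seconds a new pair must wait before a retry is accepted (assumed).
        self.min_delay = min_delay
        # Seconds after first contact before the pair is forgotten (assumed).
        self.window = window
        # Maps (sender, recipient) to the time the pair was first seen.
        self.first_seen = {}

    def check(self, sender, recipient, now=None):
        """Return "accept" or "tempfail" for one delivery attempt."""
        now = time.time() if now is None else now
        domain = sender.rpartition("@")[2].lower()
        if domain in self.exempt_domains:
            return "accept"
        key = (sender.lower(), recipient.lower())
        seen = self.first_seen.get(key)
        if seen is None or now - seen > self.window:
            # First contact, or the old record expired: start the clock.
            self.first_seen[key] = now
            return "tempfail"
        if now - seen < self.min_delay:
            # Retried too quickly to look like a real queueing MTA.
            return "tempfail"
        return "accept"
```

A legitimate server that queues and retries after min_delay gets
through on its second attempt, while mail from an exempt domain is
accepted immediately. A sender that simply retries later defeats this,
which is the trivial circumvention mentioned above.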
Sylvain Robitaille syl at alcor.concordia.ca
Systems and Network analyst / (ex)Postmaster Concordia University
Instructional & Information Technology Montreal, Quebec, Canada