[unisog] SP*M Detection Methods & Processes

Sylvain Robitaille syl at alcor.concordia.ca
Fri Sep 22 17:44:43 GMT 2006

On Fri, 22 Sep 2006, Bill Martin wrote:

> Well, for us it is that time of year again: time to experience the
> "Goldilocks" syndrome with our spam detection process.

I've been noticing an increase in mail sent to our mail servers
(primarily an increase in spam) every fall over the past few years.

It's quite interesting, actually, though frustrating.

> I'm sure you are all familiar with it..   you know, the "it's too
> much", "it's not enough", and still others saying "it's just
> right"....

Mostly "it's not enough", until they start to see statistics of how
much is indeed being rejected; then it becomes "it's good, but still
not enough".

> Our current architecture consists of multiple gateways, running Amavis,
> handing off to SpamAssassin, an anti-virus package and of course our
> MTA.

Here, except for our Faculty of Engineering and Computer Science, who
handle their own mail (in very similar though not identical manner to
what I'm about to describe), all mail into the university (and all
outbound mail as well) passes through one of our main mail servers,
where spam detection is done as follows:

   - rbl-plus.mail-abuse.org is checked from within Sendmail.  Matching
     messages are rejected outright.

   - opm.blitzed.org is checked from within Sendmail.  Matching
     messages are rejected outright.

   - a small amount of blocking is done via TCP_Wrappers or Sendmail's
     own access database.

   - Sendmail also performs basic sanity checks on the envelope sender
     (is the sender qualified with a domain name?  does the domain
     resolve?).

   - MIMEDefang is used to provide additional checks, many of which were
     written in-house: the HELO/EHLO argument from a system outside our
     network must not claim to be one of our own systems (I'm also
     contemplating checking whether any systems on our network are
     claiming to be outside our namespace, as that would likely help
     prevent outbound spam, but it may be more prone to false
     positives); the HELO/EHLO argument must be properly formatted
     (FQDN or bracketed dotted quad); ...

   - MIMEDefang also checks the message against SpamAssassin, again with
     some additional rules that were created in-house, and others added
     from contributions (via the spamassassin-users mailing list, for
     example).  We currently reject at 11.4, and identify as "will soon
     reject" at 11.2 (yes, we've tuned it so finely that 0.2 in the
     SpamAssassin score can make the difference in a false positive),
     and add a "score" header (for users to use at their own
     discretion) at 5.0.

   - SpamAssassin checks various additional RBLs, most of which we
     found produced at least some false positives but were still good
     enough to add to the probability that a message might be spam.

   - The bulk of the human effort has been in creating and maintaining
     a list of regular expressions used by MIMEDefang to try to
     identify (by PTR record, so obviously not fool-proof) what are
     probably client systems connecting directly to our mail servers
     rather than to their own ISPs' mail servers.  This is effective,
     but a _lot_ of work to maintain.
   - MIMEDefang checks with the mail-delivery hosts prior to accepting
     messages, to avoid accepting mail for non-deliverable recipients
     and then being stuck trying to send the bounces.

   - MIMEDefang checks any messages that are accepted beyond the above
     against Sophos (via Sophie) for virus detection.  Matching
     messages are silently dropped (but logged).
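For readers unfamiliar with how the DNSBL checks in the first two items
work: the connecting client's IP address is reversed and looked up
under the list's zone, and any answer means the address is listed.  A
minimal sketch of the lookup-name convention (the zone name is the one
mentioned above; in Sendmail the actual lookup is handled by its dnsbl
FEATURE):

```python
# Sketch of the DNSBL lookup convention: reverse the client's IPv4
# octets and query under the list's zone.  An A-record answer for the
# resulting name means the address is listed.
import ipaddress

def dnsbl_query_name(client_ip: str, zone: str) -> str:
    """Return the DNS name whose existence indicates a listing."""
    octets = ipaddress.IPv4Address(client_ip).exploded.split(".")
    return ".".join(reversed(octets)) + "." + zone

# e.g. dnsbl_query_name("192.0.2.99", "rbl-plus.mail-abuse.org")
#   -> "99.2.0.192.rbl-plus.mail-abuse.org"
```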
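The HELO/EHLO formatting rule mentioned above (FQDN or bracketed
dotted quad) is easy to express as a pair of regular expressions.
This is my own approximation of such a check, not the actual
MIMEDefang code:

```python
import re

# A HELO/EHLO argument should be either a fully-qualified domain name
# (at least two dot-separated labels, no leading/trailing hyphens) or
# an address literal like [192.0.2.1].  Deliberately approximate:
# octet ranges in the literal are not validated.
_FQDN = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)"
                   r"(\.(?!-)[A-Za-z0-9-]{1,63}(?<!-))+$")
_LITERAL = re.compile(r"^\[\d{1,3}(\.\d{1,3}){3}\]$")

def helo_arg_ok(arg: str) -> bool:
    """True if arg looks like an FQDN or a bracketed dotted quad."""
    return bool(_FQDN.match(arg) or _LITERAL.match(arg))
```

A bare single-label name (the classic "HELO WORKGROUP" from infected
PCs) fails both patterns and would be rejected.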
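The tiered score handling described above (reject at 11.4, flag "will
soon reject" at 11.2, score header at 5.0) amounts to a small decision
function.  The threshold values are the ones from this post; the
header names are illustrative choices of mine, not necessarily what
their filter adds:

```python
# Tiered handling of a SpamAssassin score, as described in the text.
REJECT_AT = 11.4   # reject outright
WARN_AT = 11.2     # tag as "will soon reject"
TAG_AT = 5.0       # add a score header for users' own filtering

def disposition(score: float) -> tuple:
    """Return ("reject"|"accept", headers-to-add)."""
    headers = {}
    if score >= REJECT_AT:
        return "reject", headers
    if score >= TAG_AT:
        headers["X-Spam-Score"] = "%.1f" % score
        if score >= WARN_AT:
            headers["X-Spam-Warning"] = "will soon reject"
    return "accept", headers
```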
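The PTR-based client detection is essentially a list of patterns that
match the naming conventions ISPs use for dynamic/residential pools.
The patterns below are typical examples of the genre, invented for
illustration; they are not the actual in-house ruleset:

```python
import re

# Illustrative patterns of the sort such a list might contain: PTR
# names suggesting a dynamic or residential client rather than a
# legitimate mail server.  Examples only, not Concordia's rules.
DYNAMIC_PTR_PATTERNS = [
    re.compile(r"(^|[.-])(a?dsl|cable|dialup|dyn(amic)?|pool|ppp)[.-]",
               re.IGNORECASE),
    re.compile(r"\d{1,3}[.-]\d{1,3}[.-]\d{1,3}[.-]\d{1,3}"),  # embedded IP
]

def looks_like_client_ptr(ptr: str) -> bool:
    """True if the PTR name matches any "probably a client" pattern."""
    return any(p.search(ptr) for p in DYNAMIC_PTR_PATTERNS)
```

As the text notes, this is effective but never fool-proof: a real mail
server behind an unfortunate PTR name will false-positive, which is
why the list needs continual human maintenance.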

> We do have Bayes enabled, as well as a number of other lists.

I'm interested to know how you perform training of a Bayesian filter
acting globally.  That's a problem we would love to be able to solve.

> So, given the architecture, (which from what I see is very close to
> what some companies are doing with their appliances) how does this
> compare to what others are doing?

There are two things that we would like to be able to implement here,
but both present certain technical challenges (some less difficult to
overcome, at least in theory, than others):

  Bayesian classification: it has occurred to us that we would very
  likely need per-recipient data for this to be truly effective.  That
  leads to a user-training problem, which is likely the most difficult
  challenge to overcome.

  Selective greylisting: note "selective".  There are some sources of
  mail that we (our user community) would prefer not to have delayed,
  even if we're talking about a single queue-delay per
  sender/recipient pair in a given period of time.  This is likely not
  that difficult for us to implement, but it's of course trivial for
  spammers to circumvent.  Still, we believe that at this time it
  would prove to be very effective, and this will likely happen in the
  foreseeable future.
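Selective greylisting can be sketched in a few lines: defer each new
(sender, recipient, client-IP) triplet once with a temporary failure,
except for sources on a never-delay list.  The retry window and the
exemption mechanism below are illustrative assumptions, not a design
we have settled on:

```python
import time

RETRY_WINDOW = 300            # illustrative: seconds before a retry is accepted
NEVER_DELAY = {"192.0.2.25"}  # hypothetical sources our users want undelayed

_first_seen = {}              # triplet -> time of first delivery attempt

def greylist(sender, rcpt, client_ip, now=None):
    """Return "accept" or "tempfail" (SMTP 4xx) for this attempt."""
    now = time.time() if now is None else now
    if client_ip in NEVER_DELAY:
        return "accept"                  # selective: never delay these
    key = (sender, rcpt, client_ip)
    first = _first_seen.get(key)
    if first is None:
        _first_seen[key] = now
        return "tempfail"                # legitimate MTAs will retry
    if now - first < RETRY_WINDOW:
        return "tempfail"                # retried too soon
    return "accept"                      # triplet has aged; let it through
```

The weakness the text mentions is visible here: any spamware that
simply retries after the window passes gets through, so the benefit
depends on most spam sources not retrying at all.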

Sylvain Robitaille                              syl at alcor.concordia.ca

Systems and Network analyst / (ex)Postmaster      Concordia University
Instructional & Information Technology        Montreal, Quebec, Canada
