So What Makes a Good Spam Filter Anyway?
By Alan Hearnshaw
This is not just a rhetorical question. It is a question that many users, and many developers, do not ask, and consequently it largely remains unanswered. Maybe it could be better answered by defining the qualities of the perfect spam filter.
We’ll call our perfect spam filter the “SpamSplatter 3000”. Here are some of the defining qualities of the “SpamSplatter 3000”:

It requires zero interaction from the user
It produces zero false positives or negatives
It is transparent

That’s it. Not much of a shopping list, is it?

Of course, the “SpamSplatter 3000” hasn’t been invented yet (and if it ever is, I want a piece of the action), but it does give us a frame of reference when looking for the best filter we can find. Let’s take each point in turn:

It requires zero interaction from the user

There are two
kinds of filters that currently come near to this ideal: Bayesian filters and community filters.

Bayesian filters strip messages down to small “word bites”, or tokens, and maintain a database containing lists of good and bad tokens. When a new message is encountered, the filter strips it down to tokens, compares them to the database, and applies a formula based on the British mathematician Thomas Bayes’ theorem for calculating probabilities.

Community filters simply work on a voting system whereby
every user who receives a spam message “votes” it as spam. This information is stored on a central server, and when enough votes are received the message is blocked for all users in the community.

As can be seen, the user interaction with these types of filters is mainly limited to a two-button operation (correcting wrongly identified messages), and the more accurate the filter, the less those buttons are used.

OK, so that’s pretty good. Not exactly zero interaction, but if the filter is accurate enough, then it should be pretty near. That brings us to point two:

It produces zero false positives or negatives

This is the area
in which most spam filter development is concentrated, and things are getting pretty good nowadays. It is not at all unusual to see an efficient modern filter achieve accuracy of 96% or better. It is, of course, far better to have a false negative than a false positive if you are ever going to tear yourself away from the killed mail folder!

Of course, by definition, community filters cannot reach 100% accuracy, as someone has to receive the spam in order to vote it as such!

Theoretically, a Bayesian filter may eventually be able to get quite close to 100% accuracy, so at least there is hope there. Content-based filters (those that look for certain words, phrases or other indicators in a message to identify it as spam) will almost certainly not get much higher accuracy figures than the best of them can achieve today. Adapting to changing spam requires new filters to be created on an ongoing basis.

And finally, we come to the holy grail of spam filtering:

It is transparent

Strangely enough,
not enough work seems to be done in trying to achieve this goal. Some of the best filters on the market today identify spam with impressive accuracy and then simply place it in a “killed mail” folder for your later perusal.

Now, forgive me if I’m missing something here, but isn’t the point to save you having to wade through the junk mail? Isn’t that what you bought the filter for? With the “SpamSplatter 3000”, you don’t need to do that.

As we haven’t achieved 100% accuracy yet (and probably never will), the only way to free us from checking the killed mail folder is a challenge/response system. This is where a message is automatically sent back to the sender requiring them to take some action for their message to actually be delivered.

Some systems tend
to go overboard with the challenge/response system. These systems, often called “whitelist” systems, block messages from anyone who isn’t in the user’s friends list. Guaranteed 100% effective, but too drastic a measure for most users.

Now, it seems that the most intelligent use of this system would be to send challenges only in response to messages that were flagged as “questionable”. Good messages can be delivered, definite spam can be deleted, and questionable ones would earn themselves a challenge message.

So, to sum up, let’s rewrite the qualities of our perfect filter and get a shopping list of what to look for while we wait for the “SpamSplatter 3000” to arrive:

It’s simple
really. Now, who’s going to build me this “SpamSplatter 3000”…?

Spam filters. Most of us know we need one. Some of us know we need a better one, but how many stop to think what actually makes a good spam filter in the first place?
It requires zero interaction from the user
Over time, the Bayesian filter “learns” the characteristics of spam messages.
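To make the Bayesian approach described in this article a little more concrete, here is a minimal sketch of a token-based filter that "learns" from messages the user has classified. It is only an illustration under stated assumptions: the names (tokenize, BayesianFilter, train, spam_probability), the simple word tokenizer and the add-one smoothing are all hypothetical choices, not the method of any particular product.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Strip a message down to lowercase word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


class BayesianFilter:
    """Toy token filter; structure and smoothing are illustrative assumptions."""

    def __init__(self):
        self.spam_tokens = Counter()   # token -> number of spam messages containing it
        self.good_tokens = Counter()   # token -> number of good messages containing it
        self.spam_messages = 0
        self.good_messages = 0

    def train(self, text, is_spam):
        """Record the tokens of a message the user has marked as spam or good."""
        tokens = set(tokenize(text))
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_messages += 1
        else:
            self.good_tokens.update(tokens)
            self.good_messages += 1

    def spam_probability(self, text):
        """Combine per-token evidence using Bayes' rule, working in log-odds."""
        # Prior odds of spam, smoothed so an untrained filter returns 0.5.
        log_odds = math.log((self.spam_messages + 1) / (self.good_messages + 1))
        for token in set(tokenize(text)):
            # Add-one smoothing keeps unseen tokens from producing zero probabilities.
            p_in_spam = (self.spam_tokens[token] + 1) / (self.spam_messages + 2)
            p_in_good = (self.good_tokens[token] + 1) / (self.good_messages + 2)
            log_odds += math.log(p_in_spam / p_in_good)
        # Clamp before converting back to a probability to avoid overflow.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

Each click on a "this is spam" or "this is good" button would correspond to one call to train, which is why the buttons get used less as the filter sees more mail.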
It produces zero false positives or negatives
It is transparent
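The article's suggestion of challenging only "questionable" messages can be sketched as a three-way decision on a spam score. This is a rough illustration, not a description of any existing filter: the function name triage, the Action values, the two threshold defaults and the idea of skipping the challenge for already-known senders are all assumptions made for the example.

```python
from enum import Enum


class Action(Enum):
    DELIVER = "deliver"
    DELETE = "delete"
    CHALLENGE = "challenge"


def triage(spam_score, sender, known_senders, good_below=0.2, spam_above=0.9):
    """Decide what to do with a message given a spam score between 0 and 1.

    The thresholds are hypothetical; a real filter would tune them.
    """
    # Assumption: senders the user already trusts are never challenged.
    if sender in known_senders:
        return Action.DELIVER
    if spam_score < good_below:
        return Action.DELIVER      # clearly good mail goes straight to the inbox
    if spam_score > spam_above:
        return Action.DELETE       # definite spam is removed without user review
    return Action.CHALLENGE        # only the uncertain middle earns a challenge
```

With an accurate filter the middle band stays small, so few legitimate senders ever see a challenge and the user never has to open a killed mail folder.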
alan@whichspamfilter.com