Spam Filters Explained
By Alan Hearnshaw
What do they do? How do they work? Which one is right for me?

Spam is a very real problem that many people have to deal with on a daily
basis. For those who have decided to do something about it and have started to
investigate the options available in spam filtering, this article provides a
brief introduction to the types of spam filters available.

Despite the bewildering array of spam filters available today, all claiming to
be the best “of its kind”, there are really just five filtering methodologies
in general use, and all products rely on one, or a combination, of these:

Content-Based Filters

“In the beginning, there were content-based filters.”

These filters scan the contents of the message and look for tell-tale signs
that the message is spam. In the early days of spamming it was quite simple to
look out for “kill words” such as “Lose Weight” and mark a message as spam if
one was found.

Very soon, though, spammers got wise to this and started resorting to all
kinds of tricks to get their message past the filters. The days of
“obfuscation” had begun. We started getting messages containing the phrase
“L0se Welght” (notice the zero for “o” and the “l” for “i”) and even more
bizarre, and sometimes quite ingenious, variations.

This rendered basic content-based filters somewhat ineffective, although there
are one or two on the market now that are clever enough to “see through” these
attempts and still provide good results.

Bayesian Based Filters

“The Reverend Bayes comes to the rescue”

Born in London in 1702, the son of a minister, Thomas Bayes
developed a formula which allowed him to determine the probability of an event
occurring based on the probabilities of two or more independent evidentiary
events.

Bayesian filters “learn” from studying known good and bad messages. Each
message is split into single “word bytes”, or tokens, and these tokens are
placed into a database along with how often they are found in each kind of
message.

When a new message arrives to be tested by the filter, it is also split into
tokens and each token is looked up in the database. Extrapolating results from
the database and applying a form of the good reverend’s formula, known as the
“Naive Bayesian” formula, the filter gives the message a “spamicity” rating,
and the message can be dealt with accordingly.

Bayesian filters are typically capable of achieving very good accuracy rates
(>97% is not uncommon), and require very little ongoing maintenance.

Whitelist/Blacklist Filters

“Who goes there, friend or foe?”

This very basic form of filtering is seldom
used on its own nowadays, but can be useful as part of a larger filtering
strategy.

A “whitelist” is nothing more than a list of e-mail addresses from which you
wish to accept communications. A whitelist filter would only accept messages
from these people, and all others would be rejected.

A “blacklist”, conversely, is a list of e-mail addresses - and sometimes IP
addresses (computer identification addresses) - from which communications will
not be accepted.

While this may seem like a good idea at the outset, a whitelist methodology is
too restrictive for most people and, as virtually all spam e-mails carry a
forged “from” address, there is little point in collecting this address to ban
it in future, as it is very unlikely to be the same next time.

There are bodies on the Internet that maintain lists of known “bad” sources of
e-mail. Many filters today have the ability to query these servers to see if
the message they are looking at comes from a source identified by this
Internet-based blacklist, or RBL (Realtime Blackhole List). While quite
effective, these lists do tend to suffer from “false positives”, where good
messages are incorrectly identified as spam. This happens often with
newsletters.

Challenge/Response Filters

“Open sesame!”

Challenge/Response filters are
characterised by their ability to automatically send a response to a previously
unknown sender asking them to take some further action before their message
will be delivered. This is often referred to as a “Turing Test” - named after a
test devised by British mathematician Alan Turing to determine if machines
could “think”.

Recent years have seen the appearance of some Internet services which
automatically perform this Challenge/Response function for the user and require
the sender of an e-mail to visit their web site to facilitate the receipt of
their message.

Critics of this system claim it to be too drastic a measure and that it sends a
message that “my time is more important than yours” to the people trying to
communicate with you.

For some low-traffic e-mail users, though, this system alone may be a perfectly
acceptable method of completely eliminating spam from their inbox - one step
above the “whitelist” system outlined above.

Community Filters

“A united front”

These types of filters work on the
principle of “communal knowledge” of spam. When a user receives a spam message,
they simply mark it as such in their filter. This information is sent to a
central server where a “fingerprint” of the message is stored.

After enough people have “voted” this message to be spam, it is stopped from
reaching all the other people in the community.

This type of filtering can prove to be quite effective, although it stands to
reason that it can never be 100% effective, as a few people have to receive the
spam for it to be “flagged” in the first place. Just like its cousin the
Internet blacklist (RBL), this system can also suffer from “false positives”,
or messages incorrectly identified as spam.

Hopefully you are now armed with a little more information to be able to make
an informed decision on the best spam filter for you.

For further information, consider reading the reviews and articles found at
http://www.whichspamfilter.com
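
For readers who like to see an idea in code, the Bayesian approach described in this article can be sketched roughly as follows. This is only an illustration, not how any particular product works: the class name NaiveBayesFilter, the simple tokenizer, the add-one smoothing and the assumption that spam and good mail are equally likely beforehand are all choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a message into lowercase word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesFilter:
    """Hypothetical minimal filter: learns token counts, then rates new messages."""

    def __init__(self):
        # The "database": how many messages of each kind contained each token.
        self.token_counts = {"spam": Counter(), "good": Counter()}
        self.message_counts = {"spam": 0, "good": 0}

    def train(self, text, label):
        """Learn from a known message; label must be "spam" or "good"."""
        self.message_counts[label] += 1
        # Count each token once per message, however often it repeats.
        self.token_counts[label].update(set(tokenize(text)))

    def spamicity(self, text):
        """Return a rating between 0 (looks good) and 1 (looks like spam)."""
        log_odds = 0.0  # assumption: spam and good mail are equally likely a priori
        for token in set(tokenize(text)):
            # Add-one smoothing so unseen tokens never give a zero probability.
            p_spam = (self.token_counts["spam"][token] + 1) / (self.message_counts["spam"] + 2)
            p_good = (self.token_counts["good"][token] + 1) / (self.message_counts["good"] + 2)
            # "Naive" step: treat tokens as independent and add up their evidence.
            log_odds += math.log(p_spam / p_good)
        # Convert log-odds to a probability in a numerically safe way.
        if log_odds >= 0:
            return 1 / (1 + math.exp(-log_odds))
        e = math.exp(log_odds)
        return e / (1 + e)


# Usage with a tiny, made-up training set.
spam_filter = NaiveBayesFilter()
spam_filter.train("lose weight fast", "spam")
spam_filter.train("cheap pills lose weight now", "spam")
spam_filter.train("meeting agenda for monday", "good")
spam_filter.train("lunch on monday?", "good")
print(spam_filter.spamicity("lose weight today"))
print(spam_filter.spamicity("monday meeting"))
```

A real filter would be trained on thousands of messages and would then treat anything above a chosen rating (say 0.9, again just an assumption) as spam.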
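
The “voting” idea behind community filters can also be sketched in a few lines. Again, this is a simplified illustration under stated assumptions: the CommunityServer class, the threshold of three votes and the use of an exact SHA-256 hash as the “fingerprint” are all invented for this example, and real services are likely to use fuzzier fingerprints that survive small changes to the message.

```python
import hashlib
import re
from collections import defaultdict

VOTE_THRESHOLD = 3  # hypothetical number of reports needed to flag a message


def fingerprint(message):
    """Reduce a message to a short fingerprint, ignoring case and spacing."""
    normalized = re.sub(r"\s+", " ", message.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class CommunityServer:
    """Hypothetical central server that collects spam reports from users."""

    def __init__(self, threshold=VOTE_THRESHOLD):
        self.threshold = threshold
        # fingerprint -> set of users who reported it (one vote per user)
        self.votes = defaultdict(set)

    def report_spam(self, user_id, message):
        """Record that this user marked the message as spam."""
        self.votes[fingerprint(message)].add(user_id)

    def is_spam(self, message):
        """True once enough different users have reported the same message."""
        return len(self.votes.get(fingerprint(message), ())) >= self.threshold


# Usage: the message is only blocked after the third distinct report.
server = CommunityServer()
for user in ("ann", "bob", "cat"):
    server.report_spam(user, "Lose weight NOW!")
print(server.is_spam("lose   weight now!"))
```

The sketch also shows why this method can never be 100% effective: the first few recipients have to receive and report the message before anyone else is protected.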
Alan Hearnshaw is the owner of http://www.whichspamfilter.com, a
web site which conducts weekly in-depth reviews of current spam filters,
provides help and guidance in the fight against spam and provides a useful
community forum.
alan@whichspamfilter.com