Mini how-to: KMail and SpamOracle
This is a description of how to configure KMail to filter spam using SpamOracle (Bayesian self-learning algorithm filtering for English and non-English messages) on IMAP and POP accounts without using procmail.
Introduction:
SpamOracle implements a naive Bayesian self-learning algorithm for efficient and very fast filtering of English and non-English messages.
In October 2002 it was the first known Bayesian spam filter to accomplish this. As of August 2003, bogofilter also filters non-English messages.
First, you MUST follow the instructions in the README file of the SpamOracle software at http://cristal.inria.fr/~xleroy/software.html for compilation and installation.
You will find a link to the theory involved on that page.
Obtain a spam reporting e-mail account:
This is optional but VERY worthwhile, because it helps stop spammers at their ISP.
Before you start configuring KMail and SpamOracle, obtain a spam reporting e-mail account from a spam blocking service.
I use http://spamcop.net and forward the spam I receive as attachments.
It is the easiest and most powerful spam reporting service I have seen, but you can use another service, including a fully commercial one.
Procedures:
In KMail, you will create two folders (maildir format) containing your samples of nonspam and spam messages.
These samples will be used to teach SpamOracle about what is good and what is bad.
Name these new folders (maildir format) "nonspam" and "spam".
Maildir folders are much less prone to indexing errors.
Carefully hand-pick as many spam messages as you can from your regular mail accounts and copy or move them to the new spam folder.
Note how many there are. KMail shows this number in the status bar (lower left corner).
Mark all of them as read.
Now, one of the tricks for good filtering: choose AT LEAST twice that many good messages from your accounts and copy them to the nonspam folder. Better yet, collect three times as many good messages as you have spam messages.
Choose those good messages that are typical examples of the good ones you receive. This way, the program will learn YOUR pattern.
Clearly, if your normal good messages resemble spam, the program will have a hard time telling the good pattern from the bad one.
Be realistic when choosing messages, however. The program must learn YOUR "good message" pattern.
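The message counts and the two-to-one rule above can also be checked from a terminal. This is only a minimal sketch: it assumes the two folders are maildir directories under ~/Mail (the location used by the rebuild command later in this how-to), and the names count_msgs, ratio_ok, GOOD_DIR and SPAM_DIR are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers: count the message files in a maildir folder and
# check the "at least twice as many good messages as spam" rule.

# Folder locations are assumptions; adjust them to your KMail setup.
GOOD_DIR=${GOOD_DIR:-$HOME/Mail/nonspam}
SPAM_DIR=${SPAM_DIR:-$HOME/Mail/spam}

# Print the number of regular files below a folder (0 if it is missing).
count_msgs() {
    [ -d "$1" ] || { echo 0; return; }
    find "$1" -type f | wc -l | tr -d ' '
}

# Succeed when good >= 2 * spam.
ratio_ok() {
    [ "$1" -ge $(( 2 * $2 )) ]
}

good=$(count_msgs "$GOOD_DIR")
spam=$(count_msgs "$SPAM_DIR")
echo "nonspam: $good  spam: $spam"
if ratio_ok "$good" "$spam"; then
    echo "ratio looks fine"
else
    echo "collect at least $(( 2 * spam - good )) more good messages"
fi
```

The numbers should agree with what KMail shows in its status bar for each folder.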
Create the database as described in the spamoracle readme file.
Run the tests described in the README in a terminal window, and use the results to add or remove messages that turn out not to be good typical examples.
But do not exclude spam examples that you have verified manually. SpamOracle must learn what the spam you receive looks like.
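For maildir folders, the test step can be run over the individual message files. The following is a sketch, not the README's exact procedure. It assumes that "spamoracle test" accepts the -min and -max probability bounds (check the man page of your installed version) and that each file under the folder is one message. The SPAMORACLE variable and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: list doubtful samples in the training folders.
# SPAMORACLE is a hypothetical variable naming the spamoracle binary.
SPAMORACLE=${SPAMORACLE:-spamoracle}

# Run "spamoracle test" with the given options over every file in a folder.
# Usage: test_folder FOLDER [options for "spamoracle test"...]
test_folder() {
    folder=$1
    shift
    find "$folder" -type f -print0 | xargs -0 "$SPAMORACLE" test "$@"
}

# Good messages that score high are candidate false positives
# (assumes -min hides messages below the given spam probability).
show_doubtful_good() { test_folder "$1" -min 0.2; }

# Spam messages that score low are candidate misses
# (assumes -max hides messages above the given spam probability).
show_doubtful_spam() { test_folder "$1" -max 0.8; }
```

For example, `show_doubtful_good ~/Mail/nonspam | more` would page through the good samples that SpamOracle finds suspicious.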
This filtering method was thoroughly tested on IMAP accounts and also works well on POP accounts (see the KMail help on the "pipe through" filter action).
You are now ready to configure KMail for the initial filtering. Once this is done, the filters will move detected spam to the spam folder.
The concepts:
The SpamOracle filters will be applied after discussion lists, e-newsletters, known trusted senders, etc., have been matched by simple filters. This is because the SpamOracle filters are more CPU-intensive than those other filters.
One filter tests whether the message has already been marked as good or spam. If it has not been marked, the filter tests it and marks it as good or spam. Messages previously marked as unknown can be tested again as the spam database evolves.
The other filter checks whether the message is marked as spam and, if so, removes the added X-Spam and X-Attachments headers and moves the message to the spam folder.
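The check made by the second filter can be sketched in shell. This assumes, as in the filter rules later in this how-to, that a message classified as spam carries an X-Spam header whose value begins with "yes". The function name is_marked_spam is made up for this example.

```shell
#!/bin/sh
# Succeed if the header block of the message on stdin contains an
# "X-Spam:" header whose value starts with "yes".
is_marked_spam() {
    # sed prints up to the first blank line (end of the headers) and quits,
    # so an "X-Spam: yes" line quoted in the body is not matched.
    sed '/^$/q' | grep -qi '^X-Spam:[[:space:]]*yes'
}

# Example use, with a message already piped through "spamoracle mark":
#   is_marked_spam < marked-message && echo "this one goes to the spam folder"
```

Only the headers are inspected, which mirrors a KMail rule on the X-Spam header rather than a search through the whole message.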
An important decision, based on the pattern of your spam: decide whether spamoracle will analyze the body and ATTACHMENTS, or the body only (headers are always analyzed).
Keep in mind that analyzing attachments can impose heavy CPU usage. Let's say you receive spam with 3 megabytes of attachments every day. It may be more accurate to analyze the attachments as well.
But much spam these days is short and has enough words in the headers and body to be caught. Some spammers now use the attachment trick to escape common spam filtering. A few use only images, with no words in the body, to escape filtering. You will have to experiment.
Two final procedures are optional: bouncing messages to the sender, and reporting spam to a spam blocking service.
Do not bounce messages to the sender.
Most spam these days uses false or forged sender addresses or, even worse, the real addresses of innocent third parties.
NEVER answer a spam message or try to 'unsubscribe' from the spammer's database.
Doing so confirms that your address is valid, and you will be sent even more spam.
Reporting spam to a service is also optional.
But it can help get spammers included in RBLs and other spam blocking systems.
The steps:
Menu Configuration > configure filters
New filter > rename
rename it to "SpamOracle"
move the filter to the bottom, after the mailing list, newsletter and trusted sender filters.
Filter criteria: match any of the following rules
any header: matches regular expression .*
Filter action:
remove header X-Spam
remove header X-Attachments
pipe through spamoracle mark
Advanced options:
apply this filter: to received messages
to outgoing messages (does not matter)
on manual filtering
UNMARK "if this filter matches, stop processing here".
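Outside KMail, the three actions of this filter amount to a small pipeline: drop any old X-Spam and X-Attachments headers, then pass the message through spamoracle mark. Below is a minimal sketch of that idea. The function names and the SPAMORACLE variable are hypothetical, and this is not how KMail implements its filter actions.

```shell
#!/bin/sh
SPAMORACLE=${SPAMORACLE:-spamoracle}   # hypothetical; e.g. /usr/bin/spamoracle

# Copy a message from stdin to stdout without its X-Spam and
# X-Attachments headers (including folded continuation lines).
strip_spam_headers() {
    awk '
        in_body  { print; next }                 # body: copy unchanged
        /^$/     { in_body = 1; print; next }    # blank line ends the headers
        /^[ \t]/ { if (!skip) print; next }      # continuation of previous header
                 { skip = (tolower($0) ~ /^x-(spam|attachments):/) }
        !skip    { print }
    '
}

# Rough equivalent of the filter: strip old marks, then let SpamOracle re-mark.
remark_message() {
    strip_spam_headers | "$SPAMORACLE" mark
}
```

Removing the old headers first means a message that is filtered again carries only the verdict from the current database.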
Create another new filter:
new filter > rename
rename it to "SpamOracle_is_spam"
move the filter to the bottom, after the previously created one.
Filter criteria: match all of the following rules
X-Spam contains yes;
Filter action:
remove header X-Spam
remove header X-Attachments
move to folder spam
mark as unread
Advanced options:
apply this filter: to received messages
to outgoing messages (does not matter)
on manual filtering
MARK "if this filter matches, stop processing here".
Manual step (optional) for spam reporting
Until a "forward as attachment" filter action becomes available, you will have to:
go to the spam folder
select all newly arrived and correctly identified spam.
Forward it as attachments in a message to your special spam reporting e-mail account.
Follow any additional steps required by your spam reporting service.
Tips:
If you decide to analyze attachments as well, configure the command line accordingly, as described in the SpamOracle README file.
With this filter configuration, all messages moved to the spam folder are marked as unread and are easy to find.
Open the spam folder to verify that the classification is correct.
Check the spam folder at least once a day, looking at the new unread messages moved to it and verifying that they were classified correctly. Most of us simply cannot afford to lose any "good" message. It could be that big customer request...
Any good message moved to the spam folder should be manually moved to the nonspam folder. Then remove the old database and recreate it.
Each day, as new spam messages are added to the spam folder, remove the old database and recreate it, so that spamoracle keeps improving its patterns. At the very least, recreate the database after each misclassification you find, whether it is a false positive or spam that got through.
Keep an eye on the message counts, as recommended earlier. Add new typical and not-so-typical good messages to the nonspam folder. Instead of deleting "good" messages you have already read, move them to the nonspam folder.
Hints from Roger Chrisman (rogerhc at pacbell.net) :
The regular expression used is a period + an asterisk (.*). Details, details...
SpamOracle will not work unless the spamoracle program is in a directory in your PATH or is called with its full path. Example: /usr/bin/spamoracle
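A quick way to check this from a terminal is sketched below. It assumes that KMail is started with the same PATH as your shell, which may not hold on every desktop setup. The function name find_tool is made up for this example.

```shell
#!/bin/sh
# Print the full path of a program if the shell can find it on PATH.
find_tool() {
    command -v "$1" 2>/dev/null
}

if path=$(find_tool spamoracle); then
    echo "found; the filter action can use: $path mark"
else
    echo "spamoracle is not on PATH; use its full path in the filter action" >&2
fi
```

If in doubt, putting the full path in the "pipe through" action avoids the question entirely.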
Building database
For maildir folders, you need a different command to invoke SpamOracle. It is described at the end of the spamoracle man page. You can rebuild your database with the command below (rm -f keeps it from stopping when no database exists yet, and -print0 with xargs -0 copes with unusual characters in file names):
rm -f ~/.spamoracle.db && find ~/Mail/nonspam -type f -print0 | xargs -0 spamoracle add -v -good && find ~/Mail/spam -type f -print0 | xargs -0 spamoracle add -v -spam
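Since the database is rebuilt often, a variation is to build the new database under a temporary name and swap it in only if both training runs succeed, so that filtering never sees a half-built database. This is a sketch under assumptions: it relies on spamoracle's -f option to select the database file (check your man page), and the function name rebuild_db and the SPAMORACLE variable are hypothetical.

```shell
#!/bin/sh
SPAMORACLE=${SPAMORACLE:-spamoracle}   # hypothetical; name or path of the binary

# Usage: rebuild_db DATABASE GOOD_FOLDER SPAM_FOLDER
rebuild_db() {
    db=$1; good=$2; spam=$3
    tmp="$db.new"
    rm -f "$tmp"
    # Train into the temporary file: good samples first, then spam.
    if find "$good" -type f -print0 | xargs -0 "$SPAMORACLE" -f "$tmp" add -v -good &&
       find "$spam" -type f -print0 | xargs -0 "$SPAMORACLE" -f "$tmp" add -v -spam
    then
        mv "$tmp" "$db"      # replace the old database only on success
    else
        rm -f "$tmp"         # keep the old database untouched on failure
        return 1
    fi
}
```

With the paths used in this how-to, the call would be `rebuild_db ~/.spamoracle.db ~/Mail/nonspam ~/Mail/spam`.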
Real life effectiveness:
With 300 well-chosen nonspam messages and around 50 spam messages, you can expect to catch around 80% of spam, with some false positives.
With 500 well-chosen nonspam messages and around 100 spam messages, you can expect to catch around 90% of spam, with ZERO false positives.
With 1500 well-chosen nonspam messages and around 500 spam messages, you can expect to catch around 98% of spam, with ZERO false positives.
Your mileage will vary (remember that it depends on your own spam/nonspam profile), but you get the picture.
The more you use and train it, the better the results will be (though a bigger database means more CPU usage).
Do not expect to reach an absolute 100%, because spam messages do not use only "exclusive" words. And if you are a salesperson, expect much lower efficiency, because your "good" messages will be very similar to "bad" ones. So EXPECT some false positives and be very careful: check your spam folder every day so that you do not lose that multibillion purchase order.
IMAP issues with Kmail and SpamOracle:
With IMAP accounts, KMail filtering only works when applied manually (select the messages to process and press Ctrl+J, use the toolbar button, or choose menu Message > Apply Filters).
KMail can also only move messages from the remote inbox to local folders. This is not a limitation for this SpamOracle setup, since it needs local folders and files anyway.
Be prepared: with IMAP filtering, KMail downloads and uploads all messages and generates a lot of traffic. But it works without procmail, and you keep your messages on a secure server that is accessible from anywhere.
Combined spam filtering
Today I use a combined solution for filtering spam. My ISP tags message headers using SpamAssassin. Then, locally, I run, in this order: bogofilter, spamoracle and a customized rblfilter.
These programs each deserve their own mini how-to, as does the combined solution. I will publish them; check http://www.andrefelipemachado.hpg.com.br/linux/index.html (my personal Linux page, in Brazilian Portuguese) for links to each one.
Contact, suggestions, corrections: