
Mini how-to Kmail and Bogofilter

Purpose:

This mini how-to describes how to configure KMail to filter spam with Bogofilter (a self-learning filter based on the Robinson-Fisher algorithm, suitable for English and non-English messages) on IMAP and POP accounts, without using procmail.

Introduction:

Bogofilter implements the Robinson-Fisher self-learning algorithm for fast and highly efficient filtering of English and non-English messages.

As of August 2003, bogofilter and SpamOracle are the filters known to accomplish this.

First, download a binary or compile your own from http://bogofilter.sourceforge.net . Reading the extensive documentation and tuning tips there is optional, but recommended for optimal results.

In this document, I will not cover bogofilter tuning by tweaking its configuration file. All steps will use program defaults.

This mini how-to builds on previous documents by Thomas Strauß (thst AT strauss-it.de) and by bharnish AT technologist.com, and on my own mini how-to for KMail and SpamOracle, all published at http://kmail.kde.org .

Obtain a spam reporting e-mail account:

This is optional but VERY worthwhile, because it helps stop spammers at their ISP.

Before you start configuring KMail and bogofilter, obtain a spam reporting e-mail account from a spam blocking service.

I use http://spamcop.net and forward the spam I receive as attachments.

It is the easiest and most powerful spam reporting service I have seen, but you can use another one, or a fully commercial service, as well.

Procedures:

In KMail, you will create two folders (maildir format) containing your samples of nonspam and spam messages.

These samples will be used to teach bogofilter about what is good and what is bad.

Name these new folders (maildir format) "nonspam" and "spam".

Maildir folders are much less prone to indexing errors than mbox folders.

Carefully hand-pick as many spam messages as you can from your regular mail accounts and copy or move them to the new spam folder.

Note how many there are. KMail shows this number in the status bar (lower left corner).

Mark all of them as read.

Now, one of the tricks for good filtering with bogofilter: choose as many good messages as you have spam messages from your accounts and copy them to the nonspam folder.

Choose those good messages that are typical examples of the good ones you receive. This way, the program will learn YOUR pattern.

Clearly, if your normal good messages are similar to spammy ones, the program will have a hard time telling the good pattern from the bad one.

Be realistic when choosing messages, however. The program must learn YOUR "good message" pattern.
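
To keep the two sample sets balanced, the messages can also be counted from a terminal instead of the KMail status bar. The sketch below is only an illustration: it assumes the folders live in ~/Mail/spam and ~/Mail/nonspam (the paths used later in this document) and that each maildir folder keeps its messages in the usual cur and new subdirectories. The function name count_messages is made up for this example.

```shell
# Count the messages stored in one maildir folder.
# Assumption: messages are regular files under cur/ and new/.
count_messages() {
    find "$1/cur" "$1/new" -type f 2>/dev/null | wc -l | tr -d ' '
}

# Usage (paths are assumptions, adjust them to your setup):
#   echo "spam:    $(count_messages ~/Mail/spam)"
#   echo "nonspam: $(count_messages ~/Mail/nonspam)"
```

If the two numbers differ a lot, add messages to the smaller folder before building the database.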

Create the database as described in the bogofilter man page (the commands are given in the "Building database" section near the end of this document).

Run the tests and, based on the results shown in the terminal window, exclude or include the messages that turn out not to be good typical examples (again, see the bogofilter documentation).

But do not exclude spam examples that you have verified manually. bogofilter must learn what the spam you receive looks like.

This filtering method was thoroughly tested on IMAP accounts and also works well on POP accounts (see the KMail help on the "pipe through" filter action).

You are ready to configure KMail for the initial filtering. After this, bogofilter will move found spam to the spam folder.

The concepts:

The bogofilter filters are applied after simple filters have matched discussion lists, e-newsletters, known trusted senders, etc. This is because the bogofilter filters are more CPU-intensive than those other filters.

The first filter removes any bogofilter headers left over from a previous run, then tests the message and marks it as good or spam. This way, messages previously marked as unknown can be tested again as the spam database evolves.

The second filter checks whether the message is marked as spam and, if so, removes the headers added by bogofilter and moves the message to the spam folder.

Two final points concern what to do with the spam you catch: do not bounce it, and consider reporting it.

Do not bounce messages back to the sender. Most spam these days uses false or forged sender addresses or, even worse, the real addresses of innocent third parties.

NEVER answer spam or try to 'unsubscribe' from the spammers' database.

Doing so gives them a sure confirmation of your address, and they will send even more spam.

Reporting spam to a service is optional, but it helps get the spammers listed on RBLs and other spam blocking systems.

The steps:

Menu Configuration > configure filters

New filter > rename

rename it as "bogofilter"

move filter to bottom, after mailing lists, newsletters, trusted senders filters.

Filter criteria: match any of the following rules

any header search for regular expression .*

Filter action:

remove header X-Bogosity

remove header X-Attachments

pipe through bogofilter -epv

Advanced options:

apply this filter: for received messages

for outgoing messages (don't care)

manual filtering

UNMARK "if this filter matches, stop processing here".
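
What this first filter does can be tried by hand in a terminal before trusting it inside KMail. The sketch below is an approximation, not KMail's actual implementation: strip_header and classify are hypothetical helper names, and the BOGOFILTER variable exists only so that a full path can be substituted. As for the bogofilter options, -p passes the message through with an X-Bogosity header added, -e makes the exit code 0 whenever the message could be classified, and -v asks for more verbose output.

```shell
# Remove one header (and its continuation lines) from a message on stdin.
# Only the header section is touched; the body is copied unchanged.
strip_header() {
    awk -v name="$1" '
        BEGIN { pat = tolower(name) ":" }
        inbody { print; next }                 # body: copy as-is
        /^$/ { inbody = 1; print; next }       # blank line ends the headers
        /^[ \t]/ { if (!skip) print; next }    # continuation of previous header
        { skip = (index(tolower($0), pat) == 1); if (!skip) print }
    '
}

# Mimic the filter: drop a stale X-Bogosity header, then let bogofilter
# add a fresh one. BOGOFILTER defaults to "bogofilter" on the PATH.
classify() {
    strip_header X-Bogosity | "${BOGOFILTER:-bogofilter}" -epv
}

# Usage (the file name is a placeholder):
#   classify < saved-message.eml | grep -i '^X-Bogosity'
```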

Create another new filter:

new filter > rename

rename it as "bogofilter_is_spam"

move filter to bottom, after the previously created one.

Filter criteria: match all of the following rules

X-Bogosity contains Yes

Filter action:

remove header X-Bogosity

remove header X-Attachments

move to folder spam

mark as unread

Advanced options:

apply this filter: for received messages

for outgoing messages (don't care)

manual filtering

MARK "if this filter matches, stop processing here".
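
The test this second filter performs can be reproduced in a terminal, which is handy for checking a saved message. This is a minimal sketch, assuming the X-Bogosity header written by bogofilter contains the word Yes for spam, as in the rule above; the function name is_marked_spam is invented for the example.

```shell
# Succeed (exit 0) if the message on stdin carries an X-Bogosity header
# containing "yes"; fail (exit 1) otherwise. Only headers are inspected.
is_marked_spam() {
    awk '
        /^$/ { exit }                                        # end of headers
        tolower($0) ~ /^x-bogosity:.*yes/ { found = 1; exit }
        END { exit(found ? 0 : 1) }
    '
}

# Usage (the file name is a placeholder):
#   if is_marked_spam < saved-message.eml; then echo "spam"; else echo "not spam"; fi
```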

Manual step (optional) for spam reporting

Until a "forward as attachment" filter action becomes available, you will have to:

go to the spam folder

select all newly arrived and correctly identified spam.

Forward them as attachments in a message to your special spam reporting e-mail account.

Follow additional steps from your spam reporting service.

Tips:

With this filter configuration, all messages moved to the spam folder are marked as unread and are easy to find.

Open the spam folder to verify that the classification is correct.

You should check this spam folder at least once a day, looking for the new unread messages moved to it and verifying that they were classified correctly. Most of us simply cannot afford to lose any "good" message. It could be that big customer request...

Any good message moved to the spam folder should be manually moved to the nonspam folder. Then remove the old database and recreate it.

Each day, as new spam messages are added to the spam folder, remove the old database and recreate it, so that bogofilter keeps improving its patterns. At the very least, recreate the database after each misclassification you find, whether it is a false positive or spam that got through.

Keep an eye on the message counts, as previously recommended. Add new typical and not so typical good messages to the nonspam folder. Instead of deleting the "good" messages you have already read, move them to the nonspam folder.
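
Rebuilding from scratch after every misclassification can be slow. As an alternative to the full rebuild, bogofilter can also be trained one message at a time: -n registers a message as nonspam and -s as spam, while -Sn and -Ns move a message that was already registered in the wrong list. The sketch below wraps these calls; the function name train_message and the BOGOFILTER variable are made up for illustration.

```shell
# Feed one message file to bogofilter with the given training flag.
#   -n   register as nonspam         -s   register as spam
#   -Sn  move from spam to nonspam   -Ns  move from nonspam to spam
# BOGOFILTER may hold a full path; it defaults to "bogofilter".
train_message() {
    "${BOGOFILTER:-bogofilter}" "$1" < "$2"
}

# Usage (file names are placeholders):
#   train_message -n  ~/Mail/nonspam/cur/some-message   # new good message
#   train_message -Sn ~/Mail/nonspam/cur/some-message   # was trained as spam by mistake
```

Use -Sn or -Ns only for messages that an earlier database build really included in the wrong list; for messages bogofilter has never been trained on, plain -n or -s is the matching choice.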

Hints from Roger Chrisman (rogerhc at pacbell.net) :

The regular expression used is a period + an asterisk (.*). Details, details...

Bogofilter will not work unless it is in a directory on your PATH or is called with its full path. Example: /usr/bin/bogofilter
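
A quick way to check this from a terminal is sketched below. It relies only on the standard shell builtin command -v; the function name find_program is invented for the example.

```shell
# Print the location of a program if the shell can find it on the PATH;
# return a non-zero status if it cannot.
find_program() {
    command -v "$1" 2>/dev/null
}

# Usage:
#   find_program bogofilter || echo "bogofilter is not on the PATH; use its full path in the KMail filter"
```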

Building database

For maildir folders, you need a different command to invoke bogofilter. You can rebuild your database with the command below:

rm -f ~/.bogofilter/wordlist.db && find ~/Mail/spam -type f | bogofilter -vvv -s -b && find ~/Mail/nonspam -type f | bogofilter -vvv -n -b

Beware: creating the bogofilter database is a lengthy process. Building a database from 10500 spam and 10500 nonspam messages took 1h30min on a 2.4 GHz Athlon with 512 MB of RAM.

The line below is MUCH faster (it works only with the newest versions) and took 15 minutes:

rm -f ~/.bogofilter/wordlist.db && nice find ~/Mail/spam -type f -print | xargs bogofilter -vvv -s -B && nice find ~/Mail/nonspam -type f -print | xargs bogofilter -vvv -n -B
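
The same rebuild can be wrapped in a small function so it is easy to repeat. This is a sketch under assumptions, not the exact command used above: rebuild_wordlist is a hypothetical name, -print0 and xargs -0 (GNU and BSD extensions) are swapped in so that unusual file names survive the pipe, and bogofilter's -d option names the database directory explicitly. Like the faster command above, it relies on -B and therefore needs a recent bogofilter.

```shell
# Rebuild the bogofilter word list from two maildir folders.
#   $1 = spam folder, $2 = nonspam folder, $3 = bogofilter directory
# BOGOFILTER may hold a full path; it defaults to "bogofilter".
rebuild_wordlist() {
    rm -f "$3/wordlist.db"
    find "$1" -type f -print0 | xargs -0 "${BOGOFILTER:-bogofilter}" -d "$3" -s -B &&
    find "$2" -type f -print0 | xargs -0 "${BOGOFILTER:-bogofilter}" -d "$3" -n -B
}

# Usage (paths follow the commands above):
#   rebuild_wordlist ~/Mail/spam ~/Mail/nonspam ~/.bogofilter
```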

Real life effectiveness:

With 300 well-chosen nonspam messages and around 50 spam messages, you can expect to catch around 80% of spam, with some false positives.

With 500 well-chosen nonspam messages and around 100 spam messages, you can expect to catch around 90% of spam, with ZERO false positives.

With 1500 well-chosen nonspam messages and around 500 spam messages, you can expect to catch around 98% of spam, with ZERO false positives.

Your mileage will vary (remember that it depends on your own spam/nonspam profile), but you get the picture.

The more you use and train it, the better the results will be (but a bigger database means more CPU usage).

Do not expect to reach an absolute 100%, because spam messages do not use only "exclusive" words. And if you are a salesman, expect much lower efficiency, because your "good" messages will be very similar to the "bad" ones. So EXPECT some false positives and be very careful: check your spam folder each day so that you do not lose that multibillion purchase order.

IMAP issues with Kmail and bogofilter:

IMAP filtering with KMail only works when applied manually (select the messages to process and press Ctrl+J, use the toolbar button, or choose menu Message > Apply Filters).

KMail can also only move messages from the remote inbox to local folders. This is not a limitation for this bogofilter setup, since it needs local folders and files anyway.

Be prepared: with IMAP filtering, KMail downloads and uploads all messages and generates a lot of traffic. But it works without procmail, and you keep your messages on a secure server that is available from anywhere in the world.

Combined spam filtering

Today I use a combined solution for filtering spam. My ISP tags message headers using SpamAssassin. Then, locally, I use the following in this sequence: bogofilter, SpamOracle and a customized rblfilter.

These programs deserve their own mini how-tos, and so does the combined solution. I will publish them at http://www.andrefelipemachado.hpg.com.br/linux/index.html (my personal Linux page, in Brazilian Portuguese), so visit it soon and look for the links to each one. I will also submit these documents for publication at http://kmail.kde.org .

