Mini how-to Kmail and SpamOracle

This is a description of how to configure KMail to filter spam using SpamOracle (Bayesian self-learning algorithm filtering for English and non-English messages) on IMAP and POP accounts without using procmail.

Introduction:

SpamOracle implements a naive Bayesian self-learning algorithm for efficient and very fast filtering of English and non-English messages.

As of October 2002, it was the first Bayesian spam filter I knew of that could do this. Since August 2003, bogofilter filters non-English messages too.

First, you MUST follow the instructions in the README file of the SpamOracle software at http://cristal.inria.fr/~xleroy/software.html to compile and install it.

You will find a link to the underlying theory on that page.

Obtain a spam reporting e-mail account:

This is optional but VERY useful for stopping spammers at their ISP.

Before you start configuring KMail and SpamOracle, obtain a spam reporting e-mail account from a spam blocker service.

I use http://spamcop.net and forward the spam I receive as attachments.

It is the easiest and most powerful spam reporting service I have seen, but you can use another one, or a fully commercial service, as well.

Procedures:

In KMail, you will create two folders (maildir format) containing your samples of nonspam and spam messages.

These samples will be used to teach SpamOracle about what is good and what is bad.

Name these new folders (maildir format) "nonspam" and "spam".

Maildir folders are much less prone to indexing errors.

Carefully hand-pick as many spam messages as you can from your regular mail accounts and copy or move them to the new spam folder.

Note how many there are. KMail shows this number in the status bar (lower left corner).

Mark all of them as read.

Now, one of the tricks for good filtering: choose AT LEAST twice that many good messages from your accounts and copy them to the nonspam folder. Better yet, collect three times as many good messages as you have spam messages.
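The counts can also be checked from a terminal. The following is only a minimal sketch, assuming the two folders are maildir directories such as ~/Mail/nonspam and ~/Mail/spam; the function names count_msgs and ratio_ok are made up for this illustration and are not part of KMail or SpamOracle.

```shell
#!/bin/sh
# Count the message files stored in a maildir folder
# (every regular file under cur/, new/ and tmp/ is counted).
count_msgs() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Succeed when the nonspam folder holds at least twice as many
# messages as the spam folder. Arguments: nonspam dir, spam dir.
ratio_ok() {
    good=$(count_msgs "$1")
    spam=$(count_msgs "$2")
    [ "$good" -ge $((spam * 2)) ]
}

# Example call (paths are assumptions about your setup):
# ratio_ok "$HOME/Mail/nonspam" "$HOME/Mail/spam" || echo "collect more good messages"
```

If ratio_ok fails, add more typical good messages to the nonspam folder before building the database.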

Choose those good messages that are typical examples of the good ones you receive. This way, the program will learn YOUR pattern.

Clearly, if your normal good messages are similar to spammy ones, the program will have a hard time telling the good pattern from the bad one.

Be realistic when choosing messages, however. The program must learn YOUR "good message" pattern.

Create the database as described in the SpamOracle README file.

Run the tests and, guided by the results shown in the terminal window, exclude or include messages according to whether they turn out to be good typical examples (again, see the README file).

But do not exclude manually verified examples of spam. SpamOracle must learn what the spam you receive looks like.

This filtering method was thoroughly tested on IMAP accounts and also works well on POP accounts (see the KMail help on "pipe through" filtering).

You are ready to configure KMail for the initial filtering. After this, spamoracle will move found spam to the spam folder.

The concepts:

The SpamOracle filters are applied after discussion lists, e-newsletters, known trusted senders, etc. have been matched by simple filters, because the SpamOracle filters are more CPU-intensive than those.

One filter checks whether a message has already been marked as good or spam. If it has not been marked, the filter tests it and marks it as good or spam. Messages previously marked as unknown can be tested again as the spam database evolves.

The other filter checks whether the message is marked as spam and, if so, removes the added X-Spam and X-Attachments headers and moves it to the spam folder.
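To see what the second filter keys on, you can inspect a marked message by hand. This is a rough sketch, assuming the X-Spam header starts with a verdict word followed by a semicolon (which is what the "X-Spam contains yes;" rule in the steps relies on); spam_verdict is a hypothetical helper name, not a SpamOracle command.

```shell
#!/bin/sh
# Read one message on stdin and print the verdict word of its first
# X-Spam header (for example "yes"). Prints nothing when the headers
# contain no X-Spam line.
spam_verdict() {
    # Stop at the first empty line so text in the body is never matched.
    sed -n -e '/^$/q' -e 's/^X-Spam: *\([a-z]*\);.*/\1/p' | head -n 1
}

# Example call (the file name is an assumption):
# spam_verdict < some_message_file
```

A message whose verdict is "yes" is the one the second filter would move to the spam folder.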

An important decision, based on the pattern of your spam: you should decide whether SpamOracle will analyze body and ATTACHMENTS, or body only (headers are always analyzed).

Keep in mind that analyzing attachments can impose heavy CPU usage. Say you receive spam with 3 megabytes of attachments every day: analyzing the attachments too may be more accurate, but it costs more.

But much spam these days is short and has enough words in the headers and body to be caught. Some spammers now use the attachment trick to escape common spam filtering. A few use only images, with no words in the body, to escape filtering. You will have to experiment.

Two final procedures are optional: not bouncing messages to the sender, and reporting spam to a spam blocking service.

Most spam these days uses false or forged sender addresses or, even worse, the real addresses of innocent third parties.

NEVER answer spam or try to 'unsubscribe' from the spammers' database.

That gives them a sure confirmation of your address, and they will send even more spam.

Reporting spam to a service is also optional.

But it helps to get spammers listed on RBLs and other spam blocking systems.

The steps:

Menu Configuration > configure filters

New filter > rename

rename it as "SpamOracle"

move filter to bottom, after mailing lists, newsletters, trusted senders filters.

Filter criteria: match any of the following rules

any header search for regular expression .*

Filter action:

remove header X-Spam

remove header X-Attachments

pipe through spamoracle mark

Advanced options:

apply this filter: for received messages

for outgoing messages (don't care)

manual filtering

UNMARK "if this filter matches, stop processing here".

Create another new filter:

new filter > rename

rename it as "SpamOracle_is_spam"

move filter to bottom, after the previously created one.

Filter criteria: match all of the following rules

X-Spam contains yes;

Filter action:

remove header X-Spam

remove header X-Attachments

move to folder spam

mark as unread

Advanced options:

apply this filter: for received messages

for outgoing messages (don't care)

manual filtering

MARK "if this filter matches, stop processing here".

Manual step (optional) for spam reporting

Until a "forward as attachment" filter action becomes available, you will have to:

go to the spam folder

select all newly arrived and correctly identified spam.

Forward them as attachments in a message to your special spam reporting e-mail account.

Follow additional steps from your spam reporting service.

Tips:

If you decide to analyze attachments also, configure the command line accordingly, as described in the SpamOracle README file.

With this filter configuration, all messages moved to the spam folder are marked as unread and are easy to find.

Open the spam folder to verify the correctness of classification.

You should check the spam folder at least once a day, looking for new unread messages moved to it and verifying that they were classified correctly. Most of us simply cannot afford to lose any "good" message. It could be that big customer request...

Any good message moved to the spam folder should be manually moved to the nonspam folder. Then remove the old database and recreate it.

Each day, as new spam messages are added to the spam folder, remove the old database and recreate it, so that SpamOracle keeps improving its patterns. At the very least, recreate the database for each misclassification you find, whether it is a false positive or spam that got through.

Keep an eye on the message counts, as recommended earlier. Add new typical and not so typical good messages to the nonspam folder. Instead of deleting "good" messages you have already read, move them to the nonspam folder.

Hints from Roger Chrisman (rogerhc at pacbell.net) :

The regular expression used is a period + an asterisk (.*). Details, details...

SpamOracle will not work unless it is in a directory in the path or is called with path information. Example: /usr/bin/spamoracle
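A quick way to check the second hint from a terminal is sketched below. It only uses the standard shell built-in "command -v"; the function name find_in_path is made up for this example.

```shell
#!/bin/sh
# Print where the shell finds a command, or return non-zero
# when the command is not in any directory listed in $PATH.
find_in_path() {
    command -v "$1"
}

# Example call:
# find_in_path spamoracle || echo "use the full path in the KMail filter action"
```

If nothing is printed for spamoracle, write the full path (for example /usr/bin/spamoracle) in the "pipe through" filter action.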

Building database

With maildir folders, you need a different command to invoke SpamOracle. It is described at the end of the spamoracle man page. You can rebuild your database with the command below (rm -f keeps the command from stopping when no database exists yet):

rm -f ~/.spamoracle.db && find ~/Mail/nonspam -type f -print | xargs spamoracle add -v -good && find ~/Mail/spam -type f -print | xargs spamoracle add -v -spam
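Since the database is rebuilt often, it can be convenient to wrap the command in a small function. The sketch below is one possible variation, not the author's method: it keeps the previous database as a backup instead of deleting it. The variables SPAMORACLE and MAILDIR_ROOT and the function name rebuild_db are assumptions made for this example; only the "spamoracle add -v -good" and "add -v -spam" invocations and the ~/.spamoracle.db location come from the command above.

```shell
#!/bin/sh
# Illustrative settings (not SpamOracle options): the program to call
# and the directory that holds the nonspam and spam maildir folders.
SPAMORACLE=${SPAMORACLE:-spamoracle}
MAILDIR_ROOT=${MAILDIR_ROOT:-$HOME/Mail}

# Rebuild the database from the two training folders, keeping the
# previous database as ~/.spamoracle.db.bak instead of removing it.
rebuild_db() {
    db="$HOME/.spamoracle.db"
    if [ -f "$db" ]; then
        mv "$db" "$db.bak"
    fi
    find "$MAILDIR_ROOT/nonspam" -type f -print | xargs "$SPAMORACLE" add -v -good &&
    find "$MAILDIR_ROOT/spam" -type f -print | xargs "$SPAMORACLE" add -v -spam
}

# Example call:
# rebuild_db
```

If a rebuild goes wrong, the previous database is still available as ~/.spamoracle.db.bak and can be moved back by hand.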

Real life effectiveness:

With about 300 well-chosen nonspam messages and around 50 spam messages, you can expect a spam catch rate of around 80% and some false positives.

With about 500 well-chosen nonspam messages and around 100 spam messages, you can expect a spam catch rate of around 90% and ZERO false positives.

With about 1500 well-chosen nonspam messages and around 500 spam messages, you can expect a spam catch rate of around 98% and ZERO false positives.

Your mileage will vary (remember that your own spam / nonspam profile matters), but you get the picture.

The more you use and train it, the better the results will be (but a bigger database means more CPU usage).

Do not expect to reach an absolute 100%, because spam messages do not use only "exclusive" words. And if you are a salesman, expect much lower efficiency, because your "good" messages will be very similar to the "bad" ones. So EXPECT some false positives and be very careful: check your spam folder every day so you do not lose that multibillion purchase order.

IMAP issues with Kmail and SpamOracle:

IMAP filtering with KMail only works when applied manually (select the messages to process and press Ctrl+J, click the toolbar button, or use the menu Message > Apply Filters).

KMail also can only move messages from the remote inbox to local folders. This is not a limitation for this SpamOracle filter, though, since it needs local folders and files anyway.

Be prepared: with IMAP filtering, KMail downloads and uploads all messages and generates a lot of traffic. But it works without procmail, and you keep your messages on a secure server that is reachable from anywhere.

Combined spam filtering

Today I use a combined spam filtering solution. My ISP tags message headers using SpamAssassin. Then, locally, I use, in this sequence: bogofilter, SpamOracle and a customized rblfilter.

These programs, and the combined solution, deserve their own mini how-tos. I will publish them; check http://www.andrefelipemachado.hpg.com.br/linux/index.html (my personal Brazilian Portuguese Linux page) soon for the links to each one.

Contact, suggestions, corrections:

contact form
