Improving the effectiveness and accuracy of SpamAssassin by updating its rules automatically on Debian

You can improve the default effectiveness and accuracy of SpamAssassin on Debian systems by automatically updating its rules from the official channel and from a suggested additional channel.
This tutorial shows how to update the rules and how to include the Sought rules, which are generated automatically every day from messages caught in spam traps.
See also the other "Related Content" articles on this site about antispam and SpamAssassin.

vi /etc/default/spamassassin
# Cronjob
# Set to anything but 0 to enable the cron job to automatically update
# spamassassin's rules on a nightly basis
#AFM 20150723 https://wiki.apache.org/spamassassin/ImproveAccuracy
CRON=1
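
To confirm the setting without opening the editor again, a small check like the one below can help. This is only a sketch: the function name sa_cron_enabled is hypothetical, and it assumes the file contains a plain CRON=value line as shown above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the last uncommented CRON= line in the
# given defaults file has a non-empty value other than 0.
sa_cron_enabled() {
    file="${1:-/etc/default/spamassassin}"
    # Print the value of every uncommented CRON= line, keep the last one,
    # and strip optional quotes around it.
    value=$(sed -n 's/^[[:space:]]*CRON=//p' "$file" | tail -n 1 | tr -d "\"'")
    [ -n "$value" ] && [ "$value" != "0" ]
}

# Usage (not run automatically):
#   sa_cron_enabled && echo "nightly sa-update is enabled"
```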

mkdir ~/spamassassin
cd ~/spamassassin/
mkdir /etc/spamassassin/sa-update-keys
chmod go-rx /etc/spamassassin/sa-update-keys
mkdir -p ~/temp/etc
cd ~/temp/etc
cp -pr /etc/spamassassin .
ls -lh /var/lib/spamassassin/
ls -lh /var/lib/spamassassin/sa-update-keys/
ls -lh /var/lib/spamassassin/3.004000/
ls -lh /var/lib/spamassassin/3.004000/updates_spamassassin_org/
mkdir -p ~/temp/var/lib
cp -pr /var/lib/spamassassin ~/temp/var/lib/
ls -lah ~/temp/var/lib/spamassassin
wget http://spamassassin.apache.org/updates/GPG.KEY
sa-update --import GPG.KEY
mv GPG.KEY spamassassinGPG.KEY
sa-update --checkonly -v
sa-update -v  --channel updates.spamassassin.org
ls -lah /var/lib/spamassassin/3.004000/updates_spamassassin_org
invoke-rc.d spamassassin reload
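
A reload is only needed when new rules were actually installed. According to the sa-update documentation, exit code 0 means an update was downloaded and installed, 1 means no fresh updates were available, and higher codes indicate a lint failure or download problems. The following is a minimal sketch of acting on that; the function name sa_update_action is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: map an sa-update exit code to the action to take.
sa_update_action() {
    case "$1" in
        0) echo "reload" ;;       # new rules were installed
        1) echo "nothing" ;;      # no fresh updates available
        *) echo "investigate" ;;  # lint failure, partial or failed download
    esac
}

# Usage (not run automatically, needs root):
#   sa-update -v --channel updates.spamassassin.org
#   [ "$(sa_update_action $?)" = "reload" ] && invoke-rc.d spamassassin reload
```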


# You can now install the Sought rules:
wget http://yerp.org/rules/GPG.KEY
sa-update --import GPG.KEY
mv GPG.KEY soughtGPG.KEY
sa-update --checkonly -v
sa-update -v  --gpgkey 6C6191E3 --channel sought.rules.yerp.org  --channel updates.spamassassin.org
ls -lah /var/lib/spamassassin/3.004000/sought_rules_yerp_org/
invoke-rc.d spamassassin reload
#sa-update && /etc/init.d/spamassassin reload
less /var/lib/spamassassin/3.004000/sought_rules_yerp_org/20_sought.cf
cat /var/lib/spamassassin/3.004000/updates_spamassassin_org/STATISTICS-set0-72_scores.cf.txt
##### WITH NEW RULES AND SCORES #####
# SUMMARY for threshold 5.0:
# Correctly non-spam: 135863  39.432%  (97.611% of non-spam corpus)
# Correctly spam:     149688  43.444%  (72.889% of spam corpus)
# False positives:      3325  0.965%  (2.389% of nonspam, 146801 weighted)
# False negatives:     55677  16.159%  (27.111% of spam, 139536 weighted)
# Average score for spam:  10.0    nonspam: 1.0
# Average for false-pos:   6.0  false-neg: 2.5
# TOTAL:              344553  100.00%
Reading scores from "tmprules"...
Reading per-message hit stat logs and scores...
# SUMMARY for threshold 5.0:
# Correctly non-spam:  16997  97.42%
# Correctly spam:      18797  73.13%
# False positives:       450  2.58%
# False negatives:      6908  26.87%
# TCR(l=50): 0.874082  SpamRecall: 73.126%  SpamPrec: 97.662%
##### WITHOUT NEW RULES AND SCORES #####
Reading scores from "../rules-base"...
Reading per-message hit stat logs and scores...
# SUMMARY for threshold 5.0:
# Correctly non-spam: 135534  97.37%
# Correctly spam:      56405  27.47%
# False positives:      3654  2.63%
# False negatives:    148960  72.53%
# TCR(l=50): 0.619203  SpamRecall: 27.466%  SpamPrec: 93.916%
Reading scores from "../rules-base"...
Reading per-message hit stat logs and scores...
# SUMMARY for threshold 5.0:
# Correctly non-spam:  17011  97.50%
# Correctly spam:       7152  27.82%
# False positives:       436  2.50%
# False negatives:     18553  72.18%
# TCR(l=50): 0.637003  SpamRecall: 27.823%  SpamPrec: 94.254%
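
The SpamRecall and SpamPrec figures in these summaries follow from the counts printed just above them: recall is correctly classified spam divided by all spam (correct plus false negatives), and precision is correctly classified spam divided by everything flagged as spam (correct plus false positives). The sketch below reproduces that arithmetic with awk; the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers reproducing the percentages shown in the summaries.

# spam_recall CORRECT_SPAM FALSE_NEGATIVES
spam_recall() {
    LC_ALL=C awk -v tp="$1" -v fn="$2" 'BEGIN { printf "%.3f\n", 100 * tp / (tp + fn) }'
}

# spam_precision CORRECT_SPAM FALSE_POSITIVES
spam_precision() {
    LC_ALL=C awk -v tp="$1" -v fp="$2" 'BEGIN { printf "%.3f\n", 100 * tp / (tp + fp) }'
}

# With new rules and scores (second summary above):
spam_recall 18797 6908      # prints 73.126
spam_precision 18797 450    # prints 97.662
# Without new rules and scores (last summary above):
spam_recall 7152 18553      # prints 27.823
```

On this test corpus, the updated rules and scores raise spam recall from about 28% to about 73% at the same 5.0 threshold, while the false positive rate stays close to 2.5%.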

cat  /var/lib/spamassassin/3.004000/updates_spamassassin_org/STATISTICS-set1-72_scores.cf.txt
##### WITH NEW RULES AND SCORES #####
# SUMMARY for threshold 5.0:
# Correctly non-spam: 154663  41.631%  (99.548% of non-spam corpus)
# Correctly spam:     106767  28.739%  (49.397% of spam corpus)
# False positives:       703  0.189%  (0.452% of nonspam,  57031 weighted)
# False negatives:    109374  29.441%  (50.603% of spam, 220677 weighted)
# Average score for spam:  8.9    nonspam: -0.5
# Average for false-pos:   5.8  false-neg: 2.0
# TOTAL:              371507  100.00%
Reading scores from "tmprules"...
Reading per-message hit stat logs and scores...
# SUMMARY for threshold 5.0:
# Correctly non-spam:  19456  99.51%
# Correctly spam:      13315  49.17%
# False positives:        95  0.49%
# False negatives:     13766  50.83%
# TCR(l=50): 1.462573  SpamRecall: 49.167%  SpamPrec: 99.292%
##### WITHOUT NEW RULES AND SCORES #####
Reading scores from "../rules-base"...
Reading per-message hit stat logs and scores...
# SUMMARY for threshold 5.0:
# Correctly non-spam: 154853  99.67%
# Correctly spam:      87475  40.47%
# False positives:       513  0.33%
# False negatives:    128666  59.53%
# TCR(l=50): 1.400639  SpamRecall: 40.471%  SpamPrec: 99.417%
Reading scores from "../rules-base"...
Reading per-message hit stat logs and scores...
# SUMMARY for threshold 5.0:
# Correctly non-spam:  19484  99.66%
# Correctly spam:      10975  40.53%
# False positives:        67  0.34%
# False negatives:     16106  59.47%
# TCR(l=50): 1.391910  SpamRecall: 40.527%  SpamPrec: 99.393%

ls -lah /var/lib/spamassassin/3.004000/updates_spamassassin_org
less /var/lib/spamassassin/3.004000/updates_spamassassin_org/50_scores.cf
less /var/lib/spamassassin/3.004000/updates_spamassassin_org/72_scores.cf

Now that you have manually tested the update, adjust the permissions so that the daily spamassassin cron job can update the rules automatically for you:
chown -R debian-spamd:debian-spamd /var/lib/spamassassin
chown -R debian-spamd:debian-spamd /etc/spamassassin/sa-update-keys/
chown -R debian-spamd:debian-spamd /etc/spamassassin/sa-update-hooks.d/
su - debian-spamd -c "/usr/bin/sa-update -v --gpghomedir /var/lib/spamassassin/sa-update-keys"
sh -x /etc/cron.daily/spamassassin
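
Before relying on the cron job, it can be worth confirming that the ownership change reached every file under those directories. This is a minimal sketch using the standard find -user test; the function name not_owned_by is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list every path under DIR that is NOT owned by USER.
# Empty output means the whole tree has the expected owner.
not_owned_by() {
    user="$1"
    dir="$2"
    find "$dir" ! -user "$user" -print
}

# Usage (not run automatically):
#   not_owned_by debian-spamd /var/lib/spamassassin
#   not_owned_by debian-spamd /etc/spamassassin/sa-update-keys
```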

Verify that it will run daily.
If your machine is not running 24 hours a day, you must install anacron.
apt-get install anacron
run-parts -v --report /etc/cron.daily

The next day, at 06:25 on Debian, your rules will be updated automatically.
Verify this the next day by reading /var/log/syslog and /var/log/cron.log:
less /var/log/syslog
less /var/log/cron.log
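
Besides the logs, the modification times of the rule files give a quick hint that an update was installed. This is a minimal sketch with a hypothetical function name, recent_rules. Note that an empty result may simply mean that no new rules were published that day, not that the cron job failed.

```shell
#!/bin/sh
# Hypothetical helper: list rule files under DIR modified in the last 24 hours.
recent_rules() {
    dir="${1:-/var/lib/spamassassin}"
    find "$dir" -type f -name '*.cf' -mtime -1 -print
}

# Usage (not run automatically):
#   recent_rules /var/lib/spamassassin
```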


