Training Rspamd Naive-Bayes

1. Collect samples
  • Save spam emails into a folder, e.g. spam/
  • Save legitimate emails into a folder, e.g. ham/
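Before copying anything, it can help to confirm that each folder holds enough messages. The sketch below assumes one message per file directly inside each folder. `count_samples` and `check_samples` are hypothetical helper names, and the default minimum of 200 is taken from the note in step 5.

```shell
# Count regular files directly inside a sample folder (one message per file assumed).
count_samples() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Report whether a folder reaches the minimum sample count.
# The default of 200 comes from the note in step 5; pass another value as $2.
check_samples() {
    dir="$1"
    min="${2:-200}"
    n=$(count_samples "$dir")
    if [ "$n" -ge "$min" ]; then
        echo "$dir: $n samples (ok)"
    else
        echo "$dir: $n samples (need at least $min)"
        return 1
    fi
}

# Usage on the host, from the directory that contains both folders:
#   check_samples spam
#   check_samples ham
```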
2. Move the data into the container

Assuming your folders are on the host system, copy them into the rspamd container:

docker cp spam/ rspamd:/tmp/spam
docker cp ham/ rspamd:/tmp/ham

3. Install the rspamd client inside the container

docker exec -it rspamd apk add rspamd-client

4. Check Rspamd statistics

docker exec -it rspamd rspamc -P tuxguard stat

Look for the Total learns: line, which shows how many samples have been learned so far.
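To compare the count before and after training without reading the whole report, the number can be pulled out of the stat output. This is a minimal sketch: `total_learns` is a hypothetical helper, and it assumes the report contains a line of the form "Total learns: N" (matched case-insensitively in case the capitalization differs).

```shell
# Print the number that follows "Total learns:" in `rspamc stat` output read from stdin.
# Assumption: the report has a line like "Total learns: 412"; matching ignores case.
total_learns() {
    awk 'tolower($0) ~ /^[ \t]*total learns:/ {
        sub(/^[^:]*:[ \t]*/, "")   # drop the label, keep the value
        print
        exit
    }'
}

# Usage (no -it here, because the output is piped rather than shown on a terminal):
#   docker exec rspamd rspamc -P tuxguard stat | total_learns
```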

5. Train the model with your samples

Note that the Bayes classifier only becomes active once enough samples have been learned. By default, Rspamd requires at least 200 learns for each class (spam and ham).

docker exec -it rspamd sh -c 'rspamc -P tuxguard learn_spam /tmp/spam/*.eml'
docker exec -it rspamd sh -c 'rspamc -P tuxguard learn_ham /tmp/ham/*.eml'

6. Verify training

docker exec -it rspamd rspamc -P tuxguard stat

The Total learns count should now be higher than it was in step 4.

Example with downloaded spam messages

If you don’t yet have enough of your own spam samples, you can bootstrap training using a public dataset, e.g. https://untroubled.org/spam/:

Download a sample spam archive

cd /tmp
curl -o 2022.7z https://untroubled.org/spam/2022.7z

Extract the archive. If 7za is not installed, install it first, e.g. with dnf install epel-release; dnf install p7zip

7za x 2022.7z

Copy extracted messages into the container

docker cp 2022/ rspamd:/tmp/spam

Train on these samples

docker exec -it rspamd sh -c 'rspamc -P tuxguard learn_spam /tmp/spam/*'

After this, run rspamc stat again and confirm that the number of learned samples has gone up.
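A public archive can contain far more files than a hand-collected folder, so it may be more comfortable to feed them to rspamc in batches. The sketch below is meant for a shell inside the container (for example via docker exec -it rspamd sh). `learn_in_batches`, `RSPAMC`, and `BATCH_SIZE` are hypothetical names introduced for this example, not rspamc options, and the batch size of 500 is an arbitrary choice.

```shell
# Learn every regular file below a directory, a fixed number of files per rspamc call.
# RSPAMC and BATCH_SIZE are knobs of this sketch only; override them via the environment.
learn_in_batches() {
    cmd="$1"    # learn_spam or learn_ham
    dir="$2"    # directory that is searched recursively
    # ${RSPAMC:-...} is left unquoted on purpose so that the default splits into words.
    find "$dir" -type f -print0 |
        xargs -0 -n "${BATCH_SIZE:-500}" ${RSPAMC:-rspamc -P tuxguard} "$cmd"
}

# Usage inside the container:
#   learn_in_batches learn_spam /tmp/spam
```

Because find descends into subdirectories, this also picks up messages that the archive stores in nested folders.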

Once the classifier is active, Rspamd reports the Bayes symbol (BAYES_SPAM or BAYES_HAM) in the log:

[Screenshot: bayes_training_1.png]

[Screenshot: bayes_training_2.png]