Monday, February 9, 2009

CRM114 Spam Filtering Evaluation


Basics


What is CRM114?

CRM114 is a JIT-compiled scripting language, created for the specific purpose of classifying (text) data. In other words, CRM programs usually solve the problem: "Given categories X, Y, and Z, which category does this piece of data D belong to?" It just so happens that this question matches the problem of email spam filtering.
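To make this concrete, here is a minimal sketch of asking that question from Python by shelling out to the crm binary. The tiny embedded CRM114 program is loosely based on the classify/alius idiom from the CRM114 documentation (classify faults to the alius branch when the first class does not win); the spam.css and good.css statistics files are hypothetical and would have to be built with LEARN first:

    import subprocess

    # A tiny CRM114 program: "is this text closer to spam.css or good.css?"
    CLASSIFY_CRM = r"""{
        classify <osb unique microgroom> ( spam.css | good.css )
        output /spam\n/
    }
    alius
    {
        output /good\n/
    }
    """

    def classify(message_text):
        # Pipe the message into crm on stdin; read the verdict from stdout.
        open("classify.crm", "w").write(CLASSIFY_CRM)
        proc = subprocess.Popen(["crm", "classify.crm"],
                                stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE)
        out, _ = proc.communicate(message_text)
        return out.strip()   # "spam" or "good"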

The place on the web where you can learn more about CRM114 is the official site and wiki. Another important piece of documentation is the CRM114 Revealed book, which, among other things, covers the classifiers that are readily available in the language. To read about the classifiers, jump straight to page 159.

What is the purpose of this evaluation?

The purpose is to get an empirical idea of how fast and accurate the classifiers provided by CRM114 are, compared to the quite popular SpamAssassin filter. Another point on the agenda is estimating the slowdown imposed by running CRM114 code through a Python wrapper.

Note that none of the data presented here claims to be statistically precise. It is a rough estimate, which you can use as a basis for deciding whether to try CRM114 for yourself or not.


Test data and environment used for the evaluation

The bulk of the data came from the 2005 TREC Public Spam Corpus. I used 1000 spam and 1000 ham messages from this corpus to train the different CRM114 classifiers. Before any training occurred, 405 ham messages and 561 spam messages were set aside from the public corpus to be used as test material. I will refer to the 405 test ham messages as Pubham, and to the 561 spam messages as Pubspam (these are also the names of the folders on my system where the messages are stored).

The public spam corpus provided by TREC is a great resource; however, it is quite old. So, I hand-picked 100 spam and 100 ham messages from my personal mailbox, so that I could test the filters with some "modern" emails. While the 100 spams are all recent (all received in January 2009), the 100 hams range from 2004 to 2009. I will refer to these two groups of messages as Privham and Privspam (these names also match the corresponding folders on my system). Privham is the only group that contains large attachments (1x9M, 1x3M, and 1x1.7M), so any significant timing variations there are due to the file sizes.

All tests were executed on a pretty weak 660 MHz Pentium 3 machine running FreeBSD. I used CRM114 version 20080326-BlameSentansoken with TRE 0.7.5 as the regex engine, and SpamAssassin version 3.2.5 with the rules updated from updates.spamassassin.org. Both were installed from the FreeBSD ports collection with pretty much the default settings. The SpamAssassin tests were done with SpamAssassin running as a daemon; messages were fed to it with spamc.


Test methodology


  1. I used my own version of the Python wrapper crm.py, written by Sam Deane at Elegant Chaos, to abstract the different CRM operations, and recorded the slowdown caused by this abstraction. An estimate of the time overhead introduced by calling everything via Python can be found in the first part of the LEARN stats (a stripped-down sketch of such an overhead measurement follows this list).

  2. The classifiers described in the CRM book were then used with their recommended flags, and they were fed 1000 spam and 1000 ham messages from the TREC Public Spam Corpus. Stats about the LEARN performance can be found in the second part of the LEARN stats.

  3. The Privspam/Privham and Pubspam/Pubham pairs were then fed to the classifiers via a small Python script, and accuracy and speed numbers were calculated. You can find the numbers in the CLASSIFY stats.

  4. SpamAssassin was tested against the same Privspam/Privham/Pubspam/Pubham messages. I had to fiddle a bit to get easy-to-compare accuracy numbers, as described in the SpamAssassin stats, but recording stats is a fiddly business anyway.

  5. Celebrate! (An often overlooked part of each test methodology)
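As promised in step 1, here is a rough sketch of how the per-call Python overhead can be estimated: run a do-nothing CRM114 one-liner many times through the same subprocess plumbing the wrapper uses, so all of the measured time is process spawning and piping rather than classification work (the inline "-{...}" form is the documented way to run crm one-liners):

    import subprocess
    import time

    N = 100

    def run_noop():
        # A do-nothing CRM114 program: any time spent here is pure
        # process-spawn and piping overhead, not classification work.
        proc = subprocess.Popen(["crm", "-{ output /ok\\n/ }"],
                                stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE)
        proc.communicate("")

    start = time.time()
    for _ in range(N):
        run_noop()
    print "average overhead per call: %.3fs" % ((time.time() - start) / N)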


An important note to make is that CRM114 is put at a disadvantage in this scenario in at least two ways.

First, SpamAssassin uses the latest rules from spamassassin.org, while CRM114 relies on learning from data that is at least 4 years old. This is very visible in the CRM114 results for the Privspam/Privham messages. The reason for accepting this disadvantage is that I simply have no contemporary, sorted bulk data to train CRM114 with.

Second, no on-error training was implemented for CRM114. The idea behind the different kinds of such training is that you give feedback to the CRM114 classifier, telling it that it was wrong about some message (see the sketch below). You can read more about the different feedback training methods starting on page 156 of the CRM114 Revealed book. This kind of training is expected to significantly improve the accuracy of the CRM114 classifiers; however, it goes beyond the scope of this evaluation.
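For illustration, a minimal train-on-error (TOE) loop could look like this. The classify() and learn() calls stand for whatever wrapper you use (they are not the actual crm.py API); the point is simply that only misclassified messages are fed back with LEARN:

    def train_on_error(classifier, messages):
        # messages is a list of (text, true_category) pairs.
        errors = 0
        for text, true_category in messages:
            guessed = classifier.classify(text)        # assumed wrapper call
            if guessed != true_category:
                # Feedback: LEARN the message into the category
                # it should have landed in.
                classifier.learn(text, true_category)  # assumed wrapper call
                errors += 1
        return errors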

Raw test output and Python scripts



You can take a look at the raw output I got from the different scripts I used for the evaluation. The output is presented in separate blog posts, as there is quite a lot of it:

  • The LEARN stats post contains information about the speed of the LEARN command in CRM114 when used with the different classifiers, and when CRM is invoked via Python for each file. There you will also find a brief speed comparison of the different "modes" you can run CRM114 in.

  • The CLASSIFY stats post lists the speed and accuracy numbers of the different classifiers. The output should be pretty clear.

  • The SpamAssassin stats part describes the speed and accuracy demonstrated by SpamAssassin when running against the same test messages.


Here are the links to the Python scripts used in the evaluation:
  • morecrmp.py - a modified version of the original crm.py wrapper/library written by Sam Deane at Elegant Chaos. It has some additional features, and it makes it easier to work with multiple classifiers.

  • learndir.py - a command-line script that takes all files from a given folder and feeds them (with the LEARN command) to a CRM114 classifier under a preset category (a stripped-down sketch of its core loop follows this list). It has a range of command-line arguments, which you can see in its usage message (e.g. by running it without any arguments).

  • mass_class.py - a mass classification script. It takes no command-line arguments - all configuration is done by editing the settings in the Python code. It goes over a list of classifiers and measures their speed and accuracy when classifying files from different directories.
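For a flavor of what learndir.py does, here is a stripped-down sketch of its core loop (the learn() call again stands for an assumed wrapper method, not the script's actual code):

    import os

    def learn_dir(classifier, dirname, category):
        # Feed every regular file in dirname to LEARN,
        # all under the same preset category.
        count = 0
        for name in os.listdir(dirname):
            path = os.path.join(dirname, name)
            if not os.path.isfile(path):
                continue
            classifier.learn(open(path, "rb").read(), category)
            count += 1
        return count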



Conclusions




Speed


While I measured the LEARN performance of all classifiers described in the CRM book, I removed 3 of them from the actual CLASSIFY test - 'correlate', 'winnow' and 'entropy' (because they were too slow, required on-error feedback to be sensible at all, or had problems running). These are somewhat "exotic" classifiers, and they are not expected to show particularly good results when sorting spam.

In all tests, I ran CRM114 through a Python wrapper, which is basically its slowest mode. Why Python? Well, Python gives some pretty nice extras, like a trivial way to "clusterize" the bulk of the processing (e.g. with the Parallel Python lib); a rough sketch of this follows. The other reason is that Python is far more convenient for me than writing crm code directly.
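As an illustration of the "clusterize" point, here is a sketch using the standard multiprocessing module (Python 2.6+) as a stand-in for Parallel Python. Each worker spawns its own crm process, so there is no shared state to worry about; classify.crm is the hypothetical program sketched earlier:

    import glob
    import multiprocessing
    import subprocess

    def classify_file(path):
        # Each worker pipes one message into its own crm process.
        proc = subprocess.Popen(["crm", "classify.crm"],
                                stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE)
        out, _ = proc.communicate(open(path, "rb").read())
        return path, out.strip()

    if __name__ == "__main__":
        pool = multiprocessing.Pool(processes=2)
        for path, verdict in pool.map(classify_file, glob.glob("Pubspam/*")):
            print path, verdict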

Even in this mode of operation, classifying a single message with CRM114 took around 0.1 - 0.3 seconds on average. SpamAssassin, on the other hand, was running in its most efficient mode (with spamd/spamc and compiled rules), and it took about 0.8 - 0.9 seconds per message, increasing to 1.2 seconds for the Privham group (which is the only one that contains larger attachments). So, going by the averages below, SpamAssassin is roughly 3 to 7 times slower than any of the tested CRM114 classifiers, and its performance seems strongly impacted by larger files.

Brief CLASSIFY speed stats follow. The first number is the average time in seconds per message for the Pubspam/Pubham/Privspam processing; I am showing a single value for these three groups, as the time is basically the same for all of them. The second value is the average time per message for the Privham folder.

Filter and flags                  Pubspam/Pubham/Privspam   Privham
OSB Unigram (Bayesian)            0.13                      0.17
OSBF Unique Microgroom            0.14                      0.17
OSB Unique Microgroom             0.16                      0.22
Hyperspace Unique                 0.23                      0.28
Hyperspace                        0.24                      0.29
Markovian Microgroom (default)    0.27 (0.56 on Pubham)     0.28
SpamAssassin                      0.84                      1.16


There is a consistent, noticeable slowdown with the Markovian classifier when running on the Pubham data (the 0.56 shown in parentheses in the table above). I have not investigated what is causing it.


Accuracy

To my personal surprise, SpamAssassin was pretty amazing at scoring ham messages (both from Privham and Pubham). I really did not see that coming, but with the test data, setting a spam score threshold of 7 would result in 100% ham recognition. Even with a threshold of 5, the ham recognition would be 96-99%.

On the other hand, SA showed poor accuracy when running on the Pubspam folder. Feeding SA my hand-picked selection of spam was even worse - more than 50% of the spam messages had a spam score of 8 or less, and 30% had a score below 5.
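The threshold numbers above boil down to a simple calculation: given the raw SA scores, ham recognition at threshold t is the fraction of hams scoring below t, and spam recognition is the fraction of spams scoring at or above it. A minimal sketch (the score lists in the example are made up):

    def recognition_rates(ham_scores, spam_scores, threshold):
        # A message is called spam when its SA score >= threshold.
        ham_ok = sum(1 for s in ham_scores if s < threshold)
        spam_ok = sum(1 for s in spam_scores if s >= threshold)
        return (100.0 * ham_ok / len(ham_scores),
                100.0 * spam_ok / len(spam_scores))

    # e.g. recognition_rates([0.2, 3.1, 6.4], [4.9, 8.8, 12.0], 5)
    # -> (66.7, 66.7)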

The CRM classifiers showed a different pattern. Almost all of the filters showed very good results with the Pubham/Pubspam batches. These messages are taken from the same corpus as the messages used to train the classifiers, so this part of the results is closer to the real-world application, where a filter is trained on mail similar to what it will later classify. First came the 'OSBF' classifier (running with the Unique and Microgroom flags), which scored 100% accuracy on the Pubham samples and 96% accuracy on the Pubspam samples. This is quite good for a filter trained with just 1000 ham and 1000 spam samples. All other filters, except for the Hyperspace ones, showed pretty good results too - more than 95% ham recognition, and more than 85% spam recognition.

The Hyperspace classifier (with and without the "unique" flag) yielded some strange scores: very poor ham recognition (64%) and perfect spam recognition (100%). It is possible that this is due to the lack of on-error training, as this filter is supposed to be as good as OSBF.

The accuracy of the CRM classifiers was a whole different matter when I tried them on my personal spam/ham. Again, this was somewhat expected - spam evolves fast, and the classifiers were never trained with spam messages like the ones I asked them to recognize. I expected the ham recognition rate to be higher, but most likely the weak result is due to the fact that a lot of my private hams use Cyrillic characters, and I doubt there were many such legitimate messages in the learning samples.

The Hyperspace classifier once again goes against the grain: it shows dismal ham recognition (18%/25%), but pretty decent spam recognition (88%/84%).

Here are the final accuracy stats for the different message groups:


Filter and flags                  Pubham   Pubspam   Privham   Privspam
OSBF Unique Microgroom            100%     96%       73%       64%
SA (threshold 6)                  100%     86%       98%       67%
OSB Unigram (Bayesian)            98%      86%       88%       48%
OSB Unique Microgroom             96%      88%       73%       61%
Markovian Microgroom (default)    94%      97%       68%       67%
Hyperspace                        64%      100%      18%       88%
Hyperspace Unique                 64%      100%      25%       84%




Final words


The posted numbers suggest that currently, the best anti-spam classifier offered by CRM114 is OSBF. It demonstrated excellent classification speed and very good filtering accuracy.

It is clear that all CRM114 classifiers beat SpamAssassin in terms of speed. So, if you are looking for lightweight spam filtering, and especially if you are looking for something to substitute for SpamAssassin, give CRM114 a go.

I cannot honestly conclude that the CRM114 classifiers are superior to SpamAssassin in every respect - SA showed top results when filtering ham messages, and it was unmatched on the Privham accuracy test. Still, the excellent Pubham/Pubspam CRM114 accuracy numbers suggest that a properly trained OSBF classifier would show at least the same level of ham recognition as SA, and a much better spam filtering rate. This comes along with a more lightweight implementation and "automatic" improvement with each trained message, so it's no wonder I'm heavily biased towards CRM114.