Monday, February 9, 2009

CRM114 Spam Filtering Evaluation


Basics


What is CRM114?

CRM114 is a scripting (JIT-compiled) language created for the specific purpose of classifying (text) data. In other words, CRM114 programs usually solve the problem: "Given categories X, Y, Z, which category does this piece of data D belong to?". It just so happens that this question matches the problem of email spam filtering.

The place on the web where you can learn more about CRM114 is the official site and wiki. Another important piece of documentation is the CRM114 Revealed book, which, among other things, covers the classifiers that are readily available in the language. To read about the classifiers jump straight to page 159.

What is the purpose of this evaluation?

The purpose is to get an empirical idea of how fast and accurate the classifiers provided by CRM114 are, compared to the quite popular SpamAssassin filter. Another point on the agenda is estimating the slowdown imposed by running CRM114 code via a Python wrapper.

Note that none of the data presented here pretends to be statistically precise. It is a rough estimate, which you can use as a basis for deciding whether to try CRM114 for yourself or not.


Test data and environment used for the evaluation

The bulk of the data came from the 2005 TREC Public Spam Corpus. I used 1000 spam and 1000 ham messages from this corpus to train the different CRM114 classifiers. Before any training occurred, 405 ham messages and 561 spam messages were taken out of the public corpus to be used as test material. I will refer to the 405 test ham messages as Pubham, and to the 561 spam messages as Pubspam (these are also the names of the folders on my system where the messages got stored).

The public spam corpus provided by TREC is a great resource; however, it is quite old. So, I hand-picked 100 spam and 100 ham messages from my personal mailbox, so that I could test the filters with some "modern" emails. While the 100 spams are all recent (all received in January 2009), the 100 hams range from 2004 to 2009. I will refer to these two groups of messages as Privham and Privspam (these names also match the corresponding folders on my system). Privham is the only group that contains large attachments (1x9M, 1x3M, and 1x1.7M), so any significant timing variations here are due to the file sizes.

All tests were executed on a pretty weak 660 MHz Pentium 3 machine running FreeBSD. I used CRM114 version 20080326-BlameSentansoken with TRE 0.7.5 as the regexp engine, and SpamAssassin version 3.2.5 with the rules updated from updates.spamassassin.org. Both were installed from the FreeBSD ports collection with pretty much the default settings. SpamAssassin tests were done with SpamAssassin running as a daemon; messages were fed with spamc.


Test methodology


  1. I used my own version of the crm.py Python wrapper written by Sam Deane at Elegant Chaos to abstract the different CRM operations, and recorded the slowdown caused by this abstraction. An estimate of the time overhead introduced by calling everything via Python can be found in the first part of the LEARN stats post.

  2. The classifiers described in the CRM book were then used with their recommended flags, and they were fed 1000 spam and 1000 ham messages from the TREC Public Spam Corpus. Stats about the LEARN performance can be found in the second part of the LEARN stats post.

  3. The Privspam/Privham and Pubspam/Pubham pairs were then fed to the classifiers via a small Python script, and accuracy and speed numbers were calculated (a minimal sketch of such a loop appears right after this list). You can find the numbers in the CLASSIFY stats post.

  4. SpamAssassin was tested against the same Privspam/Privham/Pubspam/Pubham messages. I had to fiddle a bit to get easy-to-compare accuracy numbers, as described in the SpamAssassin stats post, but recording stats is a fiddly business anyway.

  5. Celebrate! (An often overlooked part of each test methodology)
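For illustration, here is a minimal sketch of what the classification loop from step 3 could look like. The classify_message() argument is just an assumed stand-in for whatever the CRM114 wrapper exposes - it is not the actual morecrm.py API:

#!/usr/bin/env python
# Sketch only: feed every file in a folder to a classifier callback and
# report speed and accuracy, similar to what mass_class.py prints.
import os
import time

def classify_folder(folder, expected_category, classify_message):
    files = sorted(os.listdir(folder))
    misses = 0
    start = time.time()
    for name in files:
        with open(os.path.join(folder, name), 'rb') as f:
            text = f.read()
        if classify_message(text) != expected_category:
            misses += 1
    elapsed = time.time() - start
    total = len(files)
    print("Samples: %d  T: %.2f sec  Avg T/s: %.2f  Accuracy: %.2f (%d misses)"
          % (total, elapsed, elapsed / total, (total - misses) / float(total), misses))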


An important note is that CRM114 is put at a disadvantage in this scenario in at least two ways.

First, SpamAssassin uses the latest rules from spamassassin.org, while CRM114 relies on learning from data which is at least 4 years old. This is very visible in the CRM114 results with the Privspam/Privham messages. The reason for having this disadvantage on board is that I simply have no contemporary, sorted bulk data to train CRM114 with.

Second, there was no on-error training implemented with CRM114. The idea behind the different kinds of such training is that you provide feedback to the CRM114 classifier, telling it that it was wrong about some message. You can read more about the different feedback training methods starting on page 156 of the CRM114 Revealed book. This kind of training is expected to significantly improve the accuracy of the CRM114 classifiers; however, it goes beyond the scope of this evaluation.
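Just to illustrate the feedback idea (it was not used in these tests), a rough train-on-error loop could look like the sketch below. The classify_message() and learn_message() calls are assumed stand-ins, not the real wrapper functions:

# Train on error: only when the classifier gets a message wrong is it fed
# back with a LEARN into the correct category.
def train_on_error(messages, classify_message, learn_message):
    # `messages` is a list of (text, true_category) pairs
    errors = 0
    for text, true_category in messages:
        if classify_message(text) != true_category:
            errors += 1
            learn_message(text, true_category)  # reinforce the right answer
    return errors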

Raw test output and Python scripts



You can take a look at the raw output I got from the different scripts I used for the evaluation. The output is presented in separate blog posts, as there is quite a lot of it:

  • The LEARN stats post contains information about the speed of the LEARN command in CRM114, when used with the different classifiers and when CRM is invoked via Python for each file. You will also find there a brief speed comparison of the different "modes" you could have CRM114 running in.

  • The CLASSIFY stats post lists the speed and accuracy numbers of the different classifiers. The output should be pretty clear.

  • The SpamAssassin stats post describes the speed and accuracy demonstrated by SpamAssassin when running against the same test messages.


Here are the links to the Python scripts used in the evaluation:
  • morecrm.py - this is a modified version of the original crm.py wrapper/library written by Sam Deane at Elegant Chaos. It has some additional features, and makes it easier to work with multiple classifiers.

  • learndir.py - this is a command-line script which takes all files from a given folder and feeds them (with the LEARN command) to a CRM114 classifier under a preset category. It has a range of command-line arguments that you can see in its usage message (e.g. by running it without any arguments).

  • mass_class.py - mass classification script. It takes no command-line arguments - all configuration is done by editing the settings in the Python code. It goes over a list of classifiers and measures their speed and accuracy when classifying files from different directories.



Conclusions




Speed


While I measured the LEARN performance of all classifiers described in the CRM book, I removed 3 of them from the actual CLASSIFY test - 'correlate', 'winnow' and 'entropy' - because they were too slow, required on-error feedback to be sensible at all, or had problems running. These are somewhat "exotic" classifiers, and they are not expected to show particularly good results when sorting spam.

In all tests, I ran CRM114 through a Python wrapper, which is basically its slowest mode. Why Python? Well, Python gives some pretty nice extras, such as a trivial way to "clusterize" the bulk of the processing (e.g. with the Parallel Python lib). The other reason is that Python is far more convenient for me than writing crm code directly.
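As a sketch of the "clusterize" idea - using the standard multiprocessing module instead of Parallel Python, and with classify_file() as a dummy stand-in for a real wrapper call - parallel classification of a folder could look like this:

import os
from multiprocessing import Pool

def classify_file(path):
    # Dummy placeholder so the sketch runs; swap in the real CRM114 wrapper call.
    with open(path, 'rb') as f:
        data = f.read()
    return 'spam' if b'viagra' in data.lower() else 'ham'

def classify_folder_parallel(folder, workers=4):
    paths = [os.path.join(folder, name) for name in sorted(os.listdir(folder))]
    pool = Pool(processes=workers)   # classify several messages concurrently
    try:
        results = pool.map(classify_file, paths)
    finally:
        pool.close()
        pool.join()
    return list(zip(paths, results))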

Even in this mode of operation, classifying a single message with CRM114 took around 0.1 - 0.3 seconds on average. SpamAssassin, on the other hand, was running in its most efficient mode (with spamd/spamc and compiled rules), and it took about 0.8 - 1.0 seconds per message, increasing to about 1.2 seconds for the Privham group (which is the only one that contains larger attachments). So, SpamAssassin is roughly 3 to 9 times slower than any of the tested CRM114 classifiers, and its performance seems strongly impacted by larger files.

Brief CLASSIFY speed stats. The first number is the average time per message (in seconds) for the Pubspam/Pubham/Privspam groups; I am showing a single value for these three groups, as the time is basically the same for all of them. The second number is the average time per message for the Privham folder.

Filter and flags                 Pubspam/Pubham/Privspam   Privham
OSB Unigram (Bayesian)           0.13                      0.17
OSBF Unique Microgroom           0.14                      0.17
OSB Unique Microgroom            0.16                      0.22
Hyperspace Unique                0.23                      0.28
Hyperspace                       0.24                      0.29
Markovian Microgroom (default)   0.27                      0.56
SpamAssassin                     0.87                      1.16


There is a consistent, noticeable slowdown with the Markovian classifier when running on the Privham data (0.56 sec per message vs. about 0.27 elsewhere). I have not investigated what is causing it.


Accuracy

To my personal surprise, SpamAssassin was pretty amazing when scoring ham messages (both from Privham and Pubham). I really did not see that coming, but with the test data, setting a spam score threshold of 7 would result in 100% ham recognition. Even with a threshold of 5, the ham recognition would be 96-99%.

On the other hand, SA showed poor accuracy when running on the Pubspam folder. Feeding SA with my handpicked selection of spam was even worse - more than 50% of the spam messages had a spam score of 8 or less, and 30% had a score below 5.

The CRM classifiers showed a different pattern. Almost all of the filters showed very good results with the Pubham/Pubspam batches. These messages come from the same corpus as the messages used to train the classifiers, so this part of the results is closer to what a properly trained real-world setup would show. First came the 'OSBF' classifier (running with the Unique and Microgroom flags), which scored 100% accuracy on the Pubham samples and 96% accuracy on the Pubspam samples. This is quite good for a filter trained with just 1000 ham and 1000 spam samples. All other filters, except for the Hyperspace ones, showed pretty good results too - 94% or better ham recognition, and more than 85% spam recognition.

The Hyperspace classifier (with and without the "unique" flag) yielded some strange scores: very poor ham recognition (64%) and perfect spam recognition (100%). It is possible that this is due to the lack of on-error training, as this filter is supposed to be as good as OSBF.

The accuracy of the CRM classifiers was a whole different matter when I tried them on my personal spam/ham. Again, this was somewhat expected - spam evolves fast, and the classifiers were never trained with spam messages like the ones I asked them to recognize. I'd expect the ham recognition rate to be higher, but most likely the weak result is due to the fact that a lot of my private hams use Cyrillic characters, and I doubt that there were many such legitimate messages in the learning samples.

The Hyperspace classifier is once again "against the grain". It shows a dismal ham recognition - 18%/25%, but pretty decent spam recognition - 88%/84%.

Here are the final accuracy stats for the different message groups:


Filter and flags                 Pubham   Pubspam   Privham   Privspam
OSBF Unique Microgroom           100%     96%       73%       64%
SA (threshold 6)                 100%     86%       98%       67%
OSB Unigram (Bayesian)           98%      86%       88%       48%
OSB Unique Microgroom            96%      88%       73%       61%
Markovian Microgroom (default)   94%      97%       68%       67%
Hyperspace                       64%      100%      18%       88%
Hyperspace unique                64%      100%      25%       84%




Final words


The posted numbers suggest that currently, the best anti-spam classifier offered by CRM114 is OSBF. It demonstrated excellent classification speed and very good filtering accuracy.

It is clear that all CRM114 classifiers beat SpamAssassin in terms of speed. So, if you are looking for lightweight spam filtering, and especially if you are looking for something to substitute SpamAssassin with, give CRM114 a go.

I cannot honestly conclude that the CRM114 classifiers are superior to SpamAssassin in every respect - SA showed top results when filtering ham messages, and it was unmatched on the Privham accuracy test. Still, the excellent Pubham/Pubspam CRM114 accuracy numbers suggest that a properly trained OSBF classifier would show at least the same level of ham recognition as SA, and a much better spam filtering rate. This comes along with a more lightweight implementation and "automatic" improvement with each trained message, so it's no wonder I'm heavily biased towards CRM114.

Wednesday, February 4, 2009

SpamAssassin stats

(this document is a part of a larger spam filtering evaluation)
Time Stats

SpamAssassin was executed with compiled rules and default settings on FreeBSD. No Bayesian filtering, auto-whitelisting or blacklisting was used. The following command was executed for each of the test message folders (pubham/, pubspam/, privham/ and privspam/). It feeds all files inside the folder to spamc and records each message's score in the output file.
$time for i in *; do spamc -c < $i >> ./spamass/pubham.sa.txt ; done

real 5m26.087s
user 0m2.247s
sys 0m4.061s


$time for i in *; do spamc -c < $i >> ./spamass/pubspam.sa.txt ; done

real 7m59.680s
user 0m3.436s
sys 0m5.371s


$time for i in *; do spamc -c < "$i" >> ./spamass/privspam.sa.txt ; done

real 1m37.555s
user 0m0.601s
sys 0m0.937s


$time for i in *; do spamc -c < "$i" >> ./spamass/privham.sa.txt ; done

real 1m56.154s
user 0m0.480s
sys 0m1.110s

Each message generated a single stat score in the output file, so we can easily check how many messages were processed:
$wc -l *
100 privham.sa.txt
100 privspam.sa.txt
405 pubham.sa.txt
561 pubspam.sa.txt
So here are the computed (real) time stats per message for each of the groups:
Pubham: 0.80 sec per message
Pubspam: 0.85 sec per message
Privspam: 0.97 sec per message
Privham: 1.16 sec per message
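Just to make the arithmetic explicit: the figures above are simply the `real` wall-clock times divided by the message counts from `wc -l` (the list truncates to two decimals):

# real time (in seconds) divided by the number of processed messages
groups = {'Pubham': (5 * 60 + 26.087, 405), 'Pubspam': (7 * 60 + 59.680, 561),
          'Privspam': (1 * 60 + 37.555, 100), 'Privham': (1 * 60 + 56.154, 100)}
for name, (seconds, count) in sorted(groups.items()):
    print("%-9s %.3f sec per message" % (name, seconds / count))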

This is about 3-10 times slower than running the different CRM classifiers via Python (which means it is even slower when compared to a "pure" CRM program).

Accuracy stats

The output files I got after running the time stats scripts contain Spam Assassin scores like:
...
8.4/5.0
6.3/5.0
-6.0/5.0
-5.6/5.0
0.9/5.0
...

I'm not really interested in the default threshold value (the '/5.0' part). I want to see what the accuracy would be with different thresholds between 5 and 10. So, a small awk script is in order to group the scores into threshold buckets (a Python equivalent is sketched after the awk output below).

$cat spamass/pubham.sa.txt | awk -F '/' '{if (int($1)<5){score["subfive"]++}; if (int($1)>=10) {score["tenplus"]++} if (int($1)>=5 && int($1) < 10) {score[int($1)]++} } END {for (s in score) {print "Score " s " : " score[s]} }'
Score 5 : 2
Score subfive : 403

$cat spamass/pubspam.sa.txt | awk -F '/' '{if (int($1)<5){score["subfive"]++}; if (int($1)>=10) {score["tenplus"]++} if (int($1)>=5 && int($1) < 10) {score[int($1)]++} } END {for (s in score) {print "Score " s " : " score[s]} }'
Score 5 : 15
Score 6 : 29
Score 7 : 24
Score 8 : 8
Score 9 : 39
Score tenplus : 400
Score subfive : 46

$cat spamass/privham.sa.txt | awk -F '/' '{if (int($1)<5){score["subfive"]++}; if (int($1)>=10) {score["tenplus"]++} if (int($1)>=5 && int($1) < 10) {score[int($1)]++} } END {for (s in score) {print "Score " s " : " score[s]} }'
Score 5 : 2
Score 6 : 2
Score subfive : 96

$cat spamass/privspam.sa.txt | awk -F '/' '{if (int($1)<5){score["subfive"]++}; if (int($1)>=10) {score["tenplus"]++} if (int($1)>=5 && int($1) < 10) {score[int($1)]++} } END {for (s in score) {print "Score " s " : " score[s]} }'
Score 5 : 3
Score 6 : 10
Score 7 : 10
Score 8 : 6
Score 9 : 4
Score tenplus : 37
Score subfive : 30
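The same bucketing can be written as a short Python snippet - this is just a readability equivalent of the awk one-liner above, not one of the original evaluation scripts:

from collections import defaultdict

def bucket_scores(path):
    # Bucket SpamAssassin scores into "subfive", integer scores 5-9, and "tenplus".
    buckets = defaultdict(int)
    with open(path) as f:
        for line in f:
            score = int(float(line.split('/')[0]))   # "8.4/5.0" -> 8
            if score < 5:
                buckets['subfive'] += 1
            elif score >= 10:
                buckets['tenplus'] += 1
            else:
                buckets[str(score)] += 1
    return dict(buckets)

for name in ('pubham', 'pubspam', 'privham', 'privspam'):
    print(name, bucket_scores('spamass/%s.sa.txt' % name))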


OK, by this point I got sick of scripts. Some manual calculations to relate the number of messages to the accuracy at the corresponding thresholds:

Pubham:
Score subfive : 403
Threshold 5 (all messages with higher score are blocked): 2 hams blocked (Accuracy 99.9%)
Threshold 6 and more: 0 hams blocked (Accuracy 100%)

Pubspam:
Score subfive : 46 (spams with score below 5)
Threshold 5 ( spams with lower score are let in): 46 spams missed (Accuracy 92%)
Threshold 6 : 61 spams missed (Accuracy 86%)
Threshold 7 : 90 spams missed (Accuracy 80%)
Threshold 8 : 114 spams missed (Accuracy 75%)
Threshold 9 : 122 spams missed (Accuracy 73%)
Threshold 10: 161 spams missed (Accuracy 65%)
Score above 10: 400 spams


Privham:
Score subfive : 96
Threshold 5 (messages with higher score are blocked): 4 hams blocked (Accuracy 96%)
Threshold 6 : 2 hams blocked (Accuracy 98%)
Threshold 7 and more: 0 hams blocked (Accuracy 100%)

Privspam:
Score subfive : 30
Threshold 5 (spams with lower score are let in): 30 spams missed (Accuracy 70%)
Threshold 6 : 33 spams missed (Accuracy 67%)
Threshold 7 : 43 spams missed (Accuracy 57%)
Threshold 8 : 53 spams missed (Accuracy 47%)
Threshold 9 : 59 spams missed (Accuracy 41%)
Threshold 10: 63 spams missed (Accuracy 37%)
Score above 10: 37 spams

Pretty awesome ham recognition and quite weak spam filtering within the tested spam score levels (5-10). According to the SpamAssassin docs, setting the threshold to 5 is pretty aggressive. However, this does not seem to be the case according to the tests above.

CRM114 Classify stats

(this document is a part of a larger spam filtering evaluation)

Note: I had to increase the default window size of crm (with the -w option) to be able to run through some big emails with attachments.

Classifier stats on Spam/Ham messages taken from the Pubham/Pubspam archive:
$./mass_class.py
Classifier: 'microgroom' Cat: ham T: 112.16 sec Samples: 405 Avg T/s: 0.28 Accuracy: 0.94 (24 misses)
Classifier: 'microgroom' Cat: spam T: 141.40 sec Samples: 561 Avg T/s: 0.25 Accuracy: 0.97 (17 misses)
Classifier: 'osb unigram' Cat: ham T: 54.26 sec Samples: 405 Avg T/s: 0.13 Accuracy: 0.98 (7 misses)
Classifier: 'osb unigram' Cat: spam T: 74.24 sec Samples: 561 Avg T/s: 0.13 Accuracy: 0.86 (76 misses)
Classifier: 'osb unique microgroom' Cat: ham T: 62.81 sec Samples: 405 Avg T/s: 0.16 Accuracy: 0.96 (16 misses)
Classifier: 'osb unique microgroom' Cat: spam T: 84.86 sec Samples: 561 Avg T/s: 0.15 Accuracy: 0.88 (70 misses)
Classifier: 'osbf unique microgroom' Cat: ham T: 55.84 sec Samples: 405 Avg T/s: 0.14 Accuracy: 1.00 (0 misses)
Classifier: 'osbf unique microgroom' Cat: spam T: 75.64 sec Samples: 561 Avg T/s: 0.13 Accuracy: 0.96 (25 misses)
Classifier: 'hyperspace' Cat: ham T: 97.14 sec Samples: 405 Avg T/s: 0.24 Accuracy: 0.64 (146 misses)
Classifier: 'hyperspace' Cat: spam T: 133.04 sec Samples: 561 Avg T/s: 0.24 Accuracy: 1.00 (0 misses)
Classifier: 'hyperspace unique' Cat: ham T: 93.08 sec Samples: 405 Avg T/s: 0.23 Accuracy: 0.64 (145 misses)
Classifier: 'hyperspace unique' Cat: spam T: 126.90 sec Samples: 561 Avg T/s: 0.23 Accuracy: 1.00 (0 misses)
Tests with 100 Spam and 100 Ham messages taken from my private mailbox (Privham/Privspam):
./mass_class.py
Classifier: 'microgroom' Cat: ham T: 55.73 sec Samples: 100 Avg T/s: 0.56 Accuracy: 0.68 (32 misses)
Classifier: 'microgroom' Cat: spam T: 27.67 sec Samples: 100 Avg T/s: 0.28 Accuracy: 0.67 (33 misses)
Classifier: 'osb unigram' Cat: ham T: 17.00 sec Samples: 100 Avg T/s: 0.17 Accuracy: 0.88 (12 misses)
Classifier: 'osb unigram' Cat: spam T: 14.06 sec Samples: 100 Avg T/s: 0.14 Accuracy: 0.48 (52 misses)
Classifier: 'osb unique microgroom' Cat: ham T: 22.05 sec Samples: 100 Avg T/s: 0.22 Accuracy: 0.73 (27 misses)
Classifier: 'osb unique microgroom' Cat: spam T: 16.16 sec Samples: 100 Avg T/s: 0.16 Accuracy: 0.61 (39 misses)
Classifier: 'osbf unique microgroom' Cat: ham T: 17.17 sec Samples: 100 Avg T/s: 0.17 Accuracy: 0.73 (27 misses)
Classifier: 'osbf unique microgroom' Cat: spam T: 14.49 sec Samples: 100 Avg T/s: 0.14 Accuracy: 0.64 (36 misses)
Classifier: 'hyperspace' Cat: ham T: 28.90 sec Samples: 100 Avg T/s: 0.29 Accuracy: 0.18 (82 misses)
Classifier: 'hyperspace' Cat: spam T: 24.56 sec Samples: 100 Avg T/s: 0.25 Accuracy: 0.88 (12 misses)
Classifier: 'hyperspace unique' Cat: ham T: 27.81 sec Samples: 100 Avg T/s: 0.28 Accuracy: 0.25 (75 misses)
Classifier: 'hyperspace unique' Cat: spam T: 23.51 sec Samples: 100 Avg T/s: 0.24 Accuracy: 0.84 (16 misses)

Tuesday, February 3, 2009

CRM114 LEARN speed stats

(this document is a part of a larger spam filtering evaluation)

Starting with the conclusions. Before looking at the absolute times, keep in mind that all commands in this test were run on a pretty weak machine - a 660 MHz Intel Pentium 3 (running FreeBSD). You are bound to get much better times on any contemporary hardware.

Running CRM in a Python wrapper vs. Multiple CRM executions vs. Single CRM Execution

Using CRM114 via Python is quite cool. It really simplifies the process, and unlocks some features, like easily running the actual CRM code on a remote machine. Of course, this all comes at a cost. What I tried to evaluate is basically the slowdown imposed by wrapping the `crm` binary in Python code.

It turns out that with 100 files, the morecrm.py library is currently about 1.5-2 times slower than executing each LEARN command in a separate crm process straight from a shell script (compare the execution times of learndir.py and test.sh below).

I'm guessing that 'multiple crm executions' is the actual mode most real-life CRM114 installations run in. The better (and more complicated) alternative is CRM114 running as a daemon, either looping or spawning (forking). Looping is the fastest possible mode you can have, so I was curious how much slower the Python wrapper is than that fastest mode. To test this, I put all LEARN statements in a single .crm file, so the crm interpreter was started just once. The results showed that this is about 4 times faster than the morecrm.py lib. There was no loop per se, so this was a tiny bit faster than a standard looping implementation in pure CRM114 style.

So, according to my tests:
Pseudo-looping: Fastest option ever
Separate `crm` calls: about 1.5-2 times slower than pseudo-looping
Separate `crm` calls with learndir.py/morecrm.py: roughly 3-4 times slower than pseudo-looping

With crm being as efficient as it is, slowing down 3-4 times because of Python will most likely be a non-issue for most applications. Given the extra simplicity and opportunities offered by Python, I'd say that this is a quite viable deployment scenario. Of course, if you aim for maximum throughput, you will have to go with daemonized crm "all the way down" (tm).
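A rough way to measure the "separate process per message" overhead is to time N one-shot invocations and compare with a single long-lived run (the pseudo-loop above). The sketch below only covers the one-process-per-message side; the crm command line is a placeholder, since the actual .crm script and flags depend on the setup:

import subprocess
import time

CRM_CMD = ['crm', 'learn_spam.crm']   # placeholder command, not the real invocation

def time_one_process_per_message(paths):
    # Start a brand new crm process for every file, like the bash test.sh run.
    start = time.time()
    for path in paths:
        with open(path, 'rb') as f:
            subprocess.call(CRM_CMD, stdin=f)
        # Each call pays the full process startup + data file load cost.
    return time.time() - start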

LEARN speed of the different CRM114 classifiers

This second group of tests aims to evaluate the LEARNing speed of the different classifiers built into CRM114. The times are measured when LEARNing 1000 messages via learndir.py/morecrm.py.


It is worth noting that the ham messages tend to be larger than the spam messages. This is why there is sometimes a significant difference between the spam and ham learn times (the difference persists across multiple executions).

The classifier/flag combinations with their average times are:
Markovian with Microgroom: 60 ms per spam / 120 ms per ham
OSB with Unique Microgroom: 46 ms per spam / 53 ms per ham
Bayesian (a.k.a. OSB Unigram): 35 ms per spam / 37 ms per ham
OSBF with Unique Microgroom: 43 ms per spam / 49 ms per ham
Winnow with Unique Microgroom: 57 ms per spam / 61 ms per ham
Hyperspace: 52 ms per spam / 54 ms per ham
Hyperspace with Unique: 52 ms per spam / 54 ms per ham
Correlate: 31 ms per spam / 31 ms per ham
Entropy with Unique Crosslink: 370 ms per spam / 1 sec per ham (!)


Things to note:
  • The Correlate classifier is the fastest one in this test, closely followed by pure Bayesian
  • The Entropy classifier with Unique and Crosslink is extremely slow - about 10 times slower than the others. Note that this classifier is marked as Experimental in the last edition of CRM114 Revealed
  • The default Markovian classifier shows quite different results with the different groups - its speed seems to vary a lot depending on the size of the input
  • All other classifiers show comparable results in terms of speed - 40-60 ms per message

Caveat (one of them): This test is performed with a separate crm execution (through the Python wrapper) for each message. This means that the corresponding database file is loaded again and again for each file. So, these results are impacted by the sizes of the database files used by each classifier. If crm is running as a daemon (where the database files are loaded just once) you could potentially get drastically different results.

STATS

Running CRM114 LEARN on 100 email messages with learndir.py/morecrm.py

$time ./learndir.py corpus/trec05p-1/spamlinks/ spam -T 'osbf' -c 100
Using classifier string: <osbf>
Fed 100 files to the category spam
real 0m4.153s

user 0m1.910s
sys 0m1.999s

Running the same 100 CRM114 commands ('crm' is invoked for each file) with a bash script:

$time bash test.sh
real 0m2.697s

user 0m1.516s
sys 0m0.979s
Running the INPUT/LEARN commands multiple times within a single .crm script (pseudo-loop mode):

$time crm test.crm
real 0m1.047s

user 0m0.732s
sys 0m0.300s
This is about 4 times faster than with the Python wrapper, and 2.5 times faster than with separate 'crm' calls. I believe this is as fast as CRM114 can process these messages with the 'osbf' classifier on my puny FreeBSD router.

Tests with 1000 messages (just for pseudo-loop and Python modes):

$time ./learndir.py ./corpus/trec05p-1/spamlinks/ spam -T 'osbf microgroom' -c 1000
Using classifier string: <osbf microgroom>
Fed 1000 files to the category spam
real 0m46.710s
user 0m24.403s
sys 0m19.201s

$time crm test1k.crm
real 0m15.940s
user 0m12.504s
sys 0m3.220s

In this test Python took 3 times longer than pseudo-loop mode (down from 4 times with 100 messages). This is likely due to the lower impact of the initial setup overhead of the Python process. Note that I used the 'microgroom' flag here, as otherwise OSBF complains its data file is full.

Stats for learning 1000 messages with the Python wrapper


Default (Markovian) classifier with flag "microgroom":


time ./learndir.py ./corpus/trec05p-1/spamlinks/ spam -d 1k/ -c 1000
Using classifier string: <microgroom>
Fed 1000 files to the category spam

real 1m1.226s
user 0m20.341s
sys 0m37.869s

time ./learndir.py ./corpus/trec05p-1/hamlinks/ ham -d 1k/ -c 1000
Using classifier string: <microgroom>
Fed 1000 files to the category ham

real 2m3.620s
user 1m18.086s
sys 0m38.660s

OSB classifier with flags UNIQUE MICROGROOM:

time ./learndir.py ./corpus/trec05p-1/spamlinks/ spam -d 1k/ -c 1000 -T 'osb unique microgroom'
Using classifier string: <osb unique microgroom>
Fed 1000 files to the category spam

real 0m46.528s
user 0m15.807s
sys 0m26.076s

time ./learndir.py ./corpus/trec05p-1/hamlinks/ ham -d 1k/ -c 1000 -T 'osb unique microgroom'
Using classifier string: <osb unique microgroom>
Fed 1000 files to the category ham

real 0m53.929s
user 0m22.984s
sys 0m25.178s

Standard Bayesian classifier (a.k.a. OSB UNIGRAM):

time ./learndir.py ./corpus/trec05p-1/spamlinks/ spam -d 1k/ -c 1000 -T 'osb unigram'
Using classifier string: <osb unigram>
Fed 1000 files to the category spam

real 0m35.699s
user 0m13.195s
sys 0m20.058s

time ./learndir.py ./corpus/trec05p-1/hamlinks/ ham -d 1k/ -c 1000 -T 'osb unigram'
Using classifier string: <osb unigram>
Fed 1000 files to the category ham

real 0m37.241s
user 0m14.743s
sys 0m19.087s

OSBF classifier with flags UNIQUE MICROGROOM:

time ./learndir.py ./corpus/trec05p-1/spamlinks/ spam -d 1k/ -c 1000 -T 'osbf unique microgroom'
Using classifier string: <osbf unique microgroom>
Fed 1000 files to the category spam

real 0m43.295s
user 0m21.408s
sys 0m18.552s

time ./learndir.py ./corpus/trec05p-1/hamlinks/ ham -d 1k/ -c 1000 -T 'osbf unique microgroom'
Using classifier string: <osbf unique microgroom>
Fed 1000 files to the category ham

real 0m49.326s
user 0m28.144s
sys 0m17.561s

WINNOW classifier with flags UNIQUE MICROGROOM:

time ./learndir.py ./corpus/trec05p-1/spamlinks/ spam -d 1k/ -c 1000 -T 'winnow unique microgroom'
Using classifier string: <winnow unique microgroom>
Fed 1000 files to the category spam

real 0m56.936s
user 0m17.787s
sys 0m31.420s

time ./learndir.py ./corpus/trec05p-1/hamlinks/ ham -d 1k/ -c 1000 -T 'winnow unique microgroom'
Using classifier string: <winnow unique microgroom>
Fed 1000 files to the category ham

real 1m0.949s
user 0m21.654s
sys 0m30.937s

HYPERSPACE classifier with no flags:

time ./learndir.py ./corpus/trec05p-1/spamlinks/ spam -d 1k/ -c 1000 -T 'hyperspace'
Using classifier string: <hyperspace>
Fed 1000 files to the category spam

real 0m52.581s
user 0m18.334s
sys 0m21.070s

time ./learndir.py ./corpus/trec05p-1/hamlinks/ ham -d 1k/ -c 1000 -T 'hyperspace'
Using classifier string: <hyperspace>
Fed 1000 files to the category ham

real 0m54.801s
user 0m21.121s
sys 0m19.702s


HYPERSPACE classifier with flag UNIQUE:

time ./learndir.py ./corpus/trec05p-1/spamlinks/ spam -d 1k/ -c 1000 -T 'hyperspace unique'
Using classifier string: <hyperspace unique>
Fed 1000 files to the category spam

real 0m52.495s
user 0m18.014s
sys 0m21.557s

time ./learndir.py ./corpus/trec05p-1/hamlinks/ ham -d 1k/ -c 1000 -T 'hyperspace unique'
Using classifier string: <hyperspace unique>
Fed 1000 files to the category ham

real 0m54.831s
user 0m20.466s
sys 0m20.370s

CORRELATE classifier with no flags:

time ./learndir.py ./corpus/trec05p-1/spamlinks/ spam -d 1k/ -c 1000 -T 'correlate'
Using classifier string: <correlate>
Fed 1000 files to the category spam

real 0m30.776s
user 0m12.605s
sys 0m15.490s

time ./learndir.py ./corpus/trec05p-1/hamlinks/ ham -d 1k/ -c 1000 -T 'correlate'
Using classifier string: <correlate>
Fed 1000 files to the category ham

real 0m31.030s
user 0m12.653s
sys 0m15.406s

ENTROPY classifier with flags UNIQUE CROSSLINK:

time ./learndir.py ./corpus/trec05p-1/spamlinks/ spam -d 1k/ -c 1000 -T 'entropy unique crosslink'
Using classifier string: <entropy unique crosslink>
Fed 1000 files to the category spam

real 6m10.947s
user 4m51.545s
sys 1m5.383s

time ./learndir.py ./corpus/trec05p-1/hamlinks/ ham -d 1k/ -c 1000 -T 'entropy unique crosslink'
Using classifier string: <entropy unique crosslink>
Fed 1000 files to the category ham

real 16m46.975s
user 14m43.170s
sys 1m20.278s

DISK SPACE USAGE WITH THE DEFAULT SLOTCOUNT SETTING


This is the size of the folders corresponding to the different classifiers. Each folder contains exactly two files, so divide the listed size by two to get the size of the storage file.
22M correlate
70M entropy_unique_crosslink
14M hyperspace
13M hyperspace_unique
24M microgroom
12M osb_unigram
12M osb_unique_microgroom
2.2M osbf_unique_microgroom
24M winnow_unique_microgroom