Tuesday, February 3, 2009

CRM114 LEARN speed stats

(this document is a part of a larger spam filtering evaluation)

Starting with the conclusions. Before looking at the absolute times, keep in mind that all commands in this test were run on a pretty weak machine - a 660 MHz Intel Pentium 3 running FreeBSD. You are bound to get much better times on any contemporary hardware.

Running CRM in a Python wrapper vs. Multiple CRM executions vs. Single CRM Execution

Using CRM114 via Python is quite cool. It really simplifies the process and unlocks some features, like easily running the actual CRM code on a remote machine. Of course, all this comes at a cost. What I tried to evaluate is basically the slowdown imposed by wrapping the `crm` binary in Python code.

It turns out that with 100 files, the morecrm.py library is currently about 2 times slower than executing each LEARN command in a separate crm process (as seen in the execution times of learndir.py and test.sh).
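The internals of morecrm.py aren't shown here, but the per-message execution mode it wraps boils down to something like the sketch below: one `crm` process per LEARN, with the message fed on stdin. The function names are mine; the program-on-the-command-line form (`crm '-{ ... }'`) is CRM114's, but double-check the statement syntax against your crm version.

```python
import subprocess

def build_learn_cmd(flags, css_file):
    # Build the argv for a single LEARN invocation. CRM114 accepts a
    # program directly on the command line when it starts with '-'.
    return ["crm", f"-{{ learn <{flags}> ({css_file}) }}"]

def learn_file(message_path, flags, css_file):
    # One crm process per message -- this is the mode whose startup
    # overhead the timings in this post are measuring.
    with open(message_path, "rb") as msg:
        subprocess.run(build_learn_cmd(flags, css_file),
                       stdin=msg, check=True)
```

Every call to learn_file pays the full cost of starting a crm process and loading the .css database, which is exactly why the single-process modes below win.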

I'm guessing that 'multiple crm executions' is the actual mode most real-life CRM114 installations run in. The better (and more complicated) alternative is CRM114 running as a daemon, either looping or spawning (forking). Looping is the fastest possible mode you can have, so I was curious how much slower the Python wrapper was than the fastest mode. To test this I put all LEARN statements in a single .crm file, so the crm interpreter was started just once. The results showed that this is about 4 times faster than the morecrm.py lib. There was no loop per se, so this was slightly faster than a standard looping implementation in pure CRM114 style.
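Such a single-run .crm file is easy to generate from Python. Below is a sketch of a generator (the helper name is mine, and the exact CRM114 statement syntax - input redirected from a [file], learn restricted to a [:var:] - is an assumption from my reading of the docs, not a dump of my actual test.crm; verify it against your crm version):

```python
import os

def write_pseudo_loop(msg_dir, flags, css_file, out_path):
    # Emit one input/learn pair per message, so a single crm process
    # LEARNs every file with no per-message startup cost.
    # NOTE: the CRM114 statement forms used here are assumed, not
    # copied from the original test.crm -- check them before use.
    with open(out_path, "w") as crm:
        for name in sorted(os.listdir(msg_dir)):
            path = os.path.join(msg_dir, name)
            crm.write(f"input (:msg:) [{path}]\n")
            crm.write(f"learn <{flags}> ({css_file}) [:msg:]\n")
```

Running the generated file is then a single `crm test.crm` invocation, so the interpreter startup and database load happen exactly once.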

So, according to my tests:
  • Pseudo-looping: the fastest option
  • Separate `crm` calls: about 1.5-2 times slower than pseudo-looping
  • Separate `crm` calls with learndir.py/morecrm.py: roughly 3-4 times slower than pseudo-looping

With crm being as efficient as it is, a 3-4x slowdown due to Python will most likely be a non-issue for most applications. Given the extra simplicity and opportunities Python offers, I'd say this is a quite viable deployment scenario. Of course, if you aim for maximum throughput, you will have to go with daemonized crm "all the way down" (tm).

LEARN speed of the different CRM114 classifiers

This second group of tests aims to evaluate the LEARN speed of the different classifiers built into CRM114. The times are measured while LEARNing 1000 messages via learndir.py/morecrm.py.


It is worth noting that the ham messages tend to be larger than the spam messages. This is why there is sometimes a significant difference between the spam and ham learn times (the difference persists across multiple executions).

The classifier/flag combinations with their average times are:
  • Markovian with Microgroom: 60 ms per spam / 120 ms per ham
  • OSB with Unique Microgroom: 46 ms per spam / 53 ms per ham
  • Bayesian (a.k.a. OSB Unigram): 35 ms per spam / 37 ms per ham
  • OSBF with Unique Microgroom: 43 ms per spam / 49 ms per ham
  • Winnow with Unique Microgroom: 57 ms per spam / 61 ms per ham
  • Hyperspace: 52 ms per spam / 54 ms per ham
  • Hyperspace with Unique: 52 ms per spam / 54 ms per ham
  • Correlate: 31 ms per spam / 31 ms per ham
  • Entropy with Unique Crosslink: 370 ms per spam / 1 sec per ham!


Things to note:
  • The Correlate classifier is the fastest one in this test, closely followed by pure Bayesian
  • The Entropy classifier with Unique and Crosslink is extremely slow - about 10 times slower than the others. Note that this classifier is marked as Experimental in the last edition of CRM114 Revealed
  • The default Markovian classifier shows quite different results with the different groups - its speed seems to vary a lot depending on the size of the input
  • All other classifiers show comparable results in terms of speed - 40-60 ms per message
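The per-message averages above are simply the wall-clock times from the stats section divided by the message count. A quick helper (the function name is mine) that does the conversion from `time` output:

```python
import re

def ms_per_message(real_time, n_messages):
    # Convert a `time` "real" value like "0m46.528s" into an average
    # per-message cost in milliseconds.
    m = re.fullmatch(r"(\d+)m([\d.]+)s", real_time)
    minutes, seconds = int(m.group(1)), float(m.group(2))
    return (minutes * 60 + seconds) * 1000 / n_messages

# OSB with unique microgroom, 1000 spam messages:
print(f"{ms_per_message('0m46.528s', 1000):.1f} ms")  # -> 46.5 ms
```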

Caveat (one of them): this test is performed with a separate crm execution (through the Python wrapper) for each message, which means the corresponding database file is loaded again and again for every file. So these results are affected by the size of the database file each classifier uses. If crm runs as a daemon (where the database files are loaded just once), you could get drastically different results.

STATS

Running CRM114 LEARN on 100 email messages with learndir.py/morecrm.py

$time ./learndir.py corpus/trec05p-1/spamlinks/ spam -T 'osbf' -c 100
Using classifier string: <osbf>
Fed 100 files to the category spam
real 0m4.153s

user 0m1.910s
sys 0m1.999s

Running the same 100 CRM114 commands ('crm' is invoked for each file) with a bash script:

$time bash test.sh
real 0m2.697s

user 0m1.516s
sys 0m0.979s
Running the INPUT/LEARN commands multiple times within a single .crm script (pseudo-loop mode):

$time crm test.crm
real 0m1.047s

user 0m0.732s
sys 0m0.300s
This is about 4 times faster than with the Python wrapper, and 2.5 times faster than with separate 'crm' calls. I believe this is as fast as CRM114 can process these messages with the 'osbf' classifier on my puny FreeBSD router.
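As a sanity check, those ratios fall straight out of the `real` times of the three runs:

```python
# Real times of the three 100-message runs, in seconds.
wrapper, separate, pseudo_loop = 4.153, 2.697, 1.047

print(f"wrapper vs pseudo-loop:  {wrapper / pseudo_loop:.1f}x")   # -> 4.0x
print(f"separate vs pseudo-loop: {separate / pseudo_loop:.1f}x")  # -> 2.6x
```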

Tests with 1000 messages (just for pseudo-loop and Python modes):

$time ./learndir.py ./corpus/trec05p-1/spamlinks/ spam -T 'osbf microgroom' -c 1000
Using classifier string: <osbf microgroom>
Fed 1000 files to the category spam
real 0m46.710s
user 0m24.403s
sys 0m19.201s

$time crm test1k.crm
real 0m15.940s
user 0m12.504s
sys 0m3.220s

In this test Python took 3 times longer than pseudo-loop mode (down from 4 times with 100 messages), likely because the Python process's startup overhead matters less over a longer run. Note that I used the 'microgroom' flag here, as otherwise OSBF complains that its data file is full.

Stats for learning 1000 messages with the Python wrapper


Default (Markovian) classifier with flag "microgroom":


time ./learndir.py ./corpus/trec05p-1/spamlinks/ spam -d 1k/ -c 1000
Using classifier string: <microgroom>
Fed 1000 files to the category spam

real 1m1.226s
user 0m20.341s
sys 0m37.869s

time ./learndir.py ./corpus/trec05p-1/hamlinks/ ham -d 1k/ -c 1000
Using classifier string: <microgroom>
Fed 1000 files to the category ham

real 2m3.620s
user 1m18.086s
sys 0m38.660s

OSB classifier with flags UNIQUE MICROGROOM:

time ./learndir.py ./corpus/trec05p-1/spamlinks/ spam -d 1k/ -c 1000 -T 'osb unique microgroom'
Using classifier string: <osb unique microgroom>
Fed 1000 files to the category spam

real 0m46.528s
user 0m15.807s
sys 0m26.076s

time ./learndir.py ./corpus/trec05p-1/hamlinks/ ham -d 1k/ -c 1000 -T 'osb unique microgroom'
Using classifier string: <osb unique microgroom>
Fed 1000 files to the category ham

real 0m53.929s
user 0m22.984s
sys 0m25.178s

Standard Bayesian classifier (a.k.a. OSB UNIGRAM):

time ./learndir.py ./corpus/trec05p-1/spamlinks/ spam -d 1k/ -c 1000 -T 'osb unigram'
Using classifier string: <osb unigram>
Fed 1000 files to the category spam

real 0m35.699s
user 0m13.195s
sys 0m20.058s

time ./learndir.py ./corpus/trec05p-1/hamlinks/ ham -d 1k/ -c 1000 -T 'osb unigram'
Using classifier string: <osb unigram>
Fed 1000 files to the category ham

real 0m37.241s
user 0m14.743s
sys 0m19.087s

OSBF classifier with flags UNIQUE MICROGROOM:

time ./learndir.py ./corpus/trec05p-1/spamlinks/ spam -d 1k/ -c 1000 -T 'osbf unique microgroom'
Using classifier string: <osbf unique microgroom>
Fed 1000 files to the category spam

real 0m43.295s
user 0m21.408s
sys 0m18.552s

time ./learndir.py ./corpus/trec05p-1/hamlinks/ ham -d 1k/ -c 1000 -T 'osbf unique microgroom'
Using classifier string: <osbf unique microgroom>
Fed 1000 files to the category ham

real 0m49.326s
user 0m28.144s
sys 0m17.561s

WINNOW classifier with flags UNIQUE MICROGROOM:

time ./learndir.py ./corpus/trec05p-1/spamlinks/ spam -d 1k/ -c 1000 -T 'winnow unique microgroom'
Using classifier string: <winnow unique microgroom>
Fed 1000 files to the category spam

real 0m56.936s
user 0m17.787s
sys 0m31.420s

time ./learndir.py ./corpus/trec05p-1/hamlinks/ ham -d 1k/ -c 1000 -T 'winnow unique microgroom'
Using classifier string: <winnow unique microgroom>
Fed 1000 files to the category ham

real 1m0.949s
user 0m21.654s
sys 0m30.937s

HYPERSPACE classifier with no flags:

time ./learndir.py ./corpus/trec05p-1/spamlinks/ spam -d 1k/ -c 1000 -T 'hyperspace'
Using classifier string: <hyperspace>
Fed 1000 files to the category spam

real 0m52.581s
user 0m18.334s
sys 0m21.070s

time ./learndir.py ./corpus/trec05p-1/hamlinks/ ham -d 1k/ -c 1000 -T 'hyperspace'
Using classifier string: <hyperspace>
Fed 1000 files to the category ham

real 0m54.801s
user 0m21.121s
sys 0m19.702s


HYPERSPACE classifier with flag UNIQUE:

time ./learndir.py ./corpus/trec05p-1/spamlinks/ spam -d 1k/ -c 1000 -T 'hyperspace unique'
Using classifier string: <hyperspace unique>
Fed 1000 files to the category spam

real 0m52.495s
user 0m18.014s
sys 0m21.557s

time ./learndir.py ./corpus/trec05p-1/hamlinks/ ham -d 1k/ -c 1000 -T 'hyperspace unique'
Using classifier string: <hyperspace unique>
Fed 1000 files to the category ham

real 0m54.831s
user 0m20.466s
sys 0m20.370s

CORRELATE classifier with no flags:

time ./learndir.py ./corpus/trec05p-1/spamlinks/ spam -d 1k/ -c 1000 -T 'correlate'
Using classifier string: <correlate>
Fed 1000 files to the category spam

real 0m30.776s
user 0m12.605s
sys 0m15.490s

time ./learndir.py ./corpus/trec05p-1/hamlinks/ ham -d 1k/ -c 1000 -T 'correlate'
Using classifier string: <correlate>
Fed 1000 files to the category ham

real 0m31.030s
user 0m12.653s
sys 0m15.406s

ENTROPY classifier with flags UNIQUE CROSSLINK:

time ./learndir.py ./corpus/trec05p-1/spamlinks/ spam -d 1k/ -c 1000 -T 'entropy unique crosslink'
Using classifier string: <entropy unique crosslink>
Fed 1000 files to the category spam

real 6m10.947s
user 4m51.545s
sys 1m5.383s

time ./learndir.py ./corpus/trec05p-1/hamlinks/ ham -d 1k/ -c 1000 -T 'entropy unique crosslink'
Using classifier string: <entropy unique crosslink>
Fed 1000 files to the category ham

real 16m46.975s
user 14m43.170s
sys 1m20.278s

DISK SPACE USAGE WITH THE DEFAULT SLOTCOUNT SETTING


This is the size of the folders corresponding to the different classifiers. Each folder contains exactly two files, so divide the listed size by two to get the size of the storage file.
22M correlate
70M entropy_unique_crosslink
14M hyperspace
13M hyperspace_unique
24M microgroom
12M osb_unigram
12M osb_unique_microgroom
2.2M osbf_unique_microgroom
24M winnow_unique_microgroom
