A couple of months ago I had an interesting email exchange with Boris 'pi' Piwinger about the "training-to-exhaustion" spam filtering technique. He contacted me with these ideas. I helped to write up some instructions, but other than that this is all his work. I promised I'd post the results to my blog, but so many things have been going on with Goombah and our NSF grant that I've had to put it off -- I just had other priorities I had to deal with.
The Bogofilter site on SourceForge gives the following definition of training to exhaustion, based on "training on error":
"Training on error" involves scanning a corpus of known spam and non-spam messages; only those that are misclassified, or classed as unsure, get registered in the training database. It's been found that sampling just messages prone to misclassification is an effective way to train; if you train bogofilter on the hard messages, it learns to handle obvious spam and non-spam too...Test results are available.
"Training to exhaustion" is repeating training on error, with the same message corpus, until no errors remain.
Note that concerns have been expressed about possible theoretical issues with the approach. Another quote from the Bogofilter site:
A basic assumption of Bayes' theory is that the messages used for training are a randomly chosen sample of the messages received. This is violated when choosing messages by analyzing them first. Though theoretically wrong, in practice "training on error" seems to work.Frankly, there are fundamental theoretical violations even in mainstream filters such as those based on the frequently-used "naive Bayes" approach or on my own work, because there is a theoretical assumption of statistical independence (not the same as the randomly chosen sample issue) which is violated by most of these techniques. But it was long ago experimentally shown that naive Bayes is actually robust against such a lack of independence. Eventually proofs were created to explain it, but they came after-the-fact. Later, my own technique was experimentally shown to be similarly robust (although I do have a technique "in the lab" to make it a bit more robust against that particular violation of the rules).
The bottom line, as far as I am concerned, is: what works, works; what doesn't, doesn't. The difference can be best determined by testing. In testing to date, training to exhaustion appears to work very well.
The exchange between Boris and me resulted in a set of simple instructions for how to do training-to-exhaustion, which may be of interest to anyone who wants to try the technique.
Update: In a response to this entry, Liudvikas Bukys thinks that training to exhaustion may be a "less-general" form of AdaBoost. I want to state that I personally make no claims for training to exhaustion relative to other approaches. I do see that it has done well in the testing that has been conducted so far by Boris and therefore seems to me to be of potential interest to the community, particularly since we have an accompanying very simple set of instructions making it easy for anyone to test. I think implementation simplicity is a real potential benefit of this technique. If it turns out that there is further discussion about the relative merits of training to exhaustion vs. other techniques through comments and/or trackbacks that would be great and could help clarify the relative value of the technique.