May 03, 2004
Update, May 10 2004: I have been working on a version of this document that is much more rigorous and concise. If you haven't read this yet, I must admit that I think you should hold off until the next version.
I've written another article on Fisher's inverse chi-square test as used for spam filtering tasks. This one's 8 pages long. Its title is "Why Chi?" with the subtitle "Motivations for the Use of Fisher's Inverse Chi-Square Procedure in Spam Classification."
The immediate trigger for this article was a set of questions from Laird Breyer on the spamfilt mailing list this past weekend. Jonathan Zdziarski has also asked for more information in recent weeks.
Once I started answering Laird, I figured I might as well go all the way and write an article that describes the motivations I see for using the chi-square approach.
It's a very interesting subject to me, involving things I've been thinking about for years. The article discusses aspects of the algorithm that haven't been discussed in any other online document, and that in some ways are more fundamental than the issues covered in other documents.
I expect that this is the last longish article I will write on the chi-square algorithm for spam filtering for a long time to come, if not forever. It says pretty much everything else I have to say, except that I have a couple of other possible improvements in mind that I'll try out one day if I get a chance.
I find myself wishing this was available in reflowable form, so that I can actually sit down to read it -- but it looks extremely interesting regardless. I'll print it out the next time I see a printer and read it; thanks :)
Posted by: Richard Soderberg at May 10, 2004 1:24:24 AM