April 28, 2004
Handling Redundancy in Email Token Probabilities is a new article by yours truly about the chi-square-based technique for spam detection. (That's the PDF, here's the Postscript version.) The abstract:
One of the many techniques that have recently been employed for filtering spam is the one described in the Linux Journal article A Statistical Approach to the Spam Problem. This technique incorporates ideas from the seminal article A Plan For Spam as well as R. A. Fisher's technique for combining p-values by means of the chi-square distribution. The technique presented here takes the chi-square-based approach a step further by taking into account two facts: a) there is redundancy in the token probabilities, and b) spam and ham emails have different amounts of such redundancy. Fivefold cross-validation was carried out on the new technique and is described here, testing whether these factors actually lead to better performance. The results were positive and statistically significant.
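For readers who have not seen Fisher's technique, here is a minimal Python sketch of the standard textbook method the abstract refers to. It is not the code from the article: the function names `chi2_sf_even` and `fisher_combine` are made up for illustration, and the sketch assumes the p-values are independent and strictly between 0 and 1. That independence assumption is exactly what redundancy among token probabilities calls into question.

```python
import math


def chi2_sf_even(x, dof):
    """Upper-tail probability of a chi-square variable with an even
    number of degrees of freedom, using the closed-form series
    exp(-x/2) * sum_{i=0}^{dof/2 - 1} (x/2)^i / i!.
    (Hypothetical helper name, chosen for this sketch.)"""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("dof must be a positive even integer")
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Guard against tiny floating-point overshoot above 1.
    return min(total, 1.0)


def fisher_combine(pvalues):
    """Combine independent p-values with Fisher's method:
    -2 * sum(ln p) follows a chi-square distribution with
    2 * len(pvalues) degrees of freedom under the null hypothesis."""
    if not pvalues:
        raise ValueError("need at least one p-value")
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    return chi2_sf_even(stat, 2 * len(pvalues))
```

With a single p-value the function returns that p-value unchanged, and several small p-values combine into a smaller one. If the inputs are correlated rather than independent, the combined value overstates the evidence.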
Note: the link to the article above is to version .94; earlier versions are still available here, here, here, and here, but they have some typos, and the original has an error in Eq. 3. Many thanks to Greg Louis, David Relson, and others for pointing them out. Also, Laird Breyer has pointed to areas where I could have done a better job of explaining things, which has also led to improvements in the new drafts and is appreciated. You should definitely use version .94.
There's (in version 0.91) a typo two lines below Eq. (5) (the phrase should read "a great deal of evidence of hamminess as well as a great deal of evidence of spamminess").
Posted by: Stefan Mashkevich at Apr 29, 2004 3:38:34 PM
Thanks! I'll correct that in .92.
Posted by: Gary Robinson at Apr 29, 2004 3:47:20 PM