There's a new spam filtering patent, number 6,732,157, awarded to Network Associates.
"To me this looks like a pretty broad patent," said Rob Tosti, partner in the Patent and Intellectual Property Practice Group of Testa, Hurwitz & Thibeault, LLP in Boston. [Infoworld]
But lawyers usually aren't spam filter writers or experts in antispam mathematics. So let's take a closer look at the patent ourselves.
Every independent claim has something close to:
paragraph hashing including hashing a plurality of paragraphs and utilizing a database of hashes of paragraphs, wherein the paragraph hashing excludes at least one of a first paragraph and a last paragraph of content of the electronic mail messages, wherein a plurality of hashes each has a level associated therewith, and the hashes having a higher level associated therewith are applied to the electronic mail messages prior to the hashes having a lower level associated therewith
Here's some text from the specification that elucidates what they mean
in the above claim limitation:
With attention now to the paragraph hashing module 806, various
paragraphs of content of the electronic mail messages may be hashed
in order to identify content previously found in known spam. In
particular, the electronic mail messages may be filtered as being
unwanted upon results of the paragraph hashing matching that of known
unwanted electronic mail messages. This may be accomplished utilizing
a database of hashes of paragraphs known to exist in previously
filtered/identified spam.
As an option, the paragraph hashing may utilize an MD5 algorithm. MD5
is an algorithm that is used to verify data integrity through the
creation of a 128-bit message digest from data input (which may be a
message of any length) that is claimed to be as unique to that
specific data as a fingerprint is to a specific individual.
To facilitate this process, content of the electronic mail messages
may be normalized prior to utilizing the paragraph hashing. Such
normalizing may include removing punctuation of the content,
normalizing a font of the content, and/or normalizing a case of the
content.
As a further option, the paragraph hashing may exclude a first and
last paragraph of content of the electronic mail messages, as
spammers often alter such paragraphs to avoid filtering by paragraph
hashing.
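To make the quoted description concrete, here is a minimal Python sketch of paragraph hashing as the specification describes it: normalize, drop the first and last paragraphs, hash what remains with MD5, and look the hashes up in a set of known spam hashes. This is only an illustration, not the patented implementation. The function names (`normalize`, `paragraph_hashes`, `looks_like_known_spam`) are made up for this example, paragraphs are assumed to be separated by blank lines, and font normalization is skipped because the sketch works on plain text.

```python
import hashlib
import re
import string
from typing import List, Set


def normalize(paragraph: str) -> str:
    """Remove ASCII punctuation, lowercase, and collapse whitespace."""
    no_punct = paragraph.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())


def paragraph_hashes(body: str, skip_first_and_last: bool = True) -> List[str]:
    """Return MD5 hex digests of the normalized paragraphs of a message body."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", body) if p.strip()]
    if skip_first_and_last:
        # The first and last paragraphs are the ones the specification
        # says spammers tend to alter, so they are left out.
        paragraphs = paragraphs[1:-1]
    return [hashlib.md5(normalize(p).encode("utf-8")).hexdigest()
            for p in paragraphs]


def looks_like_known_spam(body: str, known_hashes: Set[str]) -> bool:
    """True if any remaining paragraph matches a hash from known spam."""
    return any(h in known_hashes for h in paragraph_hashes(body))


# Hypothetical database containing one paragraph seen in earlier spam.
known = {hashlib.md5(b"buy cheap pills now").hexdigest()}
message = "Hi Bob!\n\nBuy CHEAP pills, now!!!\n\nBye xk29"
print(looks_like_known_spam(message, known))
```

A filter that never splits messages into paragraphs, or that hashes every paragraph including the first and last, follows a different procedure from the one sketched here.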
Thus filters that don't hash paragraphs in the way the claim describes shouldn't infringe. That doesn't seem very broad to me. Most spam filters I have had any involvement with don't seem to care about paragraphs at all, and certainly don't seem to have something that "excludes at least one of a first paragraph and a last paragraph of content of the electronic mail messages."
Maybe I'm missing something, but I don't think this is going to be the big deal that the media is already building it up to be.
Additionally, the independent claims all have something like:
utilizing Bayes rules to filter the electronic mail message as being unwanted based on the user-defined Bayes rule threshold;
Certainly filters based on the chi technique don't do that; the threshold has nothing to do with Bayes' rule, because it is based on non-Bayesian statistics. I assume that's probably the case for many or most other kinds of filters. (Most chi implementations do use Bayes in handling Paul Graham's word probabilities, but the patent specifically talks about the endpoint classification of messages using a "Bayes rule threshold." That's different.)
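The difference can be sketched in a few lines of Python. The sketch assumes that "the chi technique" means combining per-word spam probabilities with Fisher's chi-square method, and it uses a simple Graham-style naive Bayes combination as a stand-in for a "Bayes rule" score. The function names and the sample probabilities are illustrative only, and every probability is assumed to lie strictly between 0 and 1.

```python
import math
from typing import Sequence


def bayes_score(probs: Sequence[float]) -> float:
    """Bayes' rule with equal priors and independent words: the result is
    treated as the posterior probability that the message is spam."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def chi2q(x2: float, dof: int) -> float:
    """Upper-tail probability of a chi-square variable (even dof only)."""
    assert dof % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi_score(probs: Sequence[float]) -> float:
    """Fisher-style combination: two significance tests, one for
    'spamminess' and one for 'hamminess', folded into one indicator."""
    n = len(probs)
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0


word_probs = [0.99, 0.95, 0.98]  # made-up per-word spam probabilities
print(bayes_score(word_probs), chi_score(word_probs))
```

Both functions return a number between 0 and 1 that can be compared with a cutoff, but only the first is a posterior computed by Bayes' rule. The second is built from p-values of chi-square tests, so a cutoff applied to it is a threshold on a significance-based indicator rather than a "Bayes rule threshold."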
A lot of people who haven't had cause to study patents much tend to assume that they are very liberally interpreted by the courts, but they are not. In fact, they are very rigidly interpreted according to rules that are getting narrower all the time. In particular, a number of rulings in recent years have made it harder to apply the "doctrine of equivalents." The DoE essentially says that two things are considered to be the same if they "perform substantially the same function in substantially the same way to achieve substantially the same result." (That's the "Graver Tank Tripartite Test.") That test is already pretty restrictive, and recent rulings often prevent even it from being applied.
So I don't think most people in the antispam community should get too worried about this patent, unless they specifically know that their filter carries out the claim limitations mentioned above.
Update: for prior art dating from 1997 that will have an impact on any attempt to get a broad patent, see Spamometer.
Further update: Michael Tsai, esteemed author of SpamSieve, agrees.