February 12, 2004
I'm going on vacation
I'll be away until the 23rd. For the first time in (possibly many) years, I'm not even going to take a laptop with me. I want to focus on my family rather than email, blogging, or coding. See you when I return!
February 11, 2004
It's hard to prognosticate
At dinner last night I gave my usual schpiel about how DaveNet started, and since I was dining with Microsoft people I emphasized the early piece I wrote about Bill Gates, and his response. I quoted Bill saying that the Internet wouldn't mean less sales for Flight Simulator or Encarta, and I said he was right but that wasn't the point. One of my companions stopped me there and said wait a minute, the Internet did mean less sales for Encarta. I was shocked. That's correct, and Gates got it wrong, and I wasn't enough of a visionary to see it. I got it wrong too. Who needs an encyclopedia on a CD-ROM when you have the Web at your fingertips? Someday some kid is going to ask you What is Encarta? That might be where you end up going today. [Dave Winer]
This makes me think particularly of Wikipedia. Dave's right. But way back then, it was hard for even the brightest and best-informed to see.
The power of people to band together on the Internet to create free (as in both beer and speech) projects such as Wikipedia or Linux is an incredible and unforeseeable development in the world. It's truly an emergent phenomenon -- it couldn't be predicted until the substrate and numbers of agents were actually in place to make it happen.
OK, maybe there were some who did predict it. You can always find somebody who "predicted" almost anything by random chance alone. But the fact is, in those early days, not even Linus Torvalds predicted what would happen.
Free song for a lucky reader
I got a free song from iTunes/Pepsi today. Key is L7XTR FHN7R if I'm reading it correctly.
My gift to the first reader who types it into the iTunes Music Store "Redeem Song" area. Just email me and tell me what you got. ;)
May I suggest you check out Crossroads of America by my friend Allen Shadow. :)
Woman marries dead boyfriend
Demichel told LCI television she was fully aware that "it could seem shocking to marry someone who is dead", but said that her fiance's absence from her life had not dimmed her feelings for him.... Before the ceremony can take place, it must be approved by the French president. [SMH.com.au. Another hat tip to Andrew.]
In France, they apparently truly believe in equal marriage rights for all.
Self-contradiction in the NY Times
It would be one thing if they said they'd changed their minds. But this doesn't seem to be that. I always find these instances of self-contradiction to be interesting.
"Mr. Bush said repeatedly that he went to the United Nations seeking a diplomatic alternative to war. In fact, the United States rejected all diplomatic alternatives at the time, severely damaging relations with some of its most important and loyal allies." - New York Times editorial, February 9, 2004.
"Yesterday's unanimous vote at the United Nations Security Council sends the strongest possible message to Baghdad...This is a well-deserved triumph for President Bush, a tribute to eight weeks of patient but determined and coercive American diplomacy…Only if the council fails to approve the serious consequences it now invokes -- generally understood to be military measures -- should Washington consider acting alone." - New York Times editorial, November 9, 2002. [Hat tip to Andrew Sullivan]
Even so, nobody can beat Kerry for self-contradiction.
February 10, 2004
Apple has 10% of computer users??
Naysayers have been calling for Apple's demise for years. But Apple not only has survived but thrived, it seems, at least partially by the sheer force of Jobs' will and his ability to maintain the ferocious loyalty of Apple's users, who still account for 10% of the world's computer users, while its sales usually account for about 3% to 5% of the world global PC market. [Forbes]
Is this true? And how could it be? Here's what one article says:
Macs last longer than Windows PCs. If Mac users replace their Mac every 4 years and PC users replace their every 1.5 years, what does that do to quarterly market share numbers? Not to mention, what does that do to landfills? The important number to analysts, marketeers, software developers, and others should be how many people out of 100 use a Mac? The answer is closer to 10 people out of 100 or 10 percent. Not 3 percent. We get tired of having to point this put, but we'll never stop doing so until the "3 percent myth" is destroyed. [MacDailyNews]
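The arithmetic behind that claim can be checked with a rough steady-state sketch. The assumptions here are mine, not the article's: unit sales are constant over time, every machine is used for exactly its replacement interval, and the 4-year and 1.5-year figures from the quote are taken at face value. Under those assumptions, the number of machines in use is proportional to sales share times lifespan.

```python
def installed_share(sales_share, mac_life_years=4.0, pc_life_years=1.5):
    """Estimate the Mac share of machines in use from its share of sales.

    Assumes a steady state: constant sales, and each machine stays in
    use for exactly its replacement interval (hypothetical simplification).
    """
    mac_in_use = sales_share * mac_life_years
    pc_in_use = (1.0 - sales_share) * pc_life_years
    return mac_in_use / (mac_in_use + pc_in_use)


# Sales shares across the 3% to 5% range quoted by Forbes.
for share in (0.03, 0.04, 0.05):
    print(f"sales share {share:.0%} -> installed share {installed_share(share):.1%}")
```

With these inputs, a 4% sales share works out to a 10% share of machines in use, and the 3% to 5% range maps to roughly 7.6% to 12.3%. So the 10% figure is at least arithmetically consistent with the quoted replacement intervals, though the result is only as good as those intervals.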
February 06, 2004
Tower Records headed for Chapter 11
The retail music channel continues to implode. They spent too much time fighting the advent of digital music, and not enough reinventing themselves. Now the only question is whether the labels follow them down the toilet. [Tim Oren]
February 05, 2004
Spam filtering: Training to Exhaustion
A couple of months ago I had an interesting email exchange with Boris 'pi' Piwinger about the "training-to-exhaustion" spam filtering technique. He contacted me with these ideas. I helped to write up some instructions, but other than that this is all his work. I promised I'd post the results to my blog, but so much has been going on with Goombah and our NSF grant that I had to put it off until now.
The Bogofilter site on SourceForge gives the following definition of training to exhaustion, based on "training on error":
"Training on error" involves scanning a corpus of known spam and non-spam messages; only those that are misclassified, or classed as unsure, get registered in the training database. It's been found that sampling just messages prone to misclassification is an effective way to train; if you train bogofilter on the hard messages, it learns to handle obvious spam and non-spam too...Test results are available.
"Training to exhaustion" is repeating training on error, with the same message corpus, until no errors remain.
Note that concerns have been expressed about possible theoretical issues with the approach. Another quote from the Bogofilter site:
A basic assumption of Bayes' theory is that the messages used for training are a randomly chosen sample of the messages received. This is violated when choosing messages by analyzing them first. Though theoretically wrong, in practice "training on error" seems to work.
Frankly, there are fundamental theoretical violations even in mainstream filters such as those based on the frequently-used "naive Bayes" approach or on my own work, because there is a theoretical assumption of statistical independence (not the same as the randomly chosen sample issue) which is violated by most of these techniques. But it was long ago experimentally shown that naive Bayes is actually robust against such a lack of independence. Eventually proofs were created to explain it, but they came after the fact. Later, my own technique was experimentally shown to be similarly robust (although I do have a technique "in the lab" to make it a bit more robust against that particular violation of the rules).
The bottom line, as far as I am concerned, is: what works, works; what doesn't, doesn't. The difference can be best determined by testing. In testing to date, training to exhaustion appears to work very well.
The exchange between Boris and me resulted in a set of simple instructions for how to do training-to-exhaustion, which may be of interest to anyone who wants to try the technique.
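For readers who think better in code, here is a minimal sketch of the loop described in the Bogofilter definition quoted above. It is not Boris's instructions and not bogofilter's algorithm: `TinyFilter`, its thresholds, and the four-message corpus are all hypothetical stand-ins, chosen only to make the training loop runnable. The cap on rounds is also my addition, since nothing guarantees that an arbitrary corpus can be learned without error.

```python
import math
from collections import Counter


class TinyFilter:
    """Toy token-count classifier (a stand-in, NOT bogofilter's method)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def train(self, text, label):
        """Register every token of the message under the given label."""
        self.counts[label].update(text.lower().split())

    def score(self, text):
        """Return a spamminess score in (0, 1); 0.5 means no evidence."""
        log_odds = 0.0
        for token in text.lower().split():
            spam_n = self.counts["spam"][token] + 1  # add-one smoothing
            ham_n = self.counts["ham"][token] + 1
            log_odds += math.log(spam_n / ham_n)
        return 1.0 / (1.0 + math.exp(-log_odds))

    def classify(self, text, lo=0.4, hi=0.6):
        """Return 'spam', 'ham', or 'unsure' (cutoffs are arbitrary)."""
        p = self.score(text)
        if p >= hi:
            return "spam"
        if p <= lo:
            return "ham"
        return "unsure"


def train_to_exhaustion(filt, corpus, max_rounds=50):
    """Repeat training on error until a full pass produces no errors.

    Returns the number of passes made, or None if max_rounds ran out.
    """
    for rounds in range(1, max_rounds + 1):
        errors = 0
        for text, label in corpus:
            # Training on error: only misclassified or unsure messages
            # get registered in the training database.
            if filt.classify(text) != label:
                filt.train(text, label)
                errors += 1
        if errors == 0:
            return rounds
    return None


corpus = [
    ("cheap pills buy now", "spam"),
    ("win money now", "spam"),
    ("meeting notes for monday", "ham"),
    ("lunch on monday", "ham"),
]
filt = TinyFilter()
print("passes needed:", train_to_exhaustion(filt, corpus))
```

The inner loop is plain training on error; the outer loop is what makes it training to exhaustion. A real test would swap in an actual filter and a real corpus, and would measure accuracy on messages held out from training.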
Update: In a response to this entry, Liudvikas Bukys thinks that training to exhaustion may be a "less-general" form of AdaBoost. I want to state that I personally make no claims for training to exhaustion relative to other approaches. I do see that it has done well in the testing that has been conducted so far by Boris and therefore seems to me to be of potential interest to the community, particularly since we have an accompanying very simple set of instructions making it easy for anyone to test. I think implementation simplicity is a real potential benefit of this technique. If it turns out that there is further discussion about the relative merits of training to exhaustion vs. other techniques through comments and/or trackbacks that would be great and could help clarify the relative value of the technique.
A Judge tells the RIAA what its problem is
One of the three judges told the RIAA attorney to stop using "abusive language", such as calling file-trading "piracy".
Here's the exact language the judge used, which Copyfight transcribed and TechDirt brought to my attention. (Thank you, Copyfight, for the transcript. By the way, EFF also has an mp3 of the arguments in court. Say, I think Groklaw started something.)
"Let me say what I think your problem is. You can use these harsh terms, but you are dealing with something new, and the question is, does the statutory monopoly that Congress has given you reach out to that something new. And that's a very debatable question. You don't solve it by calling it 'theft.' You have to show why this court should extend a statutory monopoly to cover the new thing. That's your problem. Address that if you would. And curtail the use of abusive language."
The whole piece is worth reading. Another enlightening quote from the transcription of the oral arguments:
"One academic study found that 90 percent of the content exchanged on file-sharing networks is copyrighted, [RIAA lawyer] Frackman noted.
"[Judge] Noonan pressed further, asking whether the authorized exchange of 10 percent of an estimated 750 million swapped files -- games, live recordings and public-domain works such as Shakespeare -- met the criteria the Supreme Court set forth in the Betamax case. 'That sounds like a lot of non-infringing use to me.'"
It sounds like a lot to me too... It would be hard to argue that 75 million files is not a significant amount of non-infringing use.
February 04, 2004
Monetary approach to defeating spam
Yahoo! and Microsoft are giving serious thought to the idea of e-mail "postage" that costs senders a small fee, company officials said.
The admissions come in the wake of Microsoft founder Bill Gates' January comments in Davos, Switzerland suggesting the spam problem will be defeated by a number of different solutions, but "in the long run, the monetary method will be dominant." [InternetNews]
I've occasionally collected flak from people who love the Bayesian filtering approach for asserting that the monetary approach could be a workable solution. But I have thought and continue to think it makes sense. It's just a matter of getting a critical mass of businesses to implement it. Yahoo and Microsoft together qualify as critical mass, I believe.
Essentially the idea is to either charge for all email (in which case spammers couldn't afford it) or only charge for emails rejected as spam. This can be done with "real" money, or with a currency based on expenditures of CPU cycles.
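To make the "CPU cycles as currency" variant concrete, here is a small sketch in the spirit of hashcash-style proof of work. It is my illustration only: the article does not say which scheme, if any, Yahoo! or Microsoft had in mind, and the stamp format, the function names, and the choice of SHA-256 are all assumptions. The sender must search for a value whose hash has a given number of leading zero bits, which is costly to find but cheap for the recipient to check.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def mint_stamp(recipient: str, bits: int = 16) -> str:
    """Sender's side: burn CPU until a stamp for this recipient qualifies.

    The 'recipient:nonce' format is hypothetical, not a real standard.
    """
    for nonce in count():
        stamp = f"{recipient}:{nonce}"
        digest = hashlib.sha256(stamp.encode()).digest()
        if leading_zero_bits(digest) >= bits:
            return stamp


def verify_stamp(stamp: str, recipient: str, bits: int = 16) -> bool:
    """Recipient's side: a single hash confirms the work was done."""
    if not stamp.startswith(recipient + ":"):
        return False
    digest = hashlib.sha256(stamp.encode()).digest()
    return leading_zero_bits(digest) >= bits


stamp = mint_stamp("alice@example.com", bits=16)
print(stamp, verify_stamp(stamp, "alice@example.com", bits=16))
```

Each extra bit doubles the expected work for the sender, while verification stays at one hash. That asymmetry is the whole point: trivial for someone sending a few messages, expensive for someone sending millions. A real system would also need to bind the stamp to a date and reject reused stamps, which this sketch leaves out.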
There are a number of feasible approaches other than cost-based ones. The main thing is for some solution or set of solutions to reach critical mass, followed by broad adoption by the industry. Spam is going to be all but eliminated in the next couple of years, as that critical mass is reached.