There’s been a lot of talk recently about spammers posting comments in Movable Type to attract people to their sites. Jay Allen has created MT-Blacklist, which blacklists comments based on the linked content. He surmises that the linked content is what determines what’s spam and what’s not, and I agree. ScriptyGoddess has a guide to stopping spam.
TeledyN has discovered evidence of weblog SpamBots, which Kevin Donahue’s CAPTCHA comment system should do a good job of blocking. Personally I don’t like CAPTCHAs because my fingers don’t do well typing random characters. They prefer English words, or at least English-like words such as floomper or interflumoxicity.
Obviously it sucks for people who run weblogs to find their conversations attracting spammers. In my short stint using MT I’ve already encountered a few of these comments, and I can only expect this to get worse. I read somewhere today something equivalent to “blogging can’t be cool anymore, the President has one” which means it’s only going to get more popular. Popularity is always a target for spam.
So what else can be done about it? I’m surprised no one has mentioned Bayesian filtering of comments. Like most people who’ve heard of it, I first found out about Bayesian filtering, and how it can identify spam, from A Plan for Spam. Since then virtually every spam blocking system has started using Bayesian techniques for at least some part of identification.
I believe that a plugin could be written for Movable Type that could identify spam based on the content of the comment and the content of linked sites. A score could be assigned to each comment, and if the score crossed a certain threshold the comment would be queued or deleted. The queueing process would also allow site maintainers to build their own blacklists and whitelists.
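To make the queue-or-delete idea concrete, here is a rough sketch in Python rather than as an actual Movable Type plugin. The threshold values and the names `triage` and `record_decision` are my own assumptions for illustration, and the spam score is assumed to come from whatever classifier the plugin uses.

```python
from urllib.parse import urlparse

# Hypothetical thresholds on a 0.0-1.0 spam score; a real site would tune these.
QUEUE_THRESHOLD = 0.5
DELETE_THRESHOLD = 0.9


def triage(spam_score, queue_at=QUEUE_THRESHOLD, delete_at=DELETE_THRESHOLD):
    """Decide what to do with a comment given its spam score."""
    if spam_score >= delete_at:
        return "delete"   # confident enough to drop it outright
    if spam_score >= queue_at:
        return "queue"    # hold for the site maintainer to review
    return "publish"


def record_decision(url, is_spam, blacklist, whitelist):
    """Update the site's own lists when a maintainer rules on a queued comment."""
    domain = urlparse(url).netloc.lower()
    if not domain:
        return
    if is_spam:
        blacklist.add(domain)
        whitelist.discard(domain)
    else:
        whitelist.add(domain)
        blacklist.discard(domain)
```

The point of keeping a middle "queue" band is that every decision a maintainer makes there feeds the lists, so the queue should shrink over time.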
Because Bayesian filters "learn" from each site’s own comments, they are well suited to stopping spam comments. A site dedicated to porn would most likely link to porn sites, which many other sites would consider spam, but it would still want a way to filter out mortgage spam. Also, since each install is trained differently, spammers wouldn’t be able to craft their ads to get around every filter.
So what do you think: is this something the LazyWeb has built or could build? Are there any fatal flaws with using Bayesian algorithms to identify spam comments?
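Here is a minimal sketch of what per-site learning might look like. It is a simplified naive-Bayes-style scorer, not the exact method from A Plan for Spam, and the class name `CommentFilter`, the tokenizer, and the sample comments are all made up for illustration. In practice the text passed in would be the comment plus whatever could be fetched from the linked pages.

```python
import math
import re
from collections import Counter


class CommentFilter:
    """Toy naive-Bayes-style filter trained on one site's own comments."""

    def __init__(self):
        # For each label, how many trained comments contained each token.
        self.counts = {"spam": Counter(), "ham": Counter()}
        # Number of trained comments per label.
        self.totals = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def train(self, text, label):
        self.counts[label].update(self.tokenize(text))
        self.totals[label] += 1

    def spam_probability(self, text):
        # Sum smoothed log-odds over the tokens present in the text.
        log_odds = math.log((self.totals["spam"] + 1) / (self.totals["ham"] + 1))
        for token in self.tokenize(text):
            p_spam = (self.counts["spam"][token] + 1) / (self.totals["spam"] + 2)
            p_ham = (self.counts["ham"][token] + 1) / (self.totals["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() well-behaved
        return 1 / (1 + math.exp(-log_odds))


# Two installs trained on different ideas of what counts as spam.
general_blog = CommentFilter()
general_blog.train("cheap mortgage rates refinance now", "spam")
general_blog.train("hot adult pics click here", "spam")
general_blog.train("great post about movable type plugins", "ham")

adult_site = CommentFilter()
adult_site.train("cheap mortgage rates refinance now", "spam")
adult_site.train("hot adult pics from the new gallery", "ham")

comment = "new adult pics"
print(general_blog.spam_probability(comment))
print(adult_site.spam_probability(comment))
```

The same comment gets a different score on each install because each filter only knows what its own maintainer has marked as spam or not.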


8 responses to “Moveable Type Comment Spam”

  1. Proof once again that great minds are frequently unoriginal with respect to one another: just moments ago I proposed much the same method to the Drupal developers list, though maybe not specifically Bayesian. I proposed that we base the existing display/discard threshold value on a SpamAssassin sort of 5-point scale. Postings above some threshold would be held pending approval, those above a higher threshold would be certified as spam and never recorded, and anything under the limit score would pass through without any special challenge or any other noticeable difference from the way things are today.
    In addition, if the regex pattern file or the Bayesian word-frequency files were external to this process, they could be traded the same way virus-hunting software is able to auto-update to the latest blacklist.

  2. Bayesian filtering has been brought up many times, actually, like on the MT support forum. However, I don’t think that typical weblog comments are long enough to really give a decent corpus for analysis. Analysing the linked sites could probably work, though there are plenty of other challenges to deal with, like how lots of those sites are pure images with lots of JavaScript, so they can’t really be tokenized for lexical analysis that easily AFAIK.
    What I’d rather see is some sort of registration key scheme: if you try to post a comment, it gets queued and generates (and stores in a cookie) a random “posting authorization” key, which can then be used on future posts (and is otherwise never displayed publicly). And of course, the keys can be revoked.
    Though personally, I’d just be happy with a moderation interface which doesn’t suck as bad as MT’s.

  3. Fluffy – I agree that the text of most messages is too short to give you a decent read, but I believe that the links hold the key to filtering. What good is comment spam if it doesn’t link to anything? I suspect that any javascript or images that the spammers put in will be added to the filter by the algorithm.
    One way I could see this being defeated is with the spammer’s website sending the link checker bogus information. Example: a spammer knows that he will be sending out spam comments from 01:00 to 13:00, so for those 12 hours his website shows something harmless and generic. Once all his spam comments are posted, he switches his site to its real, commercial content.
    As for registration, the rdfweb-dev list is discussing Drupal’s implementation of remote authentication right now. Drupal is doing some interesting things and I hope to see things like distributed remote authentication rolled into other programs. I hope to see a FOAF implementation, but I’ll takes what I can gets.

  4. Bayesian filter for MT

    Many people have complained that my “Solution for comments spams” is unfriendly to the disabled or those who do not have a graphical browser. Hearing your feedback, I spent the last 2 days working on a Bayesian plugin. To cut the…

  5. Bayesian filter for MT

    There’s been a lot of talk over installing the spam comment filter for Movable Type. George even mentioned the possibility of writing a Bayesian filter for MT. Thankfully James Seng caught on to the idea, and created that filter. If…

  6. Spam Comments and JSPs

    So I was reminded that I really needed to post again by getting MT-Comment spam (other views on it here and elsewhere..) that’s been going around lately, posted not only to my little unpopular weblog, but to a really old entry in it.. So what have I be…
