Articles by Guest Poster

I was asked by Josh Amer to do a guest post. I run, amongst other businesses, iBegin, a local search engine. It currently covers only Canada, but will soon be expanding into the US.

There is a lovely little buzz going on about user-generated content. Tired webmasters no longer need to work and write quality content - nay, let the users carry that burden. After all, they get something out of it… don’t they?

Many webmasters are beholden to the idea that users are, in general, good. If there is one bad user, there are ten others to stop the vagabond. Pretty grand, isn’t it?

Alas, two major holes crop up:

1. Oftentimes, the reporting user is taken at face value. The algorithm seems to be rather simple. Every time a user-generated entry is reported as spam, internally the system does this: +1 spam_report. If spam_report > 5 (5 people have reported this as spam), hold or delete. It seems that while provisions were made for malignant contributors, no provisions were made for malignant ‘helpers’. To be honest, I have not seen a single website where this simplistic approach is not taken. This even works on Digg: observe the cloud view of upcoming stories. In my own random testing, it took roughly 5 ‘this is lame’ reports for the bottom stories, 7 for the middling stories, and 9 for top stories. Frontpage stories took roughly 11-13. I have enough employees to neuter almost any story. The former #1 user P9 had this done to him - every single story he submitted was immediately buried. Eventually he ‘quit’ - in reality he had been neutered and could make no impact on the site. With the stakes higher as Digg becomes more popular, suppressing a competitor’s story becomes rather useful. (NOTE: I only buried spam/duplicate stories.)
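To make the weakness concrete, here is a minimal sketch of the counter described above. It is my guess at the logic, not any site's actual code: the class name, the threshold of 5, and the choice to hold the entry when the fifth report arrives are all assumptions.

```python
from collections import defaultdict

SPAM_THRESHOLD = 5  # assumed cutoff, taken from the example in the text


class NaiveSpamCounter:
    """Counts every report equally, with no check on who is reporting."""

    def __init__(self, threshold=SPAM_THRESHOLD):
        self.threshold = threshold
        self.reports = defaultdict(int)  # entry id -> number of reports

    def report(self, entry_id):
        """Record one report; return True once the entry should be held."""
        self.reports[entry_id] += 1
        return self.reports[entry_id] >= self.threshold


counter = NaiveSpamCounter()
# Five reports from anyone at all (even one coordinated group) hide the entry.
results = [counter.report("story-42") for _ in range(5)]
```

Nothing in `report` looks at the reporter, which is exactly the hole: five colluding accounts are indistinguishable from five honest ones.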

A few sites are starting to create UserRank values, akin to Google’s PageRank. The thinking is reasonable - if we know the ‘quality’ of a user, we can know if his/her contributions (be it new submissions, reports, etc) are valuable or not. Noble, but this leads into point #2 …
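As a rough sketch of the idea, a report could be weighted by the reporter's rank instead of counting as a flat +1. The rank scale (0.0 to 1.0), the threshold of 3.0, and the function names are all hypothetical, chosen only for illustration.

```python
def weighted_spam_score(reporter_ranks):
    """Sum the (assumed 0.0-1.0) rank of every user who reported the entry."""
    return sum(reporter_ranks)


def should_hold(reporter_ranks, threshold=3.0):
    """Hold the entry only when enough *trusted* weight has reported it."""
    return weighted_spam_score(reporter_ranks) >= threshold


# Six throwaway accounts with almost no reputation do not bury a story...
throwaway_reports = [0.1] * 6
# ...but four long-standing, trusted users do.
trusted_reports = [0.9] * 4

held_by_throwaways = should_hold(throwaway_reports)
held_by_trusted = should_hold(trusted_reports)
```

The scheme is only as good as the rank itself, so it still depends on knowing that each account is a real, distinct person.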

2. How do you know a user really is a user? In order to understand the challenges faced by user-driven websites, I have started delving into some blackhat SEO (purely research). Suffice it to say, sites like Digg and Reddit are already being heavily abused. Image captchas and so forth? All you need is a list of open proxies, a pinch of cURL, a dash of OCR software, mix well, and you have an automated system to run roughshod over any of the existing ‘social’ systems. Just generate some rules and the system can be digging or redditting or bookmarking within an hour. And email validation? All you need to do is pipe all the email addresses to a single script and simply fetch the URL contained within. Easy as pie.

The processing power required to really weed out ‘networks’ of users is immense. Digg has tried to do this for submissions (but not for reports) - if you often digg the same user’s stories, eventually your digg counts less. Of course, in reality this only works for real users. An automated system will have a unique IP (courtesy of proxies), a unique signup name (just take a list of first+last names, and concatenate them together with two random numbers at the end), and a unique ‘voting’ history (all votes are randomized). There is simply no way to know that all these (fake) users are interlinked.
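The damping Digg appears to apply can be sketched roughly as below. This is not Digg's actual algorithm, which is not public; the halving factor and the class name are assumptions picked for illustration.

```python
from collections import Counter


class DecayingVoteWeights:
    """Each repeat vote by the same voter for the same submitter counts less."""

    def __init__(self, decay=0.5):
        self.decay = decay            # assumed factor: every repeat halves the weight
        self.pair_counts = Counter()  # (voter, submitter) -> votes seen so far

    def vote(self, voter, submitter):
        """Return the weight of this vote and remember the pairing."""
        prior = self.pair_counts[(voter, submitter)]
        self.pair_counts[(voter, submitter)] += 1
        return self.decay ** prior


weights = DecayingVoteWeights()
# A real fan who keeps digging the same submitter is damped quickly.
fan_votes = [weights.vote("fan", "alice") for _ in range(3)]
# Three freshly registered accounts each vote once, so nothing is damped.
fake_votes = [weights.vote(f"fake{i}", "alice") for i in range(3)]
```

The loyal real user ends up contributing less than three throwaway identities, which is why this kind of check only bites real users.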

The above two points are very important, as about a month ago I set out to make a user-driven politics website (coming soon at Wing Politics). Having already seen how ugly Digg’s political section got, it was obvious that to battle #1 I needed a UserRank system. Yet I also had #2 to contend with.

The answer was actually quite simple. A major site was already doing it, the cost was low, and its only downside was that it required some trust.

With that in mind, I make a bold prediction:

As user-driven websites become increasingly manipulated (in more and more sophisticated manners), they will have to start ‘validating’ that a user exists. The preferred choice of validation will be sending a validation code via SMS to the user’s cellphone.

Google’s Gmail is already doing this - the amount of spam coming from an @gmail.com address is almost nil. People who contribute to such sites skew heavily towards a technophile, younger demographic - highly likely to have a cellphone. The cost, in both time and money, would be rather significant for an abuser trying to gain enough trusted user accounts. The cost of sending an SMS is not very high, and as long as users can be convinced that their cellphone # will not be used for any spam/marketing purposes, you have a solid way of ensuring the uniqueness of a user.
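The bookkeeping behind such a check is small. Below is a minimal sketch, assuming a six-digit code, a ten-minute expiry, and one account per phone number. The class and its methods are hypothetical, and actually delivering the code is left to whatever SMS gateway a site uses.

```python
import hmac
import secrets
import time
from typing import Optional

CODE_TTL_SECONDS = 600  # assumed validity window of ten minutes


class PhoneVerifier:
    """Issues one-time codes and allows each phone number to verify once."""

    def __init__(self, ttl: int = CODE_TTL_SECONDS) -> None:
        self.ttl = ttl
        self.pending = {}      # phone -> (code, expires_at)
        self.verified = set()  # phone numbers already tied to an account

    def issue_code(self, phone: str, now: Optional[float] = None) -> str:
        """Create a code for this number; the caller sends it by SMS."""
        if phone in self.verified:
            raise ValueError("phone number already tied to an account")
        now = time.time() if now is None else now
        code = f"{secrets.randbelow(10**6):06d}"  # unpredictable six digits
        self.pending[phone] = (code, now + self.ttl)
        return code

    def confirm(self, phone: str, submitted: str, now: Optional[float] = None) -> bool:
        """Accept the signup only if the code matches and has not expired."""
        if phone not in self.pending:
            return False
        now = time.time() if now is None else now
        code, expires_at = self.pending[phone]
        if now > expires_at:
            return False
        # Constant-time comparison, so timing does not leak the code.
        if not hmac.compare_digest(code.encode(), submitted.encode()):
            return False
        del self.pending[phone]
        self.verified.add(phone)
        return True
```

A signup flow would call `issue_code`, pass the result to the gateway, and later call `confirm` with whatever the user typed in. Refusing a number that is already in `verified` is what makes each extra fake account cost a real phone line.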

This post turned out to be rather lengthy, but I wanted to elucidate the overreliance on (and implicit trust in) users that most web 2.0 sites have. I am also sure that as exploits become more commonplace, the solution I have proposed will become much more common.