User Generated Headaches

I was asked by Josh Amer to do a guest post. I run, amongst other businesses, iBegin, a local search engine. Right now only in Canada, it will soon be expanding into the US.

There is a lovely little buzz going on about user-generated content. Tired webmasters no longer need to work and write quality content - nay, let the users carry that burden. After all, they get something out of it… don’t they?

Many of them are beholden to the idea that (in general), users are good. If there is one bad user, there are ten others to stop the vagabond. Pretty grand isn’t it?

Alas, two major holes crop up:

1. Oftentimes, the reporting user is taken at face value. The algorithm seems to be rather simple. Every time a user-generated entry is reported as spam, internally the system does this: +1 spam_report. If spam_report > 5 (more than 5 people have reported this as spam), hold or delete. It seems while provisions were made for malignant contributors, there were no provisions made for malignant ‘helpers’. To be honest, I have not seen a single website where this simplistic approach is not taken. This even works on Digg: observe the cloud view of upcoming stories. In my own random testing, it took roughly 5 ‘this is lame’ reports for the bottom stories, 7 for the middling stories, and 9 for top stories. Frontpage stories took roughly 11-13. I have enough employees to neuter almost any story. The former #1 user P9 had this done to him - every single story he submitted was immediately buried. Eventually he ‘quit’ - in reality he had been neutered and could make no impact on the site. With the stakes higher as Digg becomes more popular, suppressing a competitor’s story becomes rather useful. (NOTE: I only buried spam/duplicate stories.)

A few sites are starting to create UserRank values, akin to Google’s PageRank. The thinking is reasonable - if we know the ‘quality’ of a user, we can know if his/her contributions (be it new submissions, reports, etc) are valuable or not. Noble, but this leads into point #2 …

2. How do you know a user really is a user? In order to understand the challenges faced by user-driven websites, I have started delving into some blackhat SEO (purely research). Suffice it to say, sites like Digg and Reddit are already being heavily abused. Image captchas and so forth? All you need is a list of open proxies, a pinch of cURL, a dash of OCR software, mix well, and you have an automated system to run roughshod over any of the existing ‘social’ systems. Just generate some rules and the system can be digging or redditting or bookmarking within an hour. And email validation? All you need to do is pipe all the email addresses to a single script and simply fetch the URL contained within. Easy as pie.

The processing power required to really weed out ‘networks’ of users is immense. Digg has tried to do this for submissions (but not for reports) - if you often digg the same user’s stories, eventually your digg counts less. Of course, in reality this only works for real users. An automated system will have a unique IP (courtesy of proxies), a unique signup name (just take a list of first+last names, and concatenate them together with two random numbers at the end), and a unique ‘voting’ history (all votes are randomized). There is simply no way to know that all these (fake) users are interlinked.
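The "your digg counts less" idea can be sketched in a few lines. This is only an illustration and not Digg's actual formula: the `VoteWeigher` class and the 1/(1 + n) decay are assumptions made up for the example.

```python
from collections import Counter


class VoteWeigher:
    """Dampens repeat votes from one voter for the same submitter.

    Hypothetical sketch: the decay rule below is an assumption,
    not a documented algorithm of any real site.
    """

    def __init__(self):
        # (voter, submitter) -> number of votes already cast
        self._pair_counts = Counter()

    def cast(self, voter, submitter):
        """Record a vote and return the weight it should carry."""
        prior = self._pair_counts[(voter, submitter)]
        self._pair_counts[(voter, submitter)] += 1
        # First vote counts fully, then 1/2, 1/3, ...
        return 1.0 / (1 + prior)


weigher = VoteWeigher()
first = weigher.cast("alice", "bob")    # full weight
second = weigher.cast("alice", "bob")   # half weight
fresh = weigher.cast("carol", "bob")    # new account, full weight again
```

The last line shows the weakness described above: an account with no history always votes at full weight, so a supply of fresh fake accounts sidesteps the dampening entirely.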

The above two points are very important, as about a month ago I set out to make a user-driven politics website (coming soon at Wing Politics). Having already seen how ugly Digg’s political section got, it was obvious that to battle #1 I needed a UserRank system. Yet I also had #2 to contend with.
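
To make the contrast concrete, here is a rough sketch of the naive report counter from point #1 next to a UserRank-weighted one. The function names, the 0.0 to 1.0 rank scale, and the default rank for unknown users are all assumptions for illustration, not how any particular site does it.

```python
SPAM_THRESHOLD = 5  # from the "spam_report > 5" rule described in point #1


def naive_should_hold(reporters):
    """Point #1 as described: every report adds 1, whoever sent it."""
    return len(reporters) > SPAM_THRESHOLD


def ranked_should_hold(reporters, user_rank, default_rank=0.1):
    """Weight each distinct reporter by a hypothetical UserRank score.

    user_rank maps a username to a trust score between 0.0 and 1.0
    (an assumed scale). Unknown users get default_rank.
    """
    distinct = set(reporters)  # one user cannot report twice
    score = sum(user_rank.get(name, default_rank) for name in distinct)
    return score > SPAM_THRESHOLD


throwaways = [f"sock{i}" for i in range(6)]
naive_result = naive_should_hold(throwaways)        # six reports bury the entry
ranked_result = ranked_should_hold(throwaways, {})  # six unknowns add up to little
```

With the weighted version, six brand-new accounts no longer suffice, but the scheme is only as good as the rank values, which is exactly where point #2 bites.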

The answer was actually quite simple. A major site was already doing it, the cost was low, and its only downside was it did require some trust.

With that in mind, I make a bold prediction:

As user-driven websites become increasingly manipulated (in more and more sophisticated manners), they will have to start ‘validating’ that a user exists. The preferred choice of validation will be sending a validation code via SMS to a user’s cellphone.

Google’s GMail is already doing this - the amount of spam coming from an @gmail.com address is almost nil. People who contribute to such sites heavily skew towards a technophile/younger demographic - highly likely to have a cellphone. The cost, in both time and money, would be rather significant for an abuser to gain enough trusted user accounts. The cost of sending an SMS is not very high, and as long as the user can be convinced that their cellphone # will not be used for any spam/marketing purposes, you have a solid way of ensuring the uniqueness of a user.
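
A minimal sketch of that flow, under stated assumptions: the `SmsVerifier` class, the 6-digit code, and the 10-minute expiry are invented for the example, and `send_sms` is a hypothetical callable standing in for whatever SMS gateway a site would actually use.

```python
import hmac
import secrets
import time

CODE_TTL_SECONDS = 600  # assumed expiry window of 10 minutes


class SmsVerifier:
    """One account per phone number, confirmed by a texted code."""

    def __init__(self, send_sms, now=time.time):
        self._send_sms = send_sms  # hypothetical: send_sms(phone, text)
        self._now = now
        self._pending = {}         # phone -> (code, expires_at)
        self._verified = set()     # phones already tied to an account

    def start(self, phone):
        """Generate a code and text it to the given number."""
        if phone in self._verified:
            raise ValueError("phone number already used")
        code = f"{secrets.randbelow(10**6):06d}"
        self._pending[phone] = (code, self._now() + CODE_TTL_SECONDS)
        self._send_sms(phone, f"Your validation code is {code}")

    def confirm(self, phone, code):
        """Return True only for the right, unexpired code."""
        entry = self._pending.get(phone)
        if entry is None:
            return False
        expected, expires_at = entry
        if self._now() > expires_at:
            del self._pending[phone]
            return False
        # Constant-time comparison of the two codes
        if not hmac.compare_digest(expected.encode(), code.encode()):
            return False
        del self._pending[phone]
        self._verified.add(phone)
        return True
```

The uniqueness check in `start` is what raises the abuser's cost: each fake account now needs its own working phone number. The sketch leaves out things a real deployment would need, such as limiting the number of guesses per code and persistent storage.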

This post turned out to be rather lengthy, but I wanted to elaborate on the over-reliance on (and implicit trust in) users that most web 2.0 sites have. I am also sure that as exploits become more commonplace, the solution I have proposed will become much more common.

Thanks for posting this, it’s a very real problem and you’re right, it doesn’t get enough attention. I think partially because sites like Digg and other SNs want to inflate their user numbers. To an extent having fake accounts helps - you’ve never heard someone from Digg say ‘we have 20 million users… but 20% are fake.’

Aye those inflated numbers can be nice for selling ads (I did write an article about MySpace inflation - http://forevergeek.com/articles/debunking_the_myspace_myth_of_100_million_users.php ) but that is all short-term gain with disastrous long-term effects.

I find myself using Digg less and less - it seems like every iteration makes it less interesting and useful to me. About the only ‘web 2.0’ site I have kept up with is delicious, and only because of its complete utilitarian function.

I notice that too in my own habits, I tend to use the utilitarian sites and can’t ever really get sucked into any Social Networking sites. I guess to me it’s a matter of time wasters v. time savers.

There are other low-cost methods that may require a bit more investment at startup, such as more difficult captchas.

I have already started developing a captcha for my sites, and will make it more difficult when users start to abuse it.

SMS messaging is a good approach, but you end up losing a section of your users, either because they are not tech savvy or because they do not want to share their cell phone number (no matter how much you assure them it will not be abused).

Some sites have used credit cards as verification, though that would be more difficult to get users to accept than SMS in some cases.

You could develop a system that tracks the navigation of a user, and flags users that seem to be acting strange (pattern in the interval between queries, not requesting all images and javascripts, perhaps a javascript application that randomly will use AJAX to post back a heartbeat). All these things can be quite easy to implement in some cases, and much more difficult to get around. For instance, you can no longer just use cURL to query the pages, as you need to process the ever changing javascript applications.
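
As a rough sketch of the "pattern in the interval between queries" check: the function below flags a client whose requests arrive at suspiciously even intervals. The function name and both thresholds are assumptions chosen for illustration and would need tuning against real traffic.

```python
from statistics import mean, pstdev


def looks_automated(timestamps, min_requests=5, cv_threshold=0.1):
    """Flag request times (in seconds) that are too evenly spaced.

    Uses the coefficient of variation (std dev / mean) of the gaps
    between requests. Both thresholds are assumed values.
    """
    if len(timestamps) < min_requests:
        return False  # not enough data to judge
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average_gap = mean(gaps)
    if average_gap == 0:
        return True  # many requests at the same instant
    return pstdev(gaps) / average_gap < cv_threshold


bot_like = looks_automated([0, 10, 20, 30, 40, 50])      # perfectly regular
human_like = looks_automated([0, 3, 40, 44, 120, 121])   # bursty and uneven
```

A check like this only catches the lazy case. A script that sleeps for a random time between requests would produce uneven gaps too, so it is one signal among several rather than a verdict.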

An extreme would be to implement javascript encryption as well for posts back to the server, and have the method of encryption obfuscated.

At a given point, the cost of developing a way to crack your system would exceed the benefit of cracking it.

Just my 2 cents.

Wojjie - I mainly use pattern recognition and it works pretty well, but there are times when a pattern doesn’t match what I’m looking for even though it’s not a human actor. I don’t think you can look at this stuff in isolation and say that there’s one best answer. I really like the SMS method because it forces the action offline. When the method is built into your site it’s almost certain that there will be someone that can crack it.

Google does this in another way which you probably don’t know about - for Google Maps, if you want to change the data in your listing they will mail a password to you at the currently listed address. At least that’s how it was about a year ago.

Captchas are weak at best (are you going to captcha everything? If just the registration part, spend some time to create 100 users and tada you are good to go). Javascript is also weak - beyond the fact that many offices force JS off, a bot follows specific rules. All it takes is browsing target site X for a while to know what JS you need to follow and to do it. Google already parses out javascript links, and those are general cases. When you are targeting a single site, writing rules to follow the JS is not very difficult :)