Monday, June 25, 2007

Blocking Spam: An Idea

Consider a firm filtering out spam for a million customers. One way of identifying spam is to look for messages received by multiple customers. If ten thousand people receive identical messages, it is a pretty safe bet that they are all spam.

One problem with doing this is privacy; the customers do not want someone else to be reading their mail. The comparison would, of course, be done by computer, but once the message has been sent to the spam filtering company, the customer has no way of knowing who, other than a computer, is looking at it.

There is a simple solution to the problem. Instead of forwarding your email to the filtering company, forward a hash of your email. Your own computer applies a one-way hash function to each message, calculating from it a long number. If the number is long enough, the probability that two different messages will hash to the same number becomes vanishingly small. But a twenty-digit number still contains much less information than a hundred-word email, so there is no way of reversing the process and deducing the message from its hash. Forward the hash to the spam filtering company--doing that not only protects your privacy, it also takes a lot less bandwidth than forwarding the email. Get back information on whether or not it matches the hash of messages received by many other customers, and junk or read the email accordingly.
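
A minimal sketch of the scheme, under stated assumptions: SHA-256 stands in for the unspecified one-way hash, and `message_hash`, `HashCounter`, and the threshold value are hypothetical names chosen for illustration, not part of any existing service.

```python
import hashlib
from collections import Counter


def message_hash(body: str) -> str:
    """Client side: reduce a message to a hex SHA-256 digest (assumed hash choice)."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class HashCounter:
    """Hypothetical filtering service: it sees only digests, never message text."""

    def __init__(self, threshold: int):
        self.threshold = threshold  # sightings needed before a digest counts as bulk mail
        self.counts = Counter()

    def report(self, digest: str) -> bool:
        """Record one sighting; return True once the digest has reached the threshold."""
        self.counts[digest] += 1
        return self.counts[digest] >= self.threshold
```

Each customer would call `message_hash` locally and send only the digest to `report`. With a threshold of ten thousand, the service would start flagging a message once that many identical digests had arrived. The sketch does not deduplicate repeat reports from the same customer.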

Have I just reinvented the wheel? Is anyone currently using some variant of this approach?

20 comments:

Jay Goodman Tamboli said...

Yup, it's been done. See http://razor.sourceforge.net/

It's even more complex than that, though, since a lot of spammers insert randomly-generated text into each message to change the hash. I believe Vipul's Razor even detects these kinds of small variations.

Anonymous said...

There are some more problems with the idea. One concerns privacy. To my knowledge, as long as you send and receive ordinary email over the internet, there is virtually no privacy at all. Plain emails are commonly compared to postcards: they can pass through any number of hands along the way. Mail without encryption is not really private.
Another is that Bayesian filters (as in Thunderbird and the like) do really good work. Combining them with online spam databases possibly works even better. The collective power of people marking spam as spam, and trusting other people's decisions to some degree, seems much more powerful than statistically examining email.

Unknown said...

We are doing pretty well with ASSP (assp.sourceforge.net). It costs nothing if you have hardware lying around. It is free and open source, and you can run it on FreeBSD. It's your own company that is running the spam filter, so privacy is assured (inasmuch as privacy exists on corporate email--the corporate entity's private data is assured). It also runs a Bayesian filter and consults blacklists and other ASSP servers' spam databases.

We love it; we block about 90% of our incoming SMTP mail. Very little spam slips through and we don't lose any real mail because it has an automatic whitelist.

Patri Friedman said...

Spam nowadays is all randomized. Would have worked in the early days - but then they would have just moved to randomization sooner. Exact duplicate detection is not a strategy that holds up well to a malicious adversary.

Pace said...

Or just use gmail and rarely receive SPAM.

Lippard said...

Vipul's Razor/Cloudmark and DCC are two ways that this has been done. As Jay Goodman Tamboli pointed out, spammers defeat this by inserting random text for "hash busting" purposes, but it is still somewhat effective.

Anonymous said...

Not all duplicate e-mails are spam - there are also e-mailed publications that people subscribed to. So before you can get duplicate detection off the ground, you must encourage your customers to whitelist everything of that sort.

Then a pure hash comparison is far too easily defeatable by randomized inserts. It's still possible to use a sort of hash comparison, say hashing short strings of words and sending out a set of hashes, with some threshold number of matches triggering a reject, but as the spammer gets better I suspect you'll get to the point where the message could be reconstructed from the set of hashes. The one thing you gain in the long run is that the spammers put so much computer-generated gibberish in their messages that after they make it to your inbox, you instantly recognize them as nonsense and therefore spam.
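
A rough sketch of the "set of hashes" variant described above, assuming four-word windows, SHA-256 truncated to 16 hex digits, and a match threshold of three. All of these values and the function names are illustrative guesses, not a description of any deployed filter.

```python
import hashlib


def shingle_hashes(text: str, k: int = 4) -> set[str]:
    """Hash every run of k consecutive words; return the set of truncated digests."""
    words = text.lower().split()
    # A message shorter than k words yields a single hash of the whole text.
    windows = [words] if len(words) < k else [
        words[i:i + k] for i in range(len(words) - k + 1)
    ]
    return {
        hashlib.sha256(" ".join(w).encode("utf-8")).hexdigest()[:16]
        for w in windows
    }


def looks_like_known_spam(candidate: set[str], known: set[str], threshold: int = 3) -> bool:
    """Reject when at least `threshold` window hashes match a known spam message."""
    return len(candidate & known) >= threshold
```

Random text prepended or appended to a message changes only the windows that touch it, so the remaining windows still match. Random words scattered through the body would break far more windows, which is the arms race the comment anticipates, and short windows drawn from a small vocabulary can be guessed by brute force, which is the reconstruction risk it mentions.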

Finally, why are most anti-spam programs unable to recognize pure gibberish and use that as a rejection criterion?

jimbino said...

Easy--if spam filters caught pure gibberish we wouldn't be able to hear from our elected representatives.

Anonymous said...

It is probably worth mentioning that what is spam for the vast majority of people can still be useful information for a small minority. If it weren't, there would be no point in sending out spam.

Anonymous said...

Killing Spam is a Hard Problem. Somebody (not me) came up with this anti-spam checklist, which you might find useful (and possibly funny).

See http://www.craphound.com/spamsolutions.txt

Anonymous said...

The biggest problem here is that mass-subscription or widely-available spam blocking systems are accessible to the spammers. They can adapt their techniques to learn how to bypass the blocking algorithms, and can analyze those algorithms either by looking at source or by reverse engineering.

SheetWise said...

There is a simple solution -- postage. I send a lot of email, but not enough that I would mind spending 5c on each.

If a private company was set up so that all net revenues from the subscriber would go to the charity of their choice -- the government would hesitate to intervene or take over. If mail clients then validated postage and indicated which messages were paid -- users would have a great incentive to block unpaid mail just as they block anonymous phone calls.

If spammers had to pay $0.05 to send an email -- they would be brought to their knees. For those of us who could simply eliminate the static -- $0.05 would be a small price to pay, and for a cause we believe in (everybody believes in something -- right?).

Anonymous said...

But that would straight away kill major mailing lists ($0.05*100000 subscribers = $5000 for every mailout.)

Also, it would be a significant expense for a lot of other people. Plenty of people regularly send several hundred emails per day. For every hundred, that would cost $5.

Anonymous said...

There have been proposals where the recipient can waive the postage by not marking the incoming message as spam. Thus there's a charge only for email that the recipient does not want.

John Fast said...

As far as I know, the service Cruelmail.com follows the procedure given by anonymous -- and which was invented by David Friedman.

Anonymous said...

My understanding is that Google has a similar approach with its "Report Spam" feature. If enough people report a particular message as spam, that sender is automatically marked as spam for other users in the system. Since spam doesn't arrive simultaneously, this approach tends to be effective at keeping spam from the inbox. The feature is also simple; it's as easy as deleting the message.

Anonymous said...

A highly effective, but slightly inconvenient, approach to the spam problem is using a whitelist. People not on the whitelist get an automated response which requires answering a question; if the response is correct, then the email is allowed through. Now here's the key: the response is *not* standardized, but created custom by each email user. A spammer would have to pay someone to manually enter responses. If somebody is crazy enough to do that, and you start getting spam, you just change your automatic response and the spam stops.

Unlike other approaches, there's no arms-race aspect. The spammer can "pay" or not, but there's no system to break.
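
The whitelist-plus-custom-question idea can be sketched as below. The class and method names are hypothetical, answers are compared case-insensitively as a simplifying assumption, and actual mail delivery is left out entirely.

```python
class ChallengeWhitelist:
    """Hypothetical per-user whitelist guarded by a question the user writes."""

    def __init__(self, question: str, answer: str):
        self.question = question
        self._answer = answer.strip().lower()
        self.allowed: set[str] = set()

    def is_allowed(self, sender: str) -> bool:
        """True if mail from this sender should go straight to the inbox."""
        return sender in self.allowed

    def answer_challenge(self, sender: str, reply: str) -> bool:
        """Whitelist the sender if the reply matches the user's own answer."""
        if reply.strip().lower() == self._answer:
            self.allowed.add(sender)
            return True
        return False

    def change_challenge(self, question: str, answer: str) -> None:
        """Swap in a new question; senders already whitelisted stay whitelisted."""
        self.question = question
        self._answer = answer.strip().lower()
```

A mail client would call `is_allowed` on each incoming message and reply with `question` to unknown senders. Whether changing the question should also drop senders who were already let through is a design choice the comment leaves open; this sketch keeps them.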

Raphfrk said...

That's a pretty good idea.

You could also email the person a captcha. I don't think it would be that inconvenient, really, as long as it remembers emails that were allowed through.

I wonder if there are any people doing text captchas using ASCII art? :)

Anonymous said...

"You could also email the person a captcha."

That was my first thought, but a query that is custom-made by each user is stronger. It could be anything from "what's the square root of 4?" to "what's my cat's name?" (the latter filtering out email from strangers as well as spammers). Infinite variety with no development cost.

Anonymous said...

Or there are solutions proposed where that process is automated - the sending computer has to figure out some computational puzzle that might take 20 seconds.
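
One common form of such a puzzle is a hash-based proof of work. The sketch below is an assumption about how it might look, not a specific proposal from this thread: the sender searches for a nonce whose SHA-256 digest starts with a given number of zero bits, and the recipient checks it with a single hash. The function names and the default difficulty are illustrative.

```python
import hashlib
from itertools import count


def _digest_value(message: str, nonce: int) -> int:
    """Interpret SHA-256(message:nonce) as a 256-bit integer."""
    digest = hashlib.sha256(f"{message}:{nonce}".encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def solve_puzzle(message: str, difficulty_bits: int = 16) -> int:
    """Sender side: try nonces until the digest has difficulty_bits leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    for nonce in count():
        if _digest_value(message, nonce) < target:
            return nonce


def verify_puzzle(message: str, nonce: int, difficulty_bits: int = 16) -> bool:
    """Recipient side: one hash is enough to check the sender's work."""
    return _digest_value(message, nonce) < (1 << (256 - difficulty_bits))
```

Each extra difficulty bit roughly doubles the sender's expected work while verification stays constant, so the cost could in principle be tuned toward a figure like the 20 seconds mentioned above on given hardware.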

The problem with all such solutions is that of legitimate mass-mailings. Ask anyone who runs a large mailing list: a non-trivial number of users will change email providers, programs, addresses, etc., between each mailing. With a large subscriber base (say 100,000), even a very small percentage of users not whitelisting properly would result in an enormous amount of work for the list-maintainer. Often these lists are free, and the maintainer can't afford to pay someone to wade through hundreds of challenge-responses every week.

Also, even with non-mass-mailings, plenty of people send hundreds of emails every day, many of them to people they haven't previously corresponded with. If a puzzle takes 20 seconds and applies to 10% of 600 emails per day (just for example), that is 60 puzzles, or 20 minutes per day.