Monday, December 7, 2009

Junk Email Filter Spam and Virus Filter

SkyHi @ Monday, December 07, 2009

How it Works!

The most advanced Spam and Virus Filter on the Planet!



It's all in the details

"So Perkel", you might ask, "What the hell do you do that makes your spam filter so damn good? There are plenty of mail hosts that use Spam Assassin like you do and don't get near the results you get."

The secret to this spam filter goes far beyond just using Spam Assassin. We use every trick out there and then some that we invented ourselves. But the real trick is integrating it all together into a system that becomes far more accurate than the sum of its parts. We also don't have the "inside the box" limitations of a lot of network admins, so we view spam as something that isn't a black and white issue - but as shades of grey, and we process it that way. That's how we get 99% plus accuracy: we can label messages as spam yet still pass them on to you, so that you don't lose email to false positives. And we give end users the ability to customize their email experience so that they can fine tune what they get and don't get.

Most spam filters make the mistake of deciding if a message is or is not spam. In this model they are either right or wrong. They either delete the spam causing the recipient to lose the false positives - or they put all the spam into one giant folder where the false positives are buried in thousands of spams where they never will be found. If you get so much spam that you will never find a good message even if it's saved then it is as good as lost.

What is different about our filter is that it grades the spam and non-spam into multiple levels and treats them all differently. This system differentiates between low scoring spam and high scoring spam. It separates out the low scoring spam and delivers it to the recipient, but tagged as low scoring spam. So - if nonspam is wrongly tagged as low scoring spam - the user still gets it - and they get it in a way that they can actually find it. So - false positives are not lost like in other spam filter systems.

Since spammers are on the run and quickly generate a high number of complaints, they have to behave differently than legitimate email servers do. Spammers are constantly trying to trick the system and impersonate other people. But it is these tricks that give them away, because once we can identify the trick it becomes a rule that catches 100% of the spam using that trick. We try to focus on the spammer's behavior to identify spam rather than on the content of the message, because behavior based identification is far more accurate than content scanning.

Even the high scoring spam has several levels. The highest levels are rejected at connect time. This filter starts processing the message as it is being delivered. Spammers are often deferred because they connect to my fake highest MX record, try to impersonate one of my domains, or have virus content. 99% of spam is dropped without even having to look at the message. Of what is left, the very high scoring spam is dropped and the high scoring spam is bounced.
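
The grading described so far can be sketched in a few lines of Python. This is only an illustration of the idea, assuming a numeric spam score like the one Spam Assassin produces. The threshold values and the names `Action` and `grade` are made up for the example and are not the settings this filter actually uses.

```python
from enum import Enum


class Action(Enum):
    DELIVER = "deliver"                # nonspam, delivered normally
    DELIVER_TAGGED = "deliver_tagged"  # low scoring spam, delivered but tagged
    BOUNCE = "bounce"                  # high scoring spam
    DROP = "drop"                      # very high scoring spam


# Hypothetical thresholds, chosen only for illustration.
LOW_SPAM = 5.0
HIGH_SPAM = 15.0
VERY_HIGH_SPAM = 30.0


def grade(score: float) -> Action:
    """Map a spam score to one of several levels instead of a yes/no verdict."""
    if score >= VERY_HIGH_SPAM:
        return Action.DROP
    if score >= HIGH_SPAM:
        return Action.BOUNCE
    if score >= LOW_SPAM:
        return Action.DELIVER_TAGGED
    return Action.DELIVER
```

The point of the extra levels is the `DELIVER_TAGGED` case: a message that only just crosses the spam line still reaches the recipient, so a false positive at that level is not lost.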

The multiple classification technique reflects the reality of messages, because some email is neither fully spam nor fully not spam. A newsletter that you subscribe to is not spam even though it contains ads, but if I got the same newsletter without signing up for it - I would call it spam. Sometimes you get unsolicited political email - and if you agree with the politics - it's not spam. But if you don't agree with it - it is spam - to you. So some messages classified as low scoring spam are not really false positives - they really are low scoring spam. What is different is that this system still delivers them to you, tagged in a way that makes it easy for you to get your nonspam first and then check your spam to see if anything was marked wrong.

These servers run a mix of off the shelf technology and custom technology that I developed. Most of what we developed is the way we put the off the shelf technology together and the concept of grading messages far beyond just spam and nonspam. The advantage of using off the shelf technology like Exim, Spam Assassin, and ClamAV is that lots of other people are working on these programs to constantly make them better, and they are widely tested. As they get better this system gets better. And much of the work that we do here goes back into these products, either in the form of rules we develop or features I request. Most of the developers we work with have been very good about adding new features as we request them. Many of the spam filtering technologies that are used in open source spam filtering software were developed here.



Working with the Spam Filtering Community

Some people ask, "Does it make sense to work with your competitors?" The simple answer is, in this business, yes it does. We work with both open source projects and commercial vendors because we are all on the same side fighting spam, and if we work together we can all do a better job. If we do a good job then there is a bigger market for all our products. Instead of fighting over our share of a small pie, we grow the pie and we all do well. Our role in the spam filtering community includes:

  1. We started out writing Spam Assassin rules to help improve the accuracy of one of the finest open source spam filtering projects on the planet.

  2. We make our Hostkarma database public so that anyone on the planet can access our black/white/yellow/brown lists. We use other people's lists for free and we return the favor by making our lists public and free.

  3. We capture spam and reduce it to hash codes that we share with other services such as Ixhash and Razor databases.

  4. We harvest URI information from spam and feed it to URI blacklist providers.

  5. We provide IP based Hostkarma data to both private and public blacklists and whitelists for use by open source and commercial spam filtering projects.

  6. We provide information to third party Clam Antivirus database providers who target phishing and fraud scams and create fingerprints to identify these scams.

  7. When we detect phishing and fraud email we forward the email to various authorities and abuse email addresses so that these sites can be quickly shut down.

  8. Like AOL, we send out automated notifications to services providing Internet access to alert them about IP addresses that appear to be virus infected spam bots or hacked computers. This allows admins to be alerted early and take action to shut down spam at the source. We feel that the best way to block spam is to eliminate the spammer at the source. Our automated reports and self service analysis tools allow admins to quickly find and fix the source of the problem.

  9. We share our technology with other providers. Often we improve on their technology and send them our improvements for everyone to use. Sometimes they improve on our technology taking an idea that we developed that barely worked and turning it into something that works great.

Cooperation has proven to be one of our best policy decisions in blocking spam. Because of cooperation we can block far more spam than we could if we went it alone. And besides the satisfaction of blocking millions of spams a day on our servers, we know that our public information and our data feeds are being used to block billions of other spams a day and to shut down the spammers at the source. This makes spamming less profitable and helps to reduce spam planet wide. We feel like we are cleaning up the trash on the information highway and making the virtual world a better place to surf.



Most Spam is Rejected at Connect Time

Most of the spam doesn't even make it to Spam Assassin to be rejected. It's rejected at connect time. We use a variety of tricks to do this. As it turns out - spammers are becoming more clever - but it's often their clever tricks which cause them to be more accurately identified as spam. Real email doesn't pull these kinds of stunts. And it seems like most spammers have an easily identifiable stunt. Interestingly enough - it's the old fashioned low tech spam that has an easier chance of getting through than the new high tech spam does.

We use Exim as our mail transfer agent. The scripting language in Exim is far superior to anything else out there and the things we are doing can not be done with any other MTA. Exim gets rid of more spam than Spam Assassin.

Here's some of what we do at connect time.

  1. A lot of spammers target the highest MX record instead of sending to the lowest one like they are supposed to. They figure that the "backup" mx server probably has the least amount of spam protection. This is a standard trick of virus infected spam bots. These spammers usually go for the highest MX and never retry the lower ones. So - our simple solution is that on our highest MX record we have a dummy server that returns a temporary error on EVERYTHING that connects to it. The temporary error tells the server there's a problem and come back later and try again. Spammers rarely do. This server is actually on the same computer as our lowest MX record so it is never really up when the main one isn't and in theory should never get a legitimate email. But - in case it should the temporary error will allow it to retry the correct server and deliver the email a little later without ever losing a real message. Of the spam this rejects - it's 100% accurate.

  2. Additionally, when hosts connect to our fake MX records their behavior is noted and stored in our blacklisting system. We have developed our own database called Hostkarma which we use to track the reputation of the sending host IP. This drives our black, white, yellow, and brown lists that help us preroute email coming from IP addresses based on their past reputation. We look for things like whether the sending server closes the connection properly using the QUIT command or leaves it open, as spammers usually do because sending QUIT takes additional time and bandwidth. IP addresses that connect to our fake MX records and don't issue QUIT commands are quickly identified as spam bots and are blacklisted much faster than on most other blacklisting services.

  3. Besides tracking bad IP senders in our Hostkarma database we also track good hosts for white listing and mixed hosts for yellow listing. Some servers never send any spam at all under any conditions. So we track those servers as well, and we identify good sources of email aggressively so that when good sources send email to our servers we can fast track it and forward it without further testing that might otherwise result in a false positive. This not only improves accuracy but reduces system load and allows our servers to process more email at faster delivery times.

  4. Yellow listing is for mixed source IP addresses like Yahoo, Hotmail and GMail. Most of their mail isn't spam but some does get through. If they are doing a good job they could accidentally get whitelisted. If they were doing a poor job they could accidentally get blacklisted. So we invented a classification called yellow listing so that servers that are yellow listed skip all host based testing and are protected from either black or white listing. A yellow listed host skips lookups in other IP based black lists, which reduces false positives.

  5. We do use other black lists. Some black lists are very good and nearly 100% accurate for those IP addresses they list. There are other lists that catch more - but they have too many false positives. We do use these lists too - not to reject email - but to force servers to retry to see if they are a real server or a spam bot. Remember, spammers don't retry, so much spam is rejected just by creating some extra work for the sender. Spammers can make more money going on to the next unprotected system than trying to get through our fortress.

  6. Hosts with no reverse DNS are also probably spammers. And hosts whose sender address we can not verify are often spammers, although in this classification there are also a lot of poorly configured servers. But - if several of these indicators are combined we can safely reject the email at connect time. For example - if an IP address is blacklisted and it has no reverse DNS - I drop the connection or force a retry. Or if the IP address is blacklisted and we can't do sender verification we drop the connection.

  7. We also use Forward Confirmed Reverse DNS (FCrDNS). FCrDNS allows us to verify if the reverse DNS is real or fake. After we look up the reverse DNS we look up the returned host name to verify that it resolves back to the original IP address. If it is fake or misconfigured we flag it for higher scrutiny. For those hosts that pass FCrDNS we can use the host name to apply to our name based black lists and white lists in our Hostkarma database. We are the only database that supports both IP and name based lookups and white/black/yellow/brown results on those lookups. Name based lookups where the hostname is confirmed are extremely accurate because it's something that spammers can't spoof. We are aggressively building list data based on name based lookups to improve the speed and accuracy of our systems.

  8. Once the host has made it to the point of connecting - it has to say HELO and identify itself. Many spammers - for some reason that I don't know - try to impersonate one of my hosts. This is a dead give away that they are spammers. I think it's because the clients who I forward email for identify this way and the spammer is pretending to be one of my customers. However - my real customers are either coming in on a blessed IP address or have authenticated themselves to the system. So - if they are pretending to be one of my domains after that - they are definitely a spammer - and I drop the connection.

  9. Systems that use an IP address for a HELO are dropped. Why they do that is a mystery to me because it's a dead giveaway they are a spammer. If the spammer tries to impersonate one of our domains to try to trick us into believing they are local email, we can detect that and bounce it as well.

  10. After the HELO the sender sends the sender address. We do have some black listing of our own for a few pests that we have banned from the system. But the real trick that gets rid of spam is that we do sender address verification as soon as the sender address is received. We use sender address verification in a responsible way. We can cache results, reducing repeated lookups, and we can track hosts that use wildcard addresses to prevent repeated lookups to these hosts. We also only do sender verification after we do recipient verification, so spammers who are doing dictionary attacks spoofing some third party domain don't cause us to create a lot of traffic to the third party's servers.

    Sender address verification is done to determine if the sender's email address is actually a working email address. If it isn't - I drop the connection. I do this by initiating a bounce message to the email address that the message is supposed to be from. If the message is rejected as an un-routable address - I drop the connection. If however there is some sort of connection failure and the message fails for other reasons than being un-routable - I accept it - but I tag it so that the bayesian filter can consider it.

  11. If a message has made it this far then they send the recipients list. Once we get that far we do recipient callout verification to see if the email addresses that are being sent to are real email addresses. And Exim keeps a count of bad recipients. If the email is sent to multiple users on our domain(s) and there are more than 3 bad email recipients in a single connection - then it's someone who is phishing for names to spam or has harvested fake email names off web sites and is trying to spam the fake names. I use several of these just to trap email harvesters and detect them. Sending email to lots of bad recipients and to spam trap recipients causes me to drop the connection and prevent the spam from going to the real email addresses that were harvested as well. This has also been very effective and accurate.

  12. We employ a sophisticated honeypot system that contains known bad recipients that are on no legitimate lists. Spammers harvest these names and include them in their spam lists and we can detect spam sources by those who are trying to spam these recipients.

  13. We also use temporary errors to reduce system load and to reject spam. We have developed a new technique we call "The Penalty Box" where if we receive a spam from an address we temporarily block subsequent spam attempts with a temporary reject error. This error is the "come back later" error that servers usually use if there is some sort of temporary system problem. Often spammers send the same spam over and over and the system identifies it over and over. But spammers usually try a message to a specific recipient only once and move on, where real email tries over and over. The temporary error causes a rejection without having to process the email. This is similar to "grey listing" but without the side effect of delaying mail from senders unknown to the system.

  14. Spam can often be detected by looking at what the spammer wants you to do. In many cases they want you to click on a link. We extract links from email and look them up on URI blacklists which are lists of web sites that spammers link to. This is a very effective means of spam control. We were the first service to develop URI based spam filtering and presented the trick to the spam filtering community who later took the idea and developed URI blacklisting. This is an example where the sharing of technology has paid off in that they make it far better than we could ever do ourselves.

  15. If the message makes it this far then it is scanned for file attachments. We let the virus filter chew on all attachments - but we also disallow Windows executable attachments. At present we do virus scanning first just to get an idea of what's out there and how fast it's spreading. If the attached file is a Windows executable but not detected as a virus - we reject it anyhow. No sane person is sending Windows executables these days and no sane person should open one if they get it. So if it isn't caught by the virus scanner then it's probably a new virus that hasn't been added to the virus database yet. There are often several hours between when a virus starts spreading and when the virus databases are updated. On my system I'm using the ClamAV virus scanner and we update the virus definitions every 30 minutes. But - we are still not going to pass unknown viruses - so - Windows executables get nuked.

    That leaves the compressed ZIP file viruses which are not as easily spread. The zip files have to be opened in order to do their dirty work and they come with a message that tries to trick the user into opening the file. The virus scanner does catch these once the virus definitions are available - but - some of these can and do make it through my email system. So - unless you are sure of what you are getting - be very careful about opening attached ZIP files. Virus rejection blocks thousands of viruses every day. These are all blocked at connect time and are rejected before the message is accepted.

    Server side virus filtering is far more accurate than having anti-virus software on your computer. Most people only update their virus definitions weekly so when a new virus comes out - they are still vulnerable. On the server side I can update the definitions every 30 minutes which greatly reduces the attack window of a virus. And - by rejecting Windows executables I can make sure that most of the new viruses are never delivered to your inbox at all.
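
A few of the checks from the list above can be sketched as one small decision function. This is a simplified illustration in Python, not the Exim configuration that actually runs on these servers. The `Connection` fields, the `LOCAL_DOMAINS` set and the verdict strings are all hypothetical names invented for the example. Only the "more than 3 bad recipients" limit comes from the description above.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Connection:
    """Facts gathered about one SMTP connection (illustrative fields only)."""
    helo: str
    trusted: bool = False          # blessed IP address or authenticated sender
    blacklisted: bool = False
    has_reverse_dns: bool = True
    bad_recipients: int = 0


LOCAL_DOMAINS = {"example.com"}    # hypothetical domains hosted on this server
MAX_BAD_RECIPIENTS = 3             # "more than 3 bad recipients" is rejected


def is_ip_literal(helo: str) -> bool:
    """True if the HELO name is a bare IP address, optionally in brackets."""
    try:
        ipaddress.ip_address(helo.strip("[]"))
        return True
    except ValueError:
        return False


def connect_time_verdict(conn: Connection) -> str:
    """Return 'reject', 'defer' or 'accept' for a connection."""
    # Item 9: an IP address used as the HELO name.
    if is_ip_literal(conn.helo):
        return "reject"
    # Item 8: claiming to be one of our own domains without being trusted.
    if conn.helo.lower() in LOCAL_DOMAINS and not conn.trusted:
        return "reject"
    # Item 11: too many nonexistent recipients in one connection.
    if conn.bad_recipients > MAX_BAD_RECIPIENTS:
        return "reject"
    # Item 6: two weak indicators combined; here we force a retry.
    if conn.blacklisted and not conn.has_reverse_dns:
        return "defer"
    return "accept"
```

For example, `connect_time_verdict(Connection(helo="192.0.2.10"))` is rejected outright, while a blacklisted host with no reverse DNS only gets the temporary error and has to retry. Each test is cheap and none of them needs the message body.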

There are other connect time tricks that either reject email, slow it down, or result in warning headers being added for later processing. But the combination of these techniques gets rid of the bulk of spam with near 100% accuracy without even having to process the message.
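
The FCrDNS check mentioned among the connect time tests is easy to show in code. The sketch below uses Python's standard `socket` module for the two lookups. The function names and the `reverse` and `forward` parameters are assumptions made for this example so that the logic can be tried without touching real DNS. Addresses are compared as plain strings, which is good enough for IPv4 but would need normalizing for IPv6.

```python
import socket
from typing import Callable, List, Optional


def reverse_lookup(ip: str) -> Optional[str]:
    """PTR lookup via the system resolver; None if there is no reverse DNS."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname: str) -> List[str]:
    """Resolve a host name to its addresses; empty list on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except OSError:
        return []
    return sorted({info[4][0] for info in infos})


def fcrdns_hostname(
    ip: str,
    reverse: Callable[[str], Optional[str]] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> Optional[str]:
    """Return the confirmed host name for ip, or None if FCrDNS fails."""
    hostname = reverse(ip)
    if hostname is None:
        return None  # no reverse DNS at all
    # The name only counts if it resolves back to the connecting address.
    return hostname if ip in forward(hostname) else None
```

A host that passes returns its confirmed name, which can then be used for name based list lookups. A host whose reverse DNS points at a name that does not resolve back to it returns `None` and would be flagged for higher scrutiny.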


Then Spam Assassin Gets It

After the preprocessing is done Exim hands the message to Spam Assassin for scoring and Bayesian analysis. A Bayesian filter is a statistical filter: you train it on what spam is and what non-spam is, and after it is trained it looks at the message to see if it is closer to spam or non-spam. The result is a percentage and it is scored as such. If it comes up 50% then no scoring is done. But at the edges it can add many points or subtract many points based on what matches.
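
To make the idea concrete, here is a minimal sketch of turning per-word spam probabilities into points. It uses the simple naive Bayes combining rule found in many hobby filters, which is not necessarily how Spam Assassin combines its tokens. The `MAX_POINTS` value and the linear scale are invented for the example. The only property taken from the description above is that 50% scores nothing and the extremes add or subtract the most.

```python
import math
from typing import Iterable


def combine(token_probs: Iterable[float]) -> float:
    """Naive Bayes combination of per-token spam probabilities in (0, 1)."""
    # Sum logs instead of multiplying so long messages do not underflow.
    log_spam = 0.0
    log_ham = 0.0
    for p in token_probs:
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Clamp the exponent so math.exp cannot overflow on extreme messages.
    diff = max(-700.0, min(700.0, log_ham - log_spam))
    return 1.0 / (1.0 + math.exp(diff))


MAX_POINTS = 4.0  # hypothetical largest adjustment in either direction


def bayes_points(prob: float) -> float:
    """Map a spam probability to points: 0 at 50%, +/-MAX_POINTS at the edges."""
    return (prob - 0.5) * 2.0 * MAX_POINTS
```

A message whose words are evenly balanced comes out at 50% and gets no points, while a message made mostly of words seen only in spam is pushed close to `MAX_POINTS`.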

Spam Assassin has thousands of rules as well as the Bayesian filter. These rules add or take away points based on what matches. Spam Assassin has made huge advances in accuracy over the years. The rules have become more and more accurate and there are more rules to affect the score. The more rules the better: apart from the extra processing load, each one increases the accuracy. Spam Assassin rules work like casino gambling. Take roulette for example. If 100 people play it once there will - on the average - be 47 winners and 53 losers. But - if they keep playing over and over the 6% edge the house has keeps turning more and more of the players into losers. With Spam Assassin - each rule increases the overall accuracy of the system so that even if some rules aren't very accurate - all the rules together are very accurate.
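
The arithmetic behind that claim can be checked with a toy model. Assume, purely for illustration, that every rule is right with the same probability and that the rules are independent, and let the majority decide. Real Spam Assassin rules are weighted and correlated, so this is not how it actually scores mail. The function name `majority_accuracy` is made up for the example.

```python
from math import comb


def majority_accuracy(n_rules: int, p_correct: float) -> float:
    """Probability that more than half of n independent rules are right.

    Use an odd n_rules so that a tie is impossible.
    """
    needed = n_rules // 2 + 1
    return sum(
        comb(n_rules, k) * p_correct**k * (1.0 - p_correct) ** (n_rules - k)
        for k in range(needed, n_rules + 1)
    )
```

With rules that are each right only 60% of the time, three rules together are right about 65% of the time, and the figure keeps climbing as more rules are added. That is the same small edge compounding that works for the casino.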

There are a lot of things that are misunderstood about Spam Assassin. Some people think that if you use the word "Fuck" or "Viagra" in an email it will be rejected. This is not the case. There was a time when that was true - but it isn't anymore. The vast majority of Spam Assassin rules focus on the Bayesian filter - the message headers - the structure of the message - and what the message links to. The links are scanned against a black list of sites spammers link to, so that the spamming becomes far less effective. It also is tied to the Razor database of spam signatures, which are generated from spam reported by thousands of servers that share the information. These cooperative lists allow everyone to benefit from cooperative reporting of spammers in a fast and automated way.



Conclusion

So - that's the basics of how it works without going into thousands of details. So - this is not just a server running Spam Assassin that has a black and white classification of spam and nonspam. It is a fully integrated system that is engineered to accomplish the task of getting you your email without losing any of it in the process.

Spam filters do not censor free speech. Spammers censor free speech. They clog up your email boxes and cause your real email to bounce. They fill your email accounts with so many messages that you hand delete good messages because they are sandwiched in between spam messages and you accidentally delete them. Not everyone has the time to spend hours every day deleting other people's free speech. Some of us would rather be doing other things. And - the listener has rights too and this is a tool that allows you to make messages go away that you don't want to see. And it does it the way you want it done.