Highly Scalable Discriminative Spam Filtering
Content-based email spam filtering remains a technological challenge. The commercial incentive for spam senders results in an arms race between filtering methods and spam obfuscation techniques. The main difficulty is scale: email service providers and large companies have to filter millions or even billions of emails per day. Whereas these providers must employ filters that are accurate for emails of all kinds and languages, spam senders can exploit local weaknesses on one specific type of message. Hence, spam filters should be trained on as many ham- and spam-labeled emails as are available, which requires highly scalable discriminative classifiers such as support vector machines and logistic regression. In the last couple of years, several new algorithms have been developed that make it possible to train these methods on hundreds of millions of training instances with millions of attributes. In this talk I will review these methods and strategies in the context of email spam filtering.
Watch Michael Brückner's video talk here.