Real Time Datamining and Aggregation at Scale


Real-time applications have long been considered off-limits for Hadoop clusters, yet Hadoop is often considered key to open-source exploitation of really large data streams. This talk shows how Storm and Hadoop can work together to achieve latencies that are typically under 5 ms, and almost certainly under 5 seconds, for a sample metrics application while still retaining years of data with high availability and durability. This is done using a hybrid system in which Storm and Hadoop cooperate to do something neither can do alone. In addition, I will describe new machine learning methods that exploit recent mathematical advances to produce an extremely efficient real-time learning system that performs near theoretical limits but is simple enough to explain in a few sentences. This system also uses a combination of Storm and Hadoop to provide real-time operation with durable history. The talk will provide a theory of operations, a system description, and a demo of a live system. All code will be made available on GitHub.

Watch the video of Ted Dunning's talk here.

Schedule info
Time slot: 4 June 14:45 - 15:05
Experience level:
Presentation Format: Short (20min)