Scalability Challenges in Big Data Science
Scaling complex data analysis applications has become one of the most actively discussed topics in data science and big data in recent years. We want to apply increasingly complex analysis methods to ever larger data sets. To achieve this, we need to combine methods from computational statistics and machine learning with scalable technologies such as NoSQL databases, stream processing, MapReduce frameworks, or concurrency concepts like actors. In practice this is often anything but trivial, as the two fields have quite different backgrounds. In this presentation we will talk about these challenges based on our experience with real-time social network analysis at TWIMPACT, and also in the broader context of machine learning methods in general. We will discuss how concepts like eventually consistent data stores, MapReduce, or stream processing relate to the requirements of machine learning methods, how the issue of scaling is usually addressed in machine learning, what the common ground is, and which issues we are currently facing.