Profiling and performance-tuning your Hadoop pipelines
The Hadoop ecosystem now has several tools that let developers quickly produce pipelines of MapReduce jobs without descending to the verbose level of the Java MapReduce APIs. Unfortunately, these concise, higher-level tools often produce pipelines that are initially slow and difficult to optimize. This talk will describe Etsy's pipeline of hundreds of Cascading flows (and thousands of daily Hadoop jobs) and our approach to profiling and performance-tuning them. Concrete examples will include speeding up our initial log parsing by 10x, streamlining our serialization and deserialization, and producing so much JVM snapshot data from our Hadoop jobs that we needed more Hadoop jobs to summarize it all.
Watch the video of Aaron Beppu's talk here.