Taking Adaptive MapReduce out for a spin
IBM InfoSphere BigInsights 2.1 was recently released and is chock-full of features. Among these are BigSQL, GPFS FPO, improved reliability, and more.
As a Platform Computing person, the most exciting new feature to me is the introduction of Adaptive MapReduce - essentially a stripped-down IBM Platform Symphony scheduler included as part of the BigInsights Enterprise Edition. While a lot of the cool functionality in Symphony requires that the Platform Symphony Advanced Edition be licensed, Adaptive MapReduce has the potential to significantly boost the performance of many types of Hadoop workloads.
This weekend I finally got around to installing BigInsights 2.1 to kick the tires myself. When enabled at installation time, Adaptive MapReduce installs Symphony in place of the Apache Hadoop scheduler. Last year we had an external auditor benchmark Platform Symphony, and the results showed a 7X performance advantage with some real-world workloads; I was keen to see whether the performance advantages would be as compelling in the latest BigInsights software.
I installed two single-node BigInsights clusters in separate VMware virtual machines. The two nodes were identical in terms of memory, disk, and vCPUs. On the first cluster I installed BigInsights with the standard Hadoop scheduler (the top window in the video below), and on the second identical cluster I enabled Adaptive MapReduce with everything else configured identically. I then ran a sample Hadoop application from a single console with windows opened to each test cluster.
This test, used by Cloudera and others, specifically stresses the Hadoop scheduler, and we know from other benchmarks that being faster in raw scheduling translates into performance gains with real workloads as well. The test shows that IBM BigInsights with the Platform Symphony scheduler turned on runs circles around the open-source Hadoop scheduler on which competing Hadoop distributions are based.
The great news for existing IBM customers is that this new capability is included at no additional cost when they upgrade to the latest version of BigInsights Enterprise Edition. When you can run workloads faster, the same workloads can potentially run on less hardware, enabling better service levels at a reduced cost.
The demo is self-explanatory, I think. In the upper window I run a Hadoop job comprised of 200 tasks on the regular Hadoop scheduler. In the lower window I submit exactly the same job on an identical BigInsights cluster with Adaptive MapReduce (Symphony) turned on. After Symphony completes several jobs while the open-source scheduler is still running, I try to get Symphony to break a sweat by submitting a larger job comprised of 2,000 map tasks - ten times larger than the job submitted in the upper window. Even this larger job finishes in a small fraction of the time required by the workload on open-source Hadoop.
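For readers who want a sense of what a scheduler stress test like this looks like, here is a rough sketch using the standard Hadoop "sleep" example job, which submits many near-empty tasks so that scheduling overhead dominates the run time. The exact jar name and path vary by Hadoop distribution, so treat this as an illustration rather than the precise commands used in the demo.

```shell
# Sketch only: the examples jar name/location differs across Hadoop
# distributions; adjust HADOOP_HOME and the jar path for your install.

# Small job: 200 map tasks, each sleeping 1 ms, no reducers.
time hadoop jar $HADOOP_HOME/hadoop-examples.jar sleep -m 200 -r 0 -mt 1

# Larger job: 2,000 map tasks, like the second job in the lower window.
time hadoop jar $HADOOP_HOME/hadoop-examples.jar sleep -m 2000 -r 0 -mt 1
```

Because each task does almost no work, the wall-clock times reported by `time` mostly reflect how quickly the scheduler can dispatch and reap tasks, which is exactly what the side-by-side demo compares.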
The nice thing about this test is that anyone can easily replicate it and see the advantage themselves. Stay tuned for another blog post with the step-by-step recipe.
I was thinking that Symphony is kind of like a nitrous-oxide booster for your Hadoop cluster, but I realize now that this analogy doesn't really hold.
Nitrous boosters make your vehicle run only marginally faster for a short time and can cause engine damage. Symphony makes your cluster run dramatically faster all the time with no wear and tear, and you might even be able to use a smaller, more fuel-efficient engine. For some types of applications, turbo-charging the application with BigInsights and Adaptive MapReduce is clearly a no-brainer.
The opinions in this blog are mine alone and do not necessarily reflect the views of IBM.