Big data is the now-ubiquitous buzzword for data volumes ranging from a few terabytes in certain contexts to, more often, anywhere from one to dozens of petabytes. So what is Hadoop and how does it help us with these data volumes?
Hadoop itself is a framework that includes particular data-centric, distributed apps such as the Hadoop Distributed File System (HDFS), which supports up to 30 petabytes and essentially functions as a means of data access across potentially thousands of nodes. This is good news considering that the amount of data created (and replicated) in 2011 alone is estimated at somewhere under 2 zettabytes (2 trillion gigabytes).
At its core, Hadoop is an open source MapReduce implementation (see definition below). It was originally funded by Yahoo and emerged in 2006, but only reached a level of maturity suitable for the scale of the web by 2008.
The main open-source, distributed apps in the Hadoop framework are:
- MapReduce (see Wikipedia) - Considered the core of Hadoop, it enables parallel computation across clusters of potentially thousands of servers (the programming model was created at Google)
- Pig - High-level programming language for Hadoop computations
- Hive - Data warehouse with SQL-like access
- HDFS - Distributed redundant file system for Hadoop (means of distributed data access)
- Flume - Collection and import of log and event data
- Ambari - Deployment, configuration and monitoring
- HBase - Column-oriented database scaling to billions of rows (derived from Google's BigTable)
- HCatalog - Schema and data type sharing over Pig, Hive and MapReduce
- Mahout - Library of machine learning and data mining algorithms
- Oozie - Orchestration and workflow management
- Sqoop - Imports data from relational databases
- Whirr - Cloud-agnostic deployment of clusters
- Zookeeper - Configuration management and coordination
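Since MapReduce is the core of the list above, here is a rough feel for the idea in plain Python. This is only a single-machine sketch of the map, shuffle and reduce steps using the classic word count example; it does not use Hadoop or any Hadoop API, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are my own illustrative choices. On a real cluster, the map and reduce steps would run in parallel on many servers, with the framework handling the grouping in between.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on one machine.

    Hypothetical driver for illustration only; a real framework would
    distribute the map and reduce calls across many nodes.
    """
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "hadoop handles big data"]
    print(word_count(sample))
```

The point of the split is that each call to map_phase only needs one line of input and each call to reduce_phase only needs the values for one key, which is what makes the work easy to spread over many machines.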
More background reading:
- This is the best overview article I have found and the source for the above list - http://radar.oreilly.com/2012/02/what-is-apache-hadoop.html
- Wikipedia is always a good source as well - http://en.wikipedia.org/wiki/Apache_Hadoop
- It's great to get our heads around this because frankly it is one of the truly big and exciting things happening in the tech world. See ZDNet's "Big Five" Trend for the next half-decade.
What is IBM doing around Hadoop and Big Data in general?
- In 2010 IBM chose to embrace the Apache open source Hadoop code base (IBM jStart Hadoop overview) to create our own commercial implementation of Hadoop (think Red Hat Linux approach to open source, or IBM Symphony with OpenOffice), launched in 2011 under the banner of IBM BigInsights (within the IBM InfoSphere platform). This incorporates the incubated jStart BigSheets project, which uses IBM Cognos ManyEyes for visualization, among other options.
- See how InfoSphere BigInsights uses Hadoop.
- BigSheets YouTube demo.
- InfoSphere Identity Insight (formerly Entity Analytics) is another area where IBM was way ahead of the curve in making sense of Big Data. Think "six degrees of Kevin Bacon" and resolution of name variations etc. Here is a recent article I re-tweeted in which IBM Distinguished Engineer and guru Jeff Jonas discusses sensing in real time in order to take action in time.
- InfoSphere Streams is a related technology for fishing actionable insight from massive streams of data.
- Ultimately these Hadoop-based and other technologies are a Big Data complement and extension to the rest of the IBM Analytics stack including Cognos, SPSS and Content Analytics.
- IBM was just identified by Forrester as one of two clear leaders in offering and strategy in 2012 (Amazon Web Services was the other, which I understand ironically uses IBM cloud technologies... more coopetition!). Here is a great internal article on w3, and here is the Forrester article and chart. Lastly, here is a great list of IBM Big Data success stories.
- Lastly, though not specifically Hadoop-related, the Netezza high-performance DW appliance is another important part of the IBM Big Data offering, enabled by its industry-leading, patented Asymmetric Massively Parallel Processing (AMPP) technology.
What are others doing and saying?
- Oracle and Microsoft have jumped into the fray as well, as you might imagine:
  - Oracle partners with startup Cloudera to provide a big data appliance with Hadoop installed. They are bundling hardware and software with Cloudera offerings, building on past announcements around Exalytics, which was (is?) to support Hadoop as well.
  - Microsoft drops its competitive technology Dryad and wants to extend Windows Server with Hadoop. This is obviously huge in cementing Hadoop's place in our future too. Hello VHS and goodbye Betamax. OK, you may have to be over 30 to chuckle at that one.
- SAP seems to have staked its future on HANA as its database and as its big data strategy. It is mostly in-memory, but its framework covers everything from standard relational and OLAP to unstructured and big data in general. In its recent roadmap announcements, SAP also announced the likely extension of HANA with Hadoop.
- Along with cloud startups in general, Hadoop-based tech startups are fast becoming the new darlings of the venture capital community. Here is GIGAOM's list of five startups that could change the face of big data. These businesses are focused not on Hadoop infrastructure like Cloudera, MapR, Hadapt (and many others) but rather on business application use cases.
- This is worth reading too: GIGAOM's six reasons 2012 is the year of Hadoop.
Get started with Hadoop:
- If you are a real techie, check this getting started guide from Apache.
- If you are slightly techie and have access to a Linux system, you can download IBM BigInsights for free to try out.
- Don't forget this BigSheets YouTube tutorial mentioned above.
- Here are some fun Hadoop visualizations:
  - What does Hadoop look like as it is at work across thousands of distributed servers querying petabytes of data? Check out this 3D animated visualization (streams going straight up are transfers in and out of the internet).
  - Here is an IBM ManyEyes Tag Cloud visualization of a somewhat techie bunch of Java code.
Until next time, happy Hadooping...
Kyle McNamara a.k.a. 1in400k
Easy URL: McNamara.me/techblog