By Stephen Smith, IBM Redbooks
Data. It comes in many forms, from practically everywhere, and contains everything from the mundane to the critical: digital pictures and videos; emails; medical, financial, and purchase transaction records; posts to social media sites and blogs (like this one). Every day, we create 2.5 quintillion bytes of data. Data that, for the most part, never goes away, and that continues to multiply.
Consider this: 90% of the data in the world today was created in the last two years. Hence the term big data.
If we are to work with big data, and ultimately capitalize on it, we need to first understand it. Big data spans 3 dimensions:
Volume. Big data comes in one size - large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
Velocity. Often time-sensitive, big data must be used as it is streaming into the enterprise to maximize its value to the business.
Variety. Big data extends beyond structured data to encompass unstructured data of all types: text, audio, video, click streams, and log files.
Beyond the challenge of simply handling big data is the opportunity to explore new and emerging types of data, to create agility in data handling, and provide solutions for managing large volumes of structured and unstructured data.
Consider big data as the newest natural resource.
As part of IBM’s big data platform for handling big data, IBM InfoSphere Streams allows you to mine, refine, and deliver data for enhanced business value, all of the time, just in time.
IBM InfoSphere BigInsights brings the power of Apache Hadoop to the enterprise. Hadoop is the open source software framework that is used to reliably manage large volumes of structured and unstructured data. InfoSphere BigInsights enhances this technology to withstand the demands of your enterprise, adding administrative, workflow, provisioning, and security features, along with best-in-class analytical capabilities from IBM Research. The result is a more developer- and user-friendly solution for complex, large-scale analytics.
Using BigInsights, organizations can run large-scale, distributed analytics jobs on clusters of cost-effective server hardware. Large data sets are broken into chunks and processed across a massively parallel environment. When the raw data is stored across the nodes of a distributed cluster, queries and analysis of the data can be handled efficiently, with dynamic interpretation of the data format at read time. The bottom line is that businesses can finally embrace massive amounts of untapped data and mine that data for valuable insights in a more efficient, optimized, and scalable way.
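The chunk-and-process pattern described above can be sketched in plain Python. This is a simplified, single-machine analogue of what Hadoop does across a cluster (the data, chunk size, and word-count workload are purely illustrative, not BigInsights APIs): the data set is split into chunks, each chunk is processed independently by a worker, and the partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Each worker counts words in its chunk independently,
    # much as a data node processes its local block of data.
    return sum(len(record.split()) for record in chunk)

def chunked(records, size):
    # Break the data set into fixed-size chunks for parallel processing.
    return [records[i:i + size] for i in range(0, len(records), size)]

# Illustrative "raw" records; in Hadoop these would live in HDFS blocks.
records = ["alpha beta", "gamma", "delta epsilon zeta"] * 100

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_chunk, chunked(records, 50)))

# Combining partial results mirrors the final aggregation step.
total = sum(partials)
```

The key property this sketch shares with Hadoop is that no chunk depends on any other, so the work scales out simply by adding more workers (or, in a cluster, more data nodes).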
Hadoop has 2 key components:
Hadoop Distributed File System (HDFS)
HDFS is the file system in which Hadoop stores data. HDFS provides a distributed file system that spans all the nodes within a Hadoop cluster, linking the files systems on many local nodes to make one big file system with a single namespace.
MapReduce
MapReduce is the distributed computing and high-throughput data access framework through which Hadoop understands jobs and assigns work to servers within the BigInsights Hadoop cluster.
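A MapReduce job is typically expressed as a map function that emits key/value pairs and a reduce function that combines the values for each key, with the framework grouping (shuffling) pairs by key in between. As a hedged illustration in plain Python (not the Hadoop or BigInsights APIs), the classic word-count job looks like this:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between
    # the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: combine all counts emitted for one word.
    return key, sum(values)

# Illustrative input; in a real job these lines would be read from HDFS.
lines = ["big data big insights", "data streams"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each map call sees only one line and each reduce call sees only one key, both phases parallelize naturally across the nodes of the cluster, which is what makes the pattern a good fit for the distributed storage that HDFS provides.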
To facilitate easy sizing, the InfoSphere BigInsights reference architecture provides 3 predefined configurations:
Full rack
Each of these configurations can be further customized to optimize performance. Each configuration consists of these hardware components:
IBM System x3550 M4 servers as management nodes
IBM System x3630 M4 servers as data nodes
IBM System Networking RackSwitch switches for networking
These servers with Intel Xeon E5 processors provide the performance needed for such configurations.
The InfoSphere BigInsights reference architecture also provides a predefined configuration for HBase, the specialized database that is implemented within the Hadoop environment and is included in InfoSphere BigInsights 2.x.
The IBM Redpaper, IBM System x Reference Architecture for Hadoop: IBM InfoSphere BigInsights Reference Architecture, REDP-5009, provides an overall assessment process that helps you:
Discover the client’s technical requirements and usage (hardware, software, data center, workload, user data, and high availability).
Analyze the client’s requirements and current environment.
Propose big data solutions based on IBM hardware and software.
Additionally, this reference architecture provides everything you need to know for implementing a BigInsights solution, including:
Cluster node, data node, and networking configurations.
Deployment considerations, such as scalability and availability.
Customizing the predefined configurations for maximum availability, performance, cost effectiveness, ingest rates, and storage capacity.
Information on ordering the equipment for the predefined configurations.
As I said earlier, handling big data isn't just a challenge; it's the perfect opportunity to harness and reap invaluable benefits from this vast resource. And IBM InfoSphere BigInsights, combined with the power of Apache Hadoop, provides you with all of the tools you need to do it.
Download the reference architecture from the IBM Redbooks site: IBM System x Reference Architecture for Hadoop: IBM InfoSphere BigInsights
Stephen Smith is a Senior Technical Writer at IBM Redbooks. His background in IT technical writing extends over 25 years, during which he has authored over 150 publications that cover a wide assortment of technologies, most notably IBM products and services.