IBM System x and Big Data – Optimizing IBM GPFS for IBM InfoSphere BigInsights
Interview with Phil Horwitz, IBM Senior Engineer, Systems Optimization Competency Center, and Workload-optimized Systems and Big Data Technical Lead
Beth: Hey Phil, it’s great to talk to you again. I’ve been thinking about a follow-up to the initial discussion we had on big data a few months ago. When we spoke on the phone recently, you mentioned the work you just completed on optimizing GPFS for BigInsights on System x. I thought readers would find it interesting.
Beth: First, what is IBM GPFS and what does it have to do with big data?
Phil: Hi Beth. I was thinking that this would be a great time to do a follow-up as well. GPFS stands for General Parallel File System. It’s IBM’s clustered, high-performance file management platform that gives fast and reliable access to a common set of data. It provides online storage management and scalable access, and it can manage petabytes of data and billions of files. So basically, GPFS can manage enormous amounts of information, which, as you can imagine, is critical for big data workloads. GPFS has been around for a long time. What’s interesting is that GPFS has a new feature called FPO, or File Placement Optimization, that BigInsights can leverage.
Beth: Let’s come back to FPO in a moment. Talk to me about BigInsights first. What’s original about it?
Phil: BigInsights is our software platform that extends the value of open-source Apache Hadoop. Its primary function is to help companies discover and analyze business insights hidden in large volumes of diverse data. This data includes things like log records, clickstreams, social media info, news feeds, e-mail, electronic sensor output and even some transactional data. A lot of times this data is ignored because it’s too difficult to process using traditional means. In addition to Hadoop, BigInsights includes quite a few useful open-source technologies, including Pig, Hive, HBase, Jaql, Lucene, Oozie, Avro, Flume, HCatalog, Sqoop and ZooKeeper.
But BigInsights is original because it increases the value of industry-standard Hadoop by including IBM-unique software such as:
A web console
A spreadsheet-like analysis tool
Application accelerators for social and machine data and performance features
Beth: Can you give me an example of how companies are using BigInsights on System x?
Phil: Sure. I’m part of an IBM team that’s working with a large automobile manufacturer. We’re looking at analyzing logs of machine data on problems that come from the sensors built into the cars. BigInsights is able to analyze that data to uncover trends to determine what went wrong. The idea is to use the data to improve the car’s design and/or make improvements to development or manufacturing, which will alleviate the problem and avoid potentially critical customer situations. So with BigInsights on System x, this company is able to better understand trends, react quickly and develop a better product.
Beth: How do GPFS and BigInsights work together?
Phil: I mentioned the new GPFS feature called FPO a few minutes ago. FPO keeps track of the location of data and can tell BigInsights where to schedule a particular job within the BigInsights System x cluster. Basically, it moves the job to the data as opposed to moving the data to the job. Let me give you an example. Say I have three racks with 20 servers in each. GPFS-FPO knows a copy of the data I need is located on the 60th server, so the job can be sent right to that server. This reduces network traffic, since the data does not need to be moved across the cluster. It also improves performance and efficiency.
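Editor's note: the "move the job to the data" idea Phil describes can be illustrated with a small toy sketch in Python. This is not GPFS-FPO or BigInsights code and it uses none of their APIs. The `Server` class, `build_cluster` and `pick_server` are hypothetical names made up for illustration, and the fallback order (same server, then same rack, then anywhere) is an assumption about how a locality-aware scheduler might behave, not something stated in the interview.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """A hypothetical server, identified by its rack and its slot in that rack."""
    rack: int  # 1-based rack number
    slot: int  # 1-based position within the rack

    def number(self, per_rack: int = 20) -> int:
        """Cluster-wide server number, counting rack by rack."""
        return (self.rack - 1) * per_rack + self.slot


def build_cluster(racks: int = 3, per_rack: int = 20) -> list:
    """Three racks of 20 servers, matching the example in the interview."""
    return [Server(r, s)
            for r in range(1, racks + 1)
            for s in range(1, per_rack + 1)]


def pick_server(replicas, busy, cluster):
    """Choose where to run a job, preferring servers close to the data.

    Assumed preference order (illustrative only):
      1. a free server that already holds a copy of the data
      2. a free server in the same rack as a copy
      3. any free server
    """
    free = [s for s in cluster if s not in busy]

    local = [s for s in free if s in replicas]
    if local:
        return local[0], "node-local"

    replica_racks = {s.rack for s in replicas}
    same_rack = [s for s in free if s.rack in replica_racks]
    if same_rack:
        return same_rack[0], "rack-local"

    if free:
        return free[0], "remote"
    raise RuntimeError("no free server available")


cluster = build_cluster()
replicas = {cluster[59]}  # the copy of the data lives on the 60th server

server, locality = pick_server(replicas, busy=set(), cluster=cluster)
print(server.number(), locality)  # prints "60 node-local"
```

When the 60th server is free, the job lands on it and no data crosses the network. In this sketch, if that server were busy, the job would fall back to another server in the third rack, which keeps the transfer inside one rack.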
Beth: Can you talk about the optimizations you made to GPFS to help it work better with BigInsights?
Phil: Our organization, the Systems Optimization Competency Center, is focused on solution optimization. I loved working on this project because we truly looked at it from the total solution perspective. We not only optimized GPFS, but we looked at the OS, the software stack above GPFS and the System x hardware below it, including network and I/O, to create an efficient and high-performing solution. Traditionally, we would have used a sorting-type workload, but we used a complex YCSB HBase workload instead to stress the total solution and simulate a more realistic client scenario. We did this because as Hadoop matures, it’s supporting more and more complex workloads. Analysis of the optimizations we performed showed a significant performance improvement and these optimizations are reflected in BigInsights 2.1, which was recently announced.
Beth: Thanks Phil. I look forward to speaking to you again soon.
For more information on IBM BigInsights 2.1 with GPFS-FPO, click here.
Connect with Phil Horwitz on LinkedIn.
Phil was interviewed by Beth O'Shea - IBM Marketing Communications and Sales Enablement STG
Connect with Beth on LinkedIn.