The Journey of IBM zAware
By Erin Farr, Senior Software Engineer and Development Team Lead for IBM zAware
IBM zAware did not take the usual journey to become a product. No customer said, "I want a self-learning, analytics, out-of-band, monitoring solution to help me pinpoint what's anomalous in my z/OS systems." Instead, it all started with some complex problems.
System z environments are already incredibly resilient and very good at detecting failures and correcting or recovering from them. However, as systems and workloads become increasingly complex, the amount of data available to a customer for diagnosing a problem when one does arise is vast and unwieldy, which lengthens the time to restore service delivery.
IBM did a deep dive into customer Problem Management Records to study the problems that impacted customers the most. A complex set of failures emerged, each of which was unique, often rare, sometimes transient, and usually involved one or more components that were behaving somewhat abnormally but not failing outright. For example, one might see a single service delivery failure that by itself is not a problem, but in conjunction with another seemingly benign software aberration does become a problem. The characteristics of such problems also make them very difficult and time-consuming to diagnose.
It was time to take a new approach to reduce the impact of this specific class of problems.
So, what do you do when you have a system that generates vast instrumentation data (for example, OPERLOG, SMF/RMF data), when you own the whole System z hardware, firmware and operating system stack and you have researchers creating algorithms to extract the relevant information from large quantities of data?
You put them together, of course, and build more autonomic capability and machine learning into the System z platform to help with detecting and diagnosing unusual behavior. I say it as though the story ends there, but the journey continues. What data should we analyze? What kind of analytics should we use? How do we not adversely impact the system we are monitoring if heavy calculations are required? Will the analytics keep up with the vast amounts of data? The business team wanted a proof of concept.
So we started with what works. We discovered the IBM Haifa Research lab had some proof points for use of machine learning to detect anomalous log information on Power Systems and System x platforms, in a post-processing manner. We worked with Haifa to re-purpose and use that technology in a real-time environment. We could use their message pattern analysis against z/OS OPERLOG data, learn what is normal for a z/OS system and highlight to the system administrator, in near real-time, what messages are unusual for that specific z/OS system. Using machine learning to tackle this problem provided an additional advantage: any applications that write well-formed messages to OPERLOG, including in-house or third-party applications, would be included in this analysis.
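To make the idea of "learning what is normal" concrete, here is a minimal sketch in Python. It is not the Haifa algorithm or the IBM zAware implementation; it assumes a much simpler stand-in technique (scoring each message ID by how rare it was during a training period), and the function names, message IDs, and threshold are all hypothetical.

```python
import math
from collections import Counter


def learn_baseline(training_message_ids):
    """Count how often each message ID appeared during the training period."""
    return Counter(training_message_ids)


def rarity_score(msg_id, baseline, total):
    """Return a score that grows as the message ID gets rarer in the baseline.

    Add-one smoothing keeps never-seen IDs from producing log(0); they
    simply receive the highest score.
    """
    count = baseline.get(msg_id, 0)
    return -math.log((count + 1) / (total + 1))


def unusual_messages(interval_message_ids, baseline, threshold):
    """Return the sorted, distinct message IDs in an interval that score at or above the threshold."""
    total = sum(baseline.values())
    return sorted(
        {m for m in interval_message_ids
         if rarity_score(m, baseline, total) >= threshold}
    )


if __name__ == "__main__":
    # Hypothetical message IDs, not real z/OS messages.
    baseline = learn_baseline(["ABC001I"] * 50 + ["ABC002I"] * 49 + ["ABC003E"])
    interval = ["ABC001I", "ABC003E", "XYZ999E", "ABC002I"]
    print(unusual_messages(interval, baseline, threshold=3.0))
```

Because the baseline is built from one system's own history, the same message can be routine on one system and flagged on another, which is the "unusual for that specific z/OS system" property described above.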
We still need to be able to detect the anomalous behavior without adversely impacting the z/OS system that we are monitoring. To address this, the analytics would run "out-of-band", i.e. outside the system we are monitoring. While many alternatives were discussed regarding where IBM zAware would live (e.g. SE, CF, LPAR, others), the most advantageous was to model the Coupling Facility and create a new partition type, "ZAWARE." This partition would be used to run the analytics and serve up the results through a Graphical User Interface (GUI). Packaging IBM zAware as a firmware partition means the user needs to do only minimal configuration. No knowledge of the underlying operating system or software stack is required. All customer interfaces are through a GUI which is started automatically upon partition activation. This "stand-alone" partition direction also gives us the benefit of being able to hook into data from anywhere on System z moving forward.
Now we just need to get the data over there. We wanted to use the existing architecture and interfaces whenever possible and not require customers to install clients to send data to IBM zAware. Therefore, we decided to hook into z/OS Logger, which already has OPERLOG data and, with minimal configuration, can now send it to IBM zAware. Use of z/OS Logger also allows us easy access to other z/OS data types (for example, SMF records) moving forward.
We also had to decide how to integrate IBM zAware with other monitoring tools in a customer's environment. We did not want the alerting functionality to be separately managed and configured in customer environments, that is, to be "one more thing" to configure. Higher level service monitors, such as Tivoli or third-party vendor products, have a view across the entire enterprise, for which IBM zAware provides one data point. We recognized that a customer might want an integrated viewpoint, rather than rely on multiple, separate tools. Therefore, analytics results will be available as XML data to higher level service managers where alerting rules can be created.
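As a rough illustration of how a higher level service manager might turn XML results into an alerting rule, here is a short Python sketch. The XML layout, element and attribute names, scores, and threshold below are invented for the example; they are not the actual IBM zAware output format.

```python
import xml.etree.ElementTree as ET

# Hypothetical results document; the real IBM zAware XML schema is not shown here.
SAMPLE_RESULTS = """<analysis system="SYSA">
  <interval start="2012-10-01T10:00:00" score="12.5"/>
  <interval start="2012-10-01T10:10:00" score="99.1"/>
</analysis>"""


def intervals_to_alert(xml_text, threshold):
    """Return (system, interval start) pairs whose score meets the threshold."""
    root = ET.fromstring(xml_text)
    system = root.get("system")
    return [
        (system, interval.get("start"))
        for interval in root.iter("interval")
        if float(interval.get("score")) >= threshold
    ]


if __name__ == "__main__":
    # A simple alerting rule: flag any interval scoring 90 or higher.
    print(intervals_to_alert(SAMPLE_RESULTS, threshold=90.0))
```

Keeping the rule in the consuming tool, rather than in IBM zAware itself, matches the goal above of not adding "one more thing" to configure.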
We created a software proof of concept in early 2009, streaming data in real time from our internal Integration Test z/OS systems, to ensure the analytics could keep up with the vast amounts of data. Because we wanted to verify that the Haifa technology would accurately pinpoint anomalies in System z MVS console logs, we took an unusual step and, under the appropriate agreements, worked with a set of customers to acquire their formatted OPERLOG data, so we could test the analytics using real production data.
Next, we had to turn our software proof of concept into well-behaved firmware, with a GUI that automatically starts upon partition activation and hopefully accounts for every scenario a user would need. Now that IBM zAware has become generally available, I'm looking forward to hearing feedback, requirements, and thoughts about where this technology should go next.
I'm excited to have had the opportunity to work on this product, not just because during development of the proof of concept I had so few meetings and got to code all day, but because I believe this is an emerging direction in systems management. With IBM zAware, we are taking analytics and applying it to a new domain, system instrumentation data, for purposes of system availability and resiliency.
This is part one of a four-part series on IBM zAware, a product that was introduced with the zEnterprise EC12. Stay tuned to the Mainframe Insights blog for more on zAware!
zAware Installation and Startup (Angela Fatzinger)
Top 10 Most Frequently Asked Questions about IBM zAware (Aspen Payton)
Erin Farr is a Senior Software Engineer and the Development Team Lead for IBM zAware. She has 15 years of System z experience spanning z/OS UNIX Operating System development and porting, networking security, virtualization and availability management. She enjoys playing basketball and traveling.