An interesting question was recently raised by one of my colleagues. We know that the IBM eX5 servers with the Intel Xeon processor E7 family provide advanced RAS capabilities (in addition to leading scalability and performance). We also know that traditional dual-socket Intel Xeon processor-based servers offer standard RAS features. The question was: what is the actual difference between standard and advanced RAS and, more importantly, can advanced RAS features really help to achieve better availability and, if so, to what extent? In other words, can we somehow quantify the benefits of the advanced RAS features? What measurable outcomes can we expect to see?
To answer these questions, we did some research and found a way to compare advanced and standard RAS features and to express the results in a measurable way. I'd like to share the key aspects here; the details can be found in the Reliability, Availability, and Serviceability Features of the IBM eX5 Portfolio Redpaper.
First of all, let's have a look at the IBM approach to hardware RAS strategy. The intent of 24x7 system availability is to reduce the impact of hardware failures on system operations. IBM traditionally classifies hardware failures into several categories:
Repair Action (RA): RAs are related to the industry standard definition of Mean Time Between Failures (MTBF). An RA is any hardware event that requires service on a system. Repair actions include incidents that affect system availability and incidents that are concurrently repaired.
Interrupt Repair Action (IRA): An IRA is a hardware event that requires a scheduled system outage to repair.
Unscheduled Incident Repair Action (UIRA): A UIRA is a hardware event that causes a system to be rebooted in full or degraded mode. The system experiences an unscheduled outage. The restart might include some level of capability loss, but the remaining resources are made available for productive work.
High Impact Outage (HIO): An HIO is a hardware failure that triggers a system crash that is not recoverable by immediate reboot. This failure is usually caused by the failure of a component that is critical to system operation and is, in some sense, a measure of system single points of failure. HIOs have the most significant availability impact on the system, because repairs cannot be made without a service call.
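To make the four categories concrete, here is a minimal Python sketch of how a hardware event might be sorted into them. This is only one possible reading of the definitions above: it treats IRA, UIRA, and HIO as progressively more severe kinds of repair action and returns the most severe one that applies. The class, field, and function names are all hypothetical and are not taken from any IBM tool.

```python
from dataclasses import dataclass
from enum import Enum


class RepairClass(Enum):
    RA = "Repair Action"
    IRA = "Interrupt Repair Action"
    UIRA = "Unscheduled Incident Repair Action"
    HIO = "High Impact Outage"


@dataclass
class HardwareEvent:
    # Hypothetical fields, invented for this illustration only.
    unscheduled_outage: bool = False      # the failure took the system down
    recovered_by_reboot: bool = False     # an immediate reboot (full or degraded) worked
    needs_scheduled_outage: bool = False  # the repair needs a planned outage


def classify(event: HardwareEvent) -> RepairClass:
    """Return the most severe category that applies to a serviced hardware event."""
    if event.unscheduled_outage and not event.recovered_by_reboot:
        return RepairClass.HIO   # crash that a reboot cannot recover
    if event.unscheduled_outage:
        return RepairClass.UIRA  # rebooted in full or degraded mode
    if event.needs_scheduled_outage:
        return RepairClass.IRA   # repair requires a scheduled outage
    return RepairClass.RA        # repaired concurrently, no outage


if __name__ == "__main__":
    crash = HardwareEvent(unscheduled_outage=True, recovered_by_reboot=True)
    print(classify(crash).value)
```

The sketch assumes that every event passed in already requires service, which is what makes it at least an RA.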
The ultimate design goal for IBM eX5 servers, as a part of the Enterprise IBM X-Architecture strategy, is to prevent hardware faults from causing an outage. Part selection for reliability, redundancy, recovery and self-healing techniques, and degraded operational modes are all used in the RAS strategy to avoid application outages.
Then, we developed a sample scenario to evaluate the availability of a scalable system, expressed as the number of repair actions of each type, for the IBM eX5 servers (x3850 X5 scale-up building block) and for traditional 2-way servers (scale-out building block).
The results show that advanced RAS features make a difference. An x3850 X5 server requires fewer than one third as many repair actions (RA) as a comparable scale-out infrastructure. Within that total, the eX5 scale-up approach has one quarter as many scheduled repair actions (IRA), performs one fifth as many self-recovering actions that use degraded mode (UIRA), and encounters fewer than one sixth as many unexpected outages that require manual intervention to repair (HIO).
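The ratios quoted above (3x, 4x, 5x, and 6x) can also be read as percentage reductions relative to the scale-out baseline. The short Python sketch below does that arithmetic. The function name is made up for this post, and because the RA and HIO figures were stated as "more than" 3x and 6x, those two results are lower bounds.

```python
def reduction_percent(times_fewer: float) -> float:
    """Convert an 'N times fewer' ratio into a percentage reduction vs. the baseline."""
    if times_fewer <= 0:
        raise ValueError("ratio must be positive")
    return (1 - 1 / times_fewer) * 100


# Ratios as quoted in the text; RA and HIO are "more than" these values.
ratios = {"RA": 3, "IRA": 4, "UIRA": 5, "HIO": 6}

for name, ratio in ratios.items():
    print(f"{name}: {reduction_percent(ratio):.1f}% fewer events than scale-out")
# RA: 66.7% fewer events than scale-out
# IRA: 75.0% fewer events than scale-out
# UIRA: 80.0% fewer events than scale-out
# HIO: 83.3% fewer events than scale-out
```

Put differently, going from 3x to 6x moves the reduction from roughly two thirds to roughly five sixths, so the biggest relative gain is in the outage class that hurts the most.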
What makes that difference? Key points to highlight include:
Fewer components are used to build the scale-up infrastructure.
The lower the number of components, the lower the number of overall failures that might happen in the infrastructure.
More extensive memory protection techniques.
Redundant bit steering increases the effectiveness of Chipkill.
Memory failures are the most common cause of system downtime, and redundant bit steering (RBS) effectively doubles the number of Chipkill actions that each server can sustain.
Redundant processor-to-I/O hub connections.
Ability to self-recover from processor failure.
If the primary processor (the processor used for booting the operating system) fails, then an eX5 server can use a secondary processor to boot the OS, as the server still has access to the integrated I/O devices because of the redundant links between the processors and I/O hubs.
Two interconnected 4-way nodes create an 8-way building block.
Self-healing from a single-node failure.
Two interconnected nodes form a resilient 8-way configuration. If a single node fails, the system can be restarted in degraded mode, which avoids an HIO.
In addition, the scale-up infrastructure often requires fewer network and fabric ports and fewer connections to the electrical power infrastructure, which simplifies cabling and requires fewer networking infrastructure building blocks.
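The component-count argument can be illustrated with a simple model: if parts fail independently at a constant rate, the expected number of failures per year is the sum of (part count x annual failure rate) over all part types. The Python sketch below applies that model. Every number in it is a made-up placeholder chosen only to show the calculation; none of the rates or inventories come from the Redpaper or from IBM data.

```python
def expected_failures(inventory: dict, annual_failure_rate: dict) -> float:
    """Expected failures per year, assuming independent parts with constant rates."""
    return sum(count * annual_failure_rate[part] for part, count in inventory.items())


# HYPOTHETICAL annual failure rates per part (placeholders, not measured data).
rates = {"power_supply": 0.02, "system_board": 0.01, "network_port": 0.005}

# HYPOTHETICAL inventories for the same total socket count.
scale_out = {"power_supply": 8, "system_board": 4, "network_port": 8}  # four 2-way servers
scale_up = {"power_supply": 4, "system_board": 2, "network_port": 4}   # two 4-way nodes

out_failures = expected_failures(scale_out, rates)
up_failures = expected_failures(scale_up, rates)
print(f"scale-out: {out_failures:.2f} expected failures per year")
print(f"scale-up:  {up_failures:.2f} expected failures per year")
```

With the same per-part rates, halving the part counts halves the expected number of failures in this model. The additional reductions discussed above come from the RAS features themselves, which this simple sum does not capture.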
For details, I recommend reading the Reliability, Availability, and Serviceability Features of the IBM eX5 Portfolio Redpaper.