For data center managers, high-performance computing environments pose special challenges. They are usually built with state-of-the-art hardware, they evolve rapidly, and the users of such environments are not only very smart but often demanding as well.
In both research and commercial HPC centers, there are big incentives to maximize efficiency. Running an efficient center is important for many reasons:
An old adage about management effectiveness is that “you can’t manage what you can’t measure.” It turns out that measuring efficiency in HPC centers is a tricky business. While organizations may employ excellent tools for discrete tasks like cluster management, user management or workload management, those tools are often “siloed”: each can monitor only the parameters it has visibility into.
In HPC environments, managed entities like users, groups, resources, workloads, software licenses and projects are interdependent and need to be gathered and considered together. Failing to do so can lead to erroneous conclusions, or make problems almost impossible to troubleshoot.
There are countless similar examples, but just to provide a couple:
Improving efficiency requires this type of multivariate analysis where many factors are considered together.
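As a rough illustration of why looking at one variable at a time can mislead, the sketch below groups a handful of job records first by queue alone and then by queue and software license together. The record fields, names and numbers are all made up for illustration; they are not an LSF schema or output from any IBM tool.

```python
from collections import defaultdict

# Hypothetical job records. Field names and values are illustrative only.
JOBS = [
    {"user": "alice", "queue": "normal", "license": "solverA", "pend_s": 1200},
    {"user": "bob",   "queue": "normal", "license": "solverA", "pend_s": 1500},
    {"user": "carol", "queue": "normal", "license": None,      "pend_s": 30},
    {"user": "dave",  "queue": "short",  "license": None,      "pend_s": 20},
]


def mean_pend_by(jobs, *keys):
    """Return the average pending time, grouped by one or more record fields."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of pend_s, job count]
    for job in jobs:
        group = tuple(job[k] for k in keys)
        totals[group][0] += job["pend_s"]
        totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Single-variable view: the "normal" queue simply looks slow.
by_queue = mean_pend_by(JOBS, "queue")

# Two-variable view: the wait is concentrated in jobs that need one license.
by_queue_license = mean_pend_by(JOBS, "queue", "license")

for group, seconds in by_queue_license.items():
    print(group, seconds)
```

In this toy data the queue-level average suggests the "normal" queue is congested, while the combined view points at contention for a single license, which would call for a different remedy than adding hosts.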
For centers running IBM Platform LSF, IBM Platform Computing provides exactly these types of tools.
Real-time, operational monitoring – IBM Platform RTM is a real-time dashboard for Platform LSF environments that monitors one or more clusters and provides extensive reporting, tracking, monitoring and alerting capabilities. With Platform RTM, users can identify and resolve problems faster, remove bottlenecks affecting efficiency, reduce administrator workload and improve service levels to users.
Analytic tools, trending analysis – Complementing the capabilities above is a related tool called IBM Platform Analytics. Rather than providing real-time information, Platform Analytics is a relational OLAP and visualization tool that gathers vast amounts of information about the operation of the HPC environment over time. Using Platform Analytics, analysts can answer business-level questions about their operations and simplify tasks like customized reporting, capacity planning, and usage-based chargeback accounting. A case study available at the IBM Platform Computing website explains how a pharmaceutical company was able to monitor SLAs and ensure that R&D communities were being properly served, while improving user satisfaction by providing visibility into workloads and resources across multiple data centers.
Both of these tools provide the ability to gather and correlate information related to users, groups, queues, jobs, hosts and applications.
When it comes to improving efficiency, tools matter. Only by monitoring and considering information holistically can administrators gain a clear view of their operational efficiency and how it can be improved.