Good question, and one that I am sure I don't have a complete answer for. But I have some thoughts on the subject based on working with customers over the years so here goes... In my experience the best IT metrics are:
- Aligned to business outcomes.
- Presented in a way that business users think.
- Able to inspire people to improve them.
Aligned to Business Outcomes
Well, duh! But it's amazing how many IT metrics look like they align with business interests at first glance but turn out not to. A good example is everybody's favorite, up-time, especially when expressed as a percentage. Now clearly 99.99% availability is better than 99.9% availability, or even 99.98%, and better numbers are worth striving for. But as an indicator of business value, these types of numbers are of limited use.
The problem in this example is that the business impact of outages is non-linear. A short outage doesn't really hurt, but a long one does. As an example, let's take 99.9% availability - a number that many people would regard as two nines short of a decent service level. You can achieve roughly 99.9% availability even if you have a 90-second outage every day. In most organizations, 90 seconds a day would probably count as irritating but would not represent a huge impact. On the other hand, a 45-minute outage every month, or an outage of close to 9 hours once a year, would certainly be noticed, even though all three patterns represent roughly the same overall availability.
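The arithmetic behind those figures can be checked with a minimal sketch. The function name `downtime_seconds` and the 30-day month and 365-day year are my assumptions for illustration, not part of any standard definition:

```python
def downtime_seconds(availability_pct: float, period_seconds: float) -> float:
    """Return the downtime a period can absorb at a given availability percentage."""
    return period_seconds * (100.0 - availability_pct) / 100.0


DAY = 24 * 60 * 60
MONTH = 30 * DAY   # assumption: a 30-day month
YEAR = 365 * DAY   # assumption: a non-leap year

per_day_seconds = downtime_seconds(99.9, DAY)
per_month_minutes = downtime_seconds(99.9, MONTH) / 60
per_year_hours = downtime_seconds(99.9, YEAR) / 3600

# Prints: 86.4 s/day, 43.2 min/month, 8.76 h/year
print(f"{per_day_seconds:.1f} s/day, "
      f"{per_month_minutes:.1f} min/month, "
      f"{per_year_hours:.2f} h/year")
```

So the 90 seconds, 45 minutes and 9 hours quoted above are rounded figures for the same 0.1% downtime budget spent in three very different ways.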
Many IT metrics - number of transactions that fail to complete, time to problem resolution, number of incidents per user, number of bugs per line of code etc. - suffer from the same problem. The impact is non-linear. Businesses only notice, and are only impacted by, the long tail in these statistical distributions, and the metrics that capture that tail are the ones that count.
Good metrics are the ones that respect this non-linearity. So to get good alignment we need to throw away the data that represents small failures and report only the failures that really have impact. In our up-time example, that means moving away from 99.xyz% and expressing the impact in terms of something like the number of outages over 3 minutes, 10 minutes, 30 minutes etc. Now I share the concern that this kind of metric is somewhat negative in its presentation, but I would argue that this is exactly what makes it (and others like it) a better-aligned measurement.
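As a rough sketch of what such a report could look like, the snippet below counts outages longer than each threshold. The function name `count_outages_over` and the sample durations are made up for illustration, and the 3, 10 and 30 minute thresholds are simply the ones suggested above; the right cut-offs will differ by application.

```python
from typing import Dict, Iterable, Sequence


def count_outages_over(durations_min: Iterable[float],
                       thresholds_min: Sequence[float] = (3, 10, 30)) -> Dict[float, int]:
    """Count how many outages lasted longer than each threshold (in minutes)."""
    durations = list(durations_min)
    return {t: sum(1 for d in durations if d > t) for t in thresholds_min}


# Hypothetical outage durations for one month, in minutes.
sample_outages_min = [0.5, 1.5, 2, 4, 7, 12, 45]

report = count_outages_over(sample_outages_min)
for threshold, count in report.items():
    print(f"outages over {threshold} min: {count}")
```

Note that the three short outages never appear in the report at all, which is the point: only the ones long enough to hurt get counted.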
Presented in a Way that Business Users Think
It is a long-stated goal of the IT profession that, one day, IT services will be just like electrical power or telephone dial-tone: they will simply be there and users will take them for granted. Well, I have news: users already do take IT for granted and assume it will be "just there" when it's needed. In fact, better than that, as I discussed above, there's even a tolerance for a certain amount of failure. Now the level of tolerance has its limits, obviously, and at some point business does start to suffer. The exact point where failure becomes a real issue differs from application to application, business to business, and user to user. But all users have one thing in common: they think of things in terms of failures, not successes. IT, by default, works most of the time: what you need to measure is how far away from "most of the time" you are getting.
There's nothing strange going on here; we all think like this. If I ask you whether you have been happy with your cellphone service this month, you will immediately try to recall if you had any problems. You may say you are unhappy based on how much failure occurred and the impact it had on you at the time, or you may say everything was basically okay despite a few dropped calls. Again, circumstances and expectations differ, but the point is it's all based on thinking about problems: what you will not do when asked this question is think about the 274 calls that went through without a problem.
So to get better aligned, throw out the "success" data and report problems. Nobody cares about success; they take it for granted. So provide only metrics that focus on your failures: not only will these types of metrics garner IT more respect for being honest about its failures, they will also provide numbers that much more realistically reflect how well, or otherwise, IT is supporting the business.
Metrics that Inspire
The third characteristic that a good metric has is that it should inspire people to take action to improve it. Some metrics are better than others at this.
I have long thought that it is only a certain type of quality wonk who gets excited about going from 99.9% to 99.99% to 99.999%. Numbers like this don't typically inspire. They don't inspire IT folks to get better, and they don't signal to the business that things are getting better. The same thing goes for getting from 5-sigma to 6-sigma: the improvement gets lost in the statistics, and nobody gets what the difference between a 5.3 last month and a 5.6 this month means. It could be huge in terms of business benefits but it looks tiny. We're human: bigger, simpler numbers are a lot easier to think about.
So, again, a good approach is to drop the statistics and go for the raw numbers, something I talked about in my post on the transaction factory model. So instead of 99.99% successful completion of transactions this week, report it as 1,352 failures. Getting that number to zero is much more inspiring than moving some statistical indicator incrementally upwards. It's also much easier for the business side of things to see improvement (or otherwise) with simple numbers.
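To make the conversion concrete, here is a minimal sketch that turns a success percentage back into a raw failure count. The weekly volume of 13,520,000 transactions is my assumption, chosen only because it is the volume at which 99.99% success corresponds to the 1,352 failures quoted above; the function name is likewise hypothetical.

```python
def failures_from_success_rate(total_transactions: int, success_pct: float) -> int:
    """Convert a success percentage into the number of failed transactions."""
    return round(total_transactions * (100.0 - success_pct) / 100.0)


# Assumed weekly volume, picked to match the 1,352 failures in the text.
weekly_transactions = 13_520_000

failures = failures_from_success_rate(weekly_transactions, 99.99)
print(f"99.99% success on {weekly_transactions:,} transactions = {failures:,} failures")
```

In practice you would count the failures directly rather than back them out of a percentage, but the sketch shows how much a comfortable-looking figure can hide.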
Of course, these numbers may look worse from the outside than the overall statistics do. There's something more comforting about reporting 99.99% success than saying you had 1,352 failures last week. But a good metric should prompt action, and action often only follows if you get out of the comfort zone. And, who knows, if you explain to your users exactly how bad things actually are, they may just start understanding why you keep asking for additional funding!