Learning the lessons of failure

As I write, Skype is down. Not just inaccessible because I am in the wrong place or on the wrong connection: the servers are offline. The conversation in the IT Pro office following this story has definitely piqued my interest, not because I am that worried about the availability of Skype, but because of the sheer scale of the gulf between the way public cloud services handle reliability and the way internal IT departments do it.

I was chatting with Joe Baguley (EMEA CTO at VMware) last week, while VMware were opening their new UK HQ. He coined a phrase which I think perfectly illustrates the gap I am talking about. He's had it, he said, with the term "Enterprise grade". How much lower is that, he asked, than "consumer grade"? Consumers have an expectation of reliability which is effectively infinite. An instant of downtime produces a storm on Twitter (so long as it's not Twitter that's down), in which people focus far more on the impact on themselves than on the best way to help the service get back online as fast as possible. A depressing number of people think that this consists of shouting at uninvolved help-line operators, as if force of personality speeds up resumption of service.

Inside IT, the approach could not be more different. Here "Enterprise grade" takes on more meaning and shows more detail than it does in a consumer marketplace like Skype's. Enterprises are now so reliant on their IT that the IT service guys have the right to announce downtime, undertake scheduled maintenance, and even have prolonged outages in pursuit of major reconfigurations, because the 100 per cent continuous availability option costs the business more money than simply accepting a few hours in which the workforce or e-commerce sites do no business.

I am finding that some enterprise-grade types are now taking an even more aggressive attitude. I used to try to arrange weekend working when I needed to take the lid off some core part of a business's infrastructure, to avoid the two-hour delay while the humble IT support team begged and pleaded with the boisterous userbase for outage time. These days I am coming across a lot of "they will do something else while it's down": users who have adapted to the simple facts of life around service reliability. If they get a popup that says "no email" then they just go and do something else.

Of course this requires quite a bit of adaptive system design before it becomes a reality. Some of it is about spreading work across multiple software tools and sites. Other enabling factors include a more tolerant attitude to storage and connections on the part of widely used business tools, the most obvious of which is humble Microsoft Word. Putting a modern version of Word out in a business provides opportunities for reducing the effort required to be 100 per cent fault tolerant in the data centre, in ways that really have nothing to do with the simple functional brief of being a word processor.

Which ought to leave us a very long way from Skype (still) being down, but it doesn't, if you consider the possibly rash decision made by Microsoft to brand their mould-breaking corporate telephony product as "Skype for Business" instead of plain old Lync. One might imagine that, in the world of modern virtualisation and Microsoft's recent announcement of a Linux for Azure, "Skype for Business" might be an encapsulation of a Skype-for-consumer server with a few bells and whistles plugged in. But it's not. S4B is a fully evolved Windows internal enterprise stack application. To run properly it needs a well-maintained internal LAN, a properly configured Windows domain and a little sprinkling of other licences for parts of the Microsoft catalogue. There are in fact only the most tenuous of connections between the Skype out on the net and the Skype that falls out of the box into the lap of an enterprise IT infrastructure team.

This might well mean that the internal enterprise product is, in reality, more resilient than the internet one. As a corporate IT person you can make decisions about how to "run broken", and about what constitutes the right investment: not 100 per cent capacity 100 per cent of the time (a simple target that's remarkably hard to achieve), but a variable level of service which is known to be restricted in difficult circumstances.

That kind of result is both more realistic, and more involving, than the kind of stiff, nothing-to-say communication which consumers seem to want from their critical, no-alternative, always-on internet services. There is a lesson here that IT can teach the internet world, both in design for failure and in the right to be honest and transparent while failures are in effect. This applies especially to person-to-person communications services. If there is one lesson I have learned, it is that IT traditions which come from the file-sharing, database-querying and even website parts of our business are no guide to the way that people react when something to do with telephony or person-to-person connections starts to act up. It's "emotions first, intellect if you're lucky" time.

Something which I suspect Microsoft are discovering for themselves, today.