London air traffic failure: NATS' lessons are not blue-sky thinking
Last week's air traffic control outage offers both lessons and reassurance for anyone running critical IT systems.
Inside the Enterprise: Late last week, airspace over southern England was strangely quiet.
For several hours, aircraft movements were restricted, following problems with the air traffic control system run by NATS, at its Swanwick centre.
The failure was traced to the air traffic IT system, and specifically to a component called the Flight Data Controller.
According to reports, the problem was caused by just a single line of code, which has now been fixed. But the knock-on effects, in terms of disruption to travellers and to airlines, could turn out to be extensive.
NATS has already said that it will face financial consequences as a result of the outage: it will refund air traffic control charges to airlines, and the airlines themselves will no doubt face compensation claims from irate passengers.
NATS, though, has also announced an independent inquiry into the outage on 12 December, including an examination of the root causes and whether the system had enough in-built resilience. In particular, the company will examine whether there need to be "further measures to avoid technology or process failures in this critical national infrastructure and reduce the impact of any unavoidable disruption".
NATS has suffered problems before, notably an extensive outage in 2013. The inquiry will also ask whether NATS had fully learned the lessons from that incident.
However, at one level, inconvenient though delayed flights are, NATS' systems worked exactly as they should. Airspace was not, in fact, closed, but aircraft movements were restricted: a necessary safety measure if controllers do not have all the information they would usually use to control flights.
The key to a system such as air traffic control, where lives are at stake, is for fail-over to backup systems to happen smoothly and seamlessly and, if a system's performance does degrade, for that to happen in a controlled, rather than sudden, manner. The actual outage at NATS was just 45 minutes long, even though the disruption inevitably lasted rather longer.
NATS' experience, in fact, provides three lessons for anyone running mission-critical IT.
The first is to build systems that can tolerate faults or, if they do fail, whose performance will "degrade gracefully". This, of course, is an engineering issue, and one that also requires investment.
The second is the importance of communications. Maintaining a flow of information to customers, shareholders and other stakeholders is vital. Rumours spread at the speed of social media, and a clear public communication channel, including making senior managers available to the media, is essential if an organisation is to stay in control.
The third is to investigate the causes once the outage is over, and to act on the findings.
With its announcement of an independent inquiry into the problems at Swanwick, NATS has shown that it is, at the very least, following these three steps. And, with businesses running critical infrastructure facing growing scrutiny by regulators, all industries should draw lessons from this latest IT-related outage.
Stephen Pritchard is a contributing editor at IT Pro.