The BA Computer Failure

The recent failure of British Airways' computer systems in the UK shows how critical IT systems are to an airline. It also raises many questions that remain unanswered.

Early reports attributed the outage to a power failure. Then, in an interview today on BBC News, the Chief Executive of BA said it was a power surge and that the secondary systems did not start.

That raises even more questions, such as:

There is obviously a single point of failure in their systems. Has it been identified, and what is being done to address it? Single points of failure like this should have been identified and addressed beforehand by providing alternate power paths to the critical equipment.

If it was a power surge, how could a single surge take out all of BA’s systems? Power surges are not unknown and should have been planned for. Most, if not all, high-quality UPS systems have surge filters built in. If the area where the data centre or critical equipment is located is known for power fluctuations, then quality filtering and surge protection equipment should have been installed. The data centres should also be designed with alternate power paths all the way from the main power supply, through the switchboards and UPSs, to the servers and switches, which should have dual power supplies.
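The idea of alternate power paths can be checked mechanically. The following is a minimal sketch, assuming the power distribution is modelled as a simple directed graph from the main supply to a server. The component names and both layouts are hypothetical and are not taken from BA's actual design. It reports every intermediate component whose loss would cut the server off from power.

```python
from collections import deque


def reachable(graph, source, target, removed=None):
    """Return True if power can flow from source to target with `removed` out of service."""
    if removed in (source, target):
        return False
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph, source, target):
    """List intermediate components whose failure alone disconnects target from source."""
    nodes = set(graph) | {n for outs in graph.values() for n in outs}
    return sorted(
        n for n in nodes - {source, target}
        if not reachable(graph, source, target, removed=n)
    )


# Hypothetical layout 1: two UPSs, but both are fed by one switchboard.
single_path = {
    "mains": ["switchboard_a"],
    "switchboard_a": ["ups_a", "ups_b"],
    "ups_a": ["server"],
    "ups_b": ["server"],
}

# Hypothetical layout 2: independent paths feeding a dual-supply server.
dual_path = {
    "mains": ["switchboard_a", "switchboard_b"],
    "switchboard_a": ["ups_a"],
    "switchboard_b": ["ups_b"],
    "ups_a": ["server"],
    "ups_b": ["server"],
}

print(single_points_of_failure(single_path, "mains", "server"))
print(single_points_of_failure(dual_path, "mains", "server"))
```

In the first layout the shared switchboard is flagged even though the UPSs are duplicated. The sketch treats the main supply itself as a given, so a real assessment would also need to consider generators or a second utility feed.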

How did the surge occur? If it came from the electricity supplier, why was it not a widespread incident across the country? Was the surge caused by a person working in the data centre? If so, were they qualified to work there, and was the work scheduled and approved by IT management?

Then BA’s disaster recovery centres need to be considered. The assumption here is that BA does have secondary sites. Why did it take so long to switch over to them? When were the DR systems, and the failover to them, last tested? Critical operations such as airlines need an almost instantaneous switchover capability between their primary and secondary sites. In the worst case, it should take an hour.
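The two DR questions above, when failover was last tested and how long switchover takes, lend themselves to a simple automated check. This is a rough sketch only: the one-hour limit comes from the worst case stated above, while the 180-day test interval, the function name and the sample dates are all assumptions made for illustration.

```python
from datetime import date, timedelta

RTO = timedelta(hours=1)             # worst-case switchover time stated above
MAX_TEST_AGE = timedelta(days=180)   # assumed testing policy, not from any standard


def review_drill(last_tested, switchover_time, today):
    """Return a list of findings for one site's most recent failover drill."""
    findings = []
    if today - last_tested > MAX_TEST_AGE:
        findings.append("failover test overdue")
    if switchover_time > RTO:
        findings.append("switchover exceeded RTO")
    return findings


# Hypothetical drill records, invented purely for illustration.
today = date(2017, 5, 29)
print(review_drill(date(2016, 1, 10), timedelta(hours=3), today))
print(review_drill(date(2017, 4, 1), timedelta(minutes=20), today))
```

A check like this only has value if the drill records come from real, scheduled failover tests rather than paper exercises.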

Many questions need to be answered. We find it extremely hard to see how a power surge could take out the entire operations of an airline. Resilience has to be built in. BA is the national carrier of the UK, and the government needs to step in and commission an independent inquiry carried out by IT professionals who know how to build resilient data systems. Something is terribly wrong in the design and operation of their disaster recovery. The cost of building resilient data services would have been far less than the cost of the BA IT failure.

Resource

Standby are able to carry out your Data Centre Risk Assessment