The one question that every CIO should ask themselves… What are you going to do when (not if) your cloud systems fail?
Frank started the conversation with this response to my tweet about Azure:
Frank: “Exactly the type of thing that reinforces CIO fears about cloud…”
Stuart: “working on the assumption that cloud outages are inevitable… I feel it’s how vendors respond that will give CIO’s confidence”
Frank: “No, fewer outages will give confidence…”
Stuart: “I’ll meet you half way… Fewer outages and proper service management around problems when they do happen…”
Frank makes the point that some of his CIO contacts were livid following this outage. And this is where this post really starts, as I challenged Frank as to exactly who they were livid at, on the basis that overall accountability for a company’s IT systems, whether they be on premise or in the cloud, lies with the CIO.
Stuart: “as CIO you’re accountable for everything as you choose to use cloud or not!”
Alongside the Azure thread there was a parallel thread running on cloud security that had been started by Dennis Howlett in his Accman blog.
“Anything that connects to a network is vulnerable. That includes EVERY cloud player, regardless of the service they offer. What matters is the extent to which vulnerabilities exist AND are capable of exploitation.”
Let me share my belief here: these two topics are intrinsically linked. When you’re appointed as a CIO you’re trusted to deliver competitive advantage for your company through IT. Now, it doesn’t take a rocket scientist to work out that if you can’t maintain availability and adequate security of your systems then you’ll only manage to deliver disadvantage, and you probably won’t be around very long.
So, let’s get back to the title of the post… what are you going to do when your systems fail (which is inevitable)?
If you’re running in house, the apps themselves (if they are decent apps) are the least likely to fail; failures are more likely to come from switches, disks, networks, cables and other parts of the infrastructure. You protect yourself against this by designing your datacentre(s) around redundancy with zero single points of failure.
If you’re running cloud services, you pick a reputable supplier who works with a reputable hosting partner, right? Well, yes, but as we saw with Azure yesterday (and previously with Amazon and Rackspace and most other reputable cloud vendors) the same hardware failure points exist in cloud provider datacentres as they do in your own. If you appreciate and accept this then you’ll also be mindful that you could be introducing a single point of failure in your enterprise platform and that your service availability is now at the mercy of their service availability.
When you’re running outside of your own bricks and mortar you also need a high bandwidth and high availability WAN, firewalls, proxies, etc. that all need to be fault tolerant and designed around redundancy to ensure adequate access and security at all times. Even then you can’t mitigate against someone digging up the cable, which has happened to me twice this year and is more common than you might expect.
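To put rough numbers on the points above, here is a minimal Python sketch of how availability combines when components sit in a chain versus when they are duplicated. The figures (99.9% for the cloud provider, 99.5% for a single WAN link) and the helper names are illustrative assumptions, not measurements from any vendor, and the arithmetic assumes failures are independent of each other.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def series(*availabilities):
    """Every component in the chain must be up, so availabilities multiply."""
    return prod(availabilities)


def redundant(*availabilities):
    """Only one duplicate needs to be up (assumes independent failures)."""
    return 1 - prod(1 - a for a in availabilities)


def downtime_hours(availability):
    """Expected hours of outage per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Hypothetical figures, chosen only for illustration.
cloud = 0.999  # cloud provider's service
wan = 0.995    # one WAN link between you and the provider

# Cloud service reached over a single WAN link: two single points of failure.
single_path = series(cloud, wan)

# Same cloud service, but with two independent WAN links.
dual_wan = series(cloud, redundant(wan, wan))

print(f"single WAN link: {single_path:.4%}, {downtime_hours(single_path):.1f} h/year down")
print(f"dual WAN links:  {dual_wan:.4%}, {downtime_hours(dual_wan):.1f} h/year down")
```

With these assumed figures the single-link design works out to roughly 52 hours of expected outage a year and the dual-link design to roughly 9, almost all of which is the cloud provider itself. The independence assumption is the weak spot: two links that share one duct fail together when someone digs up the cable, so the duplicated path is only as good as its physical separation.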
Is this a story of cloud bashing? No, it isn’t; it’s a story of how the CIO needs to take full accountability for managing risk within their platform.
- If you’re running mission critical systems and your business can’t afford any outage then you simply can’t design a single point of failure into your enterprise platform.
- If you’re running non mission critical systems, then you may choose to take a little more risk around availability and accept a single point of failure and manage any disruptions that may arise.
What you deem to be mission critical or not is your own decision and it doesn’t have to be one or the other. For my part I run a hybrid platform where some parts are mission critical and some parts less so and the platform design and location of services (in house vs. cloud) reflects this.
Of course, from a customer perspective, people outside of IT expect things to work 100% of the time, and if you’re running either of the above, or a combination, then any outage, no matter what, damages your credibility with users.
So as an effective CIO, you need to design an effective platform around what your business needs, you need to manage the risk, you need to pick the suppliers that you work with, and you need to take full accountability when things go wrong. Yes you can get livid with your suppliers, but just remember who picked them and remember who chose to introduce a single point of failure into your platform in the first place.
So, what are you going to do when (not if) your cloud systems fail? Make sure you know the answer today.
Footnote: This post relates to large enterprise businesses and the role of the CIO. The point I’m trying to make is that you have to plan for failure to guarantee success.
Part of this cross posted here