Blackberry Downtime – What Could They Do Differently?

Posted under Technology on 14 October 2011 @ 10:33pm.
Blackberry Outage, October 2011

No doubt you have heard about the recent Blackberry outage, which spanned three days from Monday 10th to Wednesday 12th October. As a Blackberry user I was hit, although not as badly as some people, given that I don’t use it for business purposes. Despite that I still got annoyed over the reasons it went wrong, because it just shouldn’t have happened the way it did. Even RIM themselves acknowledge this and admit that a lot more could have been done to prevent it. I’m certain they will be looking into how to prevent future occurrences.

So what exactly stopped working? Everything. All e-mail, internet, Blackberry Messenger and everything else data related, such as the Facebook and Twitter apps. The reason is that all data is routed through the Blackberry servers, not just some of it. It wasn’t until the outage that I realised all data went through the BB servers; I had assumed browsing went through your provider’s network as usual. Wifi browsing still worked fine, which is great at home but useless if you’re away from a wifi network. What also annoys me about BBs is that you can’t use e-mail unless it goes through RIM’s servers. I’d prefer the Android approach, where your standard data connection is used and your phone handles it all, but I guess that’s what’s unique about the BB service.
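
To illustrate the difference, here’s a rough Python sketch of what Android-style e-mail boils down to. It’s purely illustrative (the host and credentials are placeholders, not anything RIM or Android actually ship): the handset talks IMAP directly to the mail server over whatever data connection it has, with no vendor relay in the middle to fail.

```python
import imaplib

# Placeholder account details - substitute your own mail provider's settings.
IMAP_HOST = "imap.example.com"
USERNAME = "user@example.com"
PASSWORD = "app-password"

def check_unread():
    """Poll the mail server directly over the normal data connection.

    No third-party relay sits in the middle, so an outage at somewhere
    like RIM's NOC would not stop this from working.
    """
    conn = imaplib.IMAP4_SSL(IMAP_HOST)
    try:
        conn.login(USERNAME, PASSWORD)
        conn.select("INBOX", readonly=True)
        status, data = conn.search(None, "UNSEEN")
        if status == "OK":
            unread = data[0].split()
            print("%d unread message(s)" % len(unread))
    finally:
        conn.logout()

if __name__ == "__main__":
    check_unread()
```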

I spent a little time trying to figure out whether there was a way around this for browsing, i.e. using the provider’s network as normal (in my case O2). I tried manually inputting the APNs but this did nothing. I did discover that tethering still worked as intended, which was good to know, as the only time I use my BB a lot is when away on business (which is rare in itself).

Over the 3 days I found myself looking at the same news article over and over again, explaining what was happening and when it would be resolved. It was all the same: “It’s being worked on”. That was it; there was no detail at all. At that point I didn’t know whether they knew what the issue was or whether they were trying to avoid telling people. Regardless, in the end it all came out and the problem was pinned down to a “faulty core switch”.

Immediately, alarm bells started ringing: why didn’t the system have redundancy? It was supplying a service to tens of millions of people around the world; surely there was some form of backup? Well, according to RIM there was, except it didn’t work “as previously tested”. Fact is, you can’t fully simulate a failure like this. It’s not always a case of the switch dropping off the network and another taking over instead; sometimes the switch fails in a way you can’t predict, causing a flurry of data that wreaks chaos throughout the software. That in turn can result in the failover not kicking in, because the system isn’t capable of detecting the fault as an error. I don’t pretend to know how these switches work, but that’s my interpretation of the articles I read about failover networking.
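
To make that concrete, here’s a rough sketch (Python, with made-up addresses; RIM’s real monitoring will be nothing like this) of the kind of naive failover check that only notices a switch that has died completely. A switch that still accepts connections but floods or corrupts traffic passes this check, so the backup never takes over, which is roughly the failure mode being described.

```python
import socket
import time

# Hypothetical addresses standing in for a primary core switch and its backup.
PRIMARY = ("10.0.0.1", 22)
BACKUP = ("10.0.0.2", 22)

def is_reachable(addr, timeout=2.0):
    """Crude health check: can we open a TCP connection at all?"""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

def monitor():
    """Run forever, switching to the backup only on a total failure.

    A switch that still answers connections but misbehaves looks
    'healthy' to this check, so no failover is ever triggered.
    """
    active = PRIMARY
    while True:
        if not is_reachable(active):
            active = BACKUP if active == PRIMARY else PRIMARY
            print("Failing over to", active)
        time.sleep(5)

if __name__ == "__main__":
    monitor()  # runs until interrupted
```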

So back to the title of this post: what could they do differently? Let’s start with the basics of data access. There needs to be some form of failover for data services. I can accept that, because e-mail is controlled by the BB servers, it will be offline, but how about routing internet traffic temporarily through your provider’s standard network? This at least gets people online and able to check their e-mail via webmail, for example.
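
As a rough illustration of that idea, here’s a sketch in Python (the relay address and port are invented, since the real BIS infrastructure isn’t public): try the Blackberry relay first, and if it can’t be reached, fall back to fetching directly over the provider’s own network.

```python
import urllib.request

# Hypothetical relay endpoint - the real Blackberry relay address is not public.
BB_PROXY = "http://relay.blackberry.example:3128"

def fetch(url):
    """Try the Blackberry relay first; fall back to the carrier's own route."""
    proxied = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": BB_PROXY})
    )
    try:
        return proxied.open(url, timeout=5).read()
    except OSError:
        # Relay is down: go straight out over the provider's network,
        # which is the temporary fallback suggested above.
        direct = urllib.request.build_opener(
            urllib.request.ProxyHandler({})  # empty dict = no proxy at all
        )
        return direct.open(url, timeout=5).read()

if __name__ == "__main__":
    print(len(fetch("http://www.example.com/")), "bytes fetched")
```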

Secondly, the infrastructure itself, as I heard it described, is too centralised. It needs to be spread out so that each country has, and uses, its own infrastructure. This outage took down practically the whole of Europe, plus more; even the USA and Canada were seeing outages. By splitting it on a per-country basis they prevent the outage from spreading too far, limiting damage and hopefully reducing the recovery time. You could go further and split this into multiple data centers in each country as well, but that starts to get costly, so let’s just consider one data center per country with its failover being fed into a nearby country’s data center. This would be ideal, IF the failover works properly when a switch fails.
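
Something like the following sketch is what I have in mind (Python; the country codes, data center names and pairings are all invented for illustration): each country is homed on its own data center, with a nearby country’s data center as its failover, so a single failure only affects the countries homed on that site rather than a whole continent.

```python
# Hypothetical topology: each country maps to a primary data center and a
# nearby country's data center as the failover.
TOPOLOGY = {
    "GB": {"primary": "dc-london", "failover": "dc-amsterdam"},
    "NL": {"primary": "dc-amsterdam", "failover": "dc-frankfurt"},
    "DE": {"primary": "dc-frankfurt", "failover": "dc-amsterdam"},
}

def pick_data_center(country, healthy):
    """Return the data center a handset in `country` should use.

    `healthy` is the set of data centers currently passing health checks.
    """
    entry = TOPOLOGY[country]
    if entry["primary"] in healthy:
        return entry["primary"]
    if entry["failover"] in healthy:
        return entry["failover"]
    raise RuntimeError("No healthy data center available for " + country)

# Example: London is down, so UK traffic moves to Amsterdam.
print(pick_data_center("GB", healthy={"dc-amsterdam", "dc-frankfurt"}))
```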

It all sounds simple, but I know it most certainly isn’t. I’d like to think myself technical enough to understand the core levels of networking, but I’m not Cisco certified, nor am I anywhere near it. RIM can’t afford another outage like this, as their revenue is already falling, and three days without service only adds disgruntled customers to the mix, so any and all attempts to avoid a repeat are crucial.

So to summarise, what I think RIM need to do as a part of their strategy is:
1. De-centralise their infrastructure across multiple countries.
2. Make each country’s failover a nearby country’s BB data center.
3. Formulate a failover for data services (browsing, etc.) that uses the provider’s standard network during an outage.
4. Allow users to choose whether their e-mail uses BB’s servers or standard IMAP, like Android devices do.
5. Keep users informed if an outage occurs!