Yesterday, the unimaginable happened: Facebook, Instagram, and WhatsApp went down simultaneously. While we all sat twiddling our thumbs, wondering how we could possibly interact with friends, family, and people we haven't seen since school, the engineers behind Facebook’s servers were in crisis. So, what really happened?
The information we currently have comes from leaks by people claiming to be ‘insiders’, a brief and ambiguous blog post published by Facebook itself, and a brilliant write-up by Cloudflare, a web infrastructure company.
Understanding DNS and BGP
To those on the outside, it simply looked like Facebook had disappeared from the Internet. Users were getting an error when trying to reach the website, and the servers were completely unreachable. For a company as well-established as Facebook, this is a seriously rare occurrence. We now know that this downtime was the result of a configuration change to Facebook’s ‘backbone’ routers, which carry traffic between its data centers. Communication stopped between data centers, and all their services stopped. This was then compounded by an unfortunately timed failure of the building card readers, which allegedly prevented employees from accessing the buildings and fixing the issue.
Let’s delve into the timeline and understand just what went wrong, at least from the outside. Facebook, much like every other network on the Internet, relies on advertising itself so that traffic can find it. To do so, it uses the Border Gateway Protocol (BGP). BGP is the mechanism by which networks tell each other which IP addresses they can reach, and so it decides the routes data will travel across the Internet, much like a postal service decides how your mail reaches another country. Without BGP, the networks that make up the Internet would have no way of knowing how to reach one another.
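To make the postal analogy a little more concrete, here is a minimal sketch of how a router might pick a path once routes have been advertised to it. It is a toy illustration, not how BGP itself is implemented: the routing table, the peer names, and the `next_hop` helper are all made up for the example, and the address ranges are the ones reserved for documentation.

```python
import ipaddress

# Hypothetical routing table: announced prefix -> next hop label.
# The prefixes come from reserved documentation ranges, not real networks.
ROUTES = {
    ipaddress.ip_network("192.0.2.0/24"): "peer-A",
    ipaddress.ip_network("192.0.2.128/25"): "peer-B",
    ipaddress.ip_network("198.51.100.0/24"): "peer-C",
}


def next_hop(destination, routes):
    """Return the next hop for the most specific matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in routes if addr in net]
    if not matches:
        return None  # nobody announced a route: the destination is unreachable
    # Longest-prefix match: the most specific announcement wins.
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[best]
```

With this table, `next_hop("192.0.2.200", ROUTES)` picks `peer-B` because the /25 is more specific than the /24, while an address that nobody has announced, such as `203.0.113.5`, returns `None`. That second case is roughly what the rest of the Internet saw when Facebook's announcements vanished.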
Another integral part of the Internet is the Domain Name System (DNS). DNS is the Yellow Pages of the Internet; it translates the names we can read and recognize into the numeric addresses that computers use. For example, we type ‘www.facebook.com’, but the Internet reads “126.96.36.199” (among others) because DNS servers have kindly translated it; otherwise we would be navigating an unintelligible mess of numbers.
How these then work together is as follows: if you Google ‘Facebook’, it displays ‘www.facebook.com’ to the user. When you click through, DNS servers translate that domain name into an IP address, and the routes that BGP has advertised carry your request across the Internet to Facebook’s network. That’s a lot of acronyms, I know.
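If you want to see this translation for yourself, a few lines of Python will do it. This is only a sketch using the standard library's `socket.getaddrinfo`; the `resolve` helper name is my own, and the addresses you get back will depend on where and when you run it.

```python
import socket


def resolve(hostname):
    """Return the sorted, de-duplicated IP addresses for hostname, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # The name could not be resolved, e.g. no DNS server answered.
        return []
    # Each entry ends with a socket address whose first field is the IP string.
    return sorted({info[4][0] for info in infos})


# Example (requires network access):
# print(resolve("www.facebook.com"))
```

An empty list is what a lookup failure looks like from this side of the screen: the name exists, but nobody is able to answer for it.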
Why did Facebook go down?
Back to the Facebook outage. When the configuration was changed on Facebook’s routers, Facebook stopped announcing the routes to its DNS servers, indicating there was an issue with BGP. Some Facebook IP addresses were still functioning, but with no reachable DNS servers to translate names into them, they were essentially useless. From what we currently know, Facebook knocked out its own BGP announcements, effectively removing itself from the Internet.
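The chain of failure described above can be sketched in a few lines. Everything here is hypothetical and simplified: the name server address, the single record in `ZONE`, and the two sets of announced prefixes are invented for illustration and are not Facebook's real configuration.

```python
import ipaddress

# Hypothetical data: a name server address and the one record it serves.
NAMESERVER_IP = "192.0.2.53"
ZONE = {"www.example.com": "198.51.100.10"}


def reachable(ip, announced):
    """True if some announced prefix covers the address."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(prefix) for prefix in announced)


def resolve(name, announced):
    """Return the IP for name, or None if the name server cannot be reached."""
    if not reachable(NAMESERVER_IP, announced):
        return None  # no route to the name server, so no answer
    return ZONE.get(name)


before = {"192.0.2.0/24", "198.51.100.0/24"}
after = {"198.51.100.0/24"}  # the name server's prefix has been withdrawn
```

With `before`, `resolve("www.example.com", before)` returns the web server's address. With `after`, it returns `None`, even though `reachable("198.51.100.10", after)` is still `True`: the web server is up and routable, but nobody can look up where it lives.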
It only got worse from there. While engineers tried to reach the data centers and fix it, it appeared that they had lost access. See, when they swipe their keycards to gain entry to the Facebook buildings, the recognition system reportedly checks the card against Facebook's own servers. With those servers unreachable, the engineers could not get into the buildings to fix the problem.
"As many of you know, DNS for FB services has been affected and this is likely a symptom of the actual issue, and that's that BGP peering with Facebook peering routers have gone down, very likely due to a configuration change that went into effect shortly before the outages happened (started roughly 1540 UTC)," wrote a supposed Facebook "insider" on Reddit, before deleting the post.
"There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do, so there is now a logistical challenge with getting all that knowledge unified."
Hours later, BGP activity was restored and DNS servers began resolving domain names into IP addresses once more. Facebook’s services were down for around six hours, but the headache for employees will undoubtedly last much longer.