More details about the October 4 Facebook outage

Today the team at Facebook Engineering posted the following detailed explanation of yesterday’s outage. It’s worth a read.

Facebook Engineering — Now that our platforms are up and running as usual after yesterday’s outage, I thought it would be worth sharing a little more detail on what happened and why — and most importantly, how we’re learning from it.

This outage was triggered by the system that manages our global backbone network capacity. The backbone is the network Facebook has built to connect all our computing facilities together, which consists of tens of thousands of miles of fiber-optic cables crossing the globe and linking all our data centers.

Those data centers come in different forms. Some are massive buildings that house millions of machines that store data and run the heavy computational loads that keep our platforms running, and others are smaller facilities that connect our backbone network to the broader internet and the people using our platforms.

When you open one of our apps and load up your feed or messages, the app’s request for data travels from your device to the nearest facility, which then communicates directly over our backbone network to a larger data center. That’s where the information needed by your app gets retrieved and processed, and sent back over the network to your phone.

The data traffic between all these computing facilities is managed by routers, which figure out where to send all the incoming and outgoing data. And in the extensive day-to-day work of maintaining this infrastructure, our engineers often need to take part of the backbone offline for maintenance — perhaps repairing a fiber line, adding more capacity, or updating the software on the router itself.

This was the source of yesterday’s outage. During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.

This change caused a complete disconnection of our server connections between our data centers and the internet. And that total loss of connection caused a second issue that made things worse.

The author goes on to describe the cascading DNS, BGP and internal data center problems that followed. It’s really worth a read. MORE@Facebook Engineering

SVC Online Editor
