Facebook’s Big Outage – Blame Faulty Router Configuration! No Cyberattack!


As of Mon. night, Facebook had come back from what may have been its longest blackout ever, & apologised for the mass outage that left billions of users locked out of Facebook, Instagram, WhatsApp, Messenger & Oculus VR for about 6 hours.

One easily disproved conspiracy theory linked the 6-hour outage to a supposed data breach tied to a Sept. 22 hacker forum ad for 1.5B Facebook user records.

Sincere Apologies

Sincere apologies to everyone impacted by outages of Facebook powered services right now. We are experiencing networking issues & teams are working as fast as possible to debug and restore as fast as possible

— Mike Schroepfer (@schrep) October 4, 2021

In a Mon. night blog post, Santosh Janardhan, Facebook’s VP of Infrastructure, gave some details about how it all came down, confirming the Border Gateway Protocol (BGP) & DNS problems that experts at Cloudflare had already detected.

Configuration Change

Janardhan stated that the company’s engineering team had traced the problem to a configuration change on the backbone routers that coordinate network traffic between Facebook’s data centres, a change that disrupted the company’s key internal systems.

“Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centres caused issues that interrupted this communication,” Janardhan wrote.

“This disruption to network traffic had a cascading effect on the way our data centres communicate, bringing our services to a halt.”

That, in short, was Janardhan’s account: no cyberattack, no compromised user data, just Facebook ‘shooting itself in the foot’ by mistake.

Back Online

Our services are now back online and we’re actively working to fully return them to regular operations. We want to make clear at this time we believe the root cause of this outage was a faulty configuration change. We also have no evidence that user data was compromised as a result of this downtime.

When it comes to gauging Facebook’s worst blackout ever, accounts vary: CNBC reported that Mon.’s outage was the longest downtime Facebook had experienced since 2008, when a bug knocked its site offline for about a day, affecting some 80m users. (Facebook’s user base has since grown to 3 billion users.)

Worst Outage Ever

In 2019, a 14-hour blackout was deemed “catastrophic” & called the “worst outage ever.” That 2019 outage was similarly blamed on a server configuration change.

In Mon. evening’s post, Janardhan apologised to “all the people & businesses around the world who depend on us,” explaining that recovering systems took so long because Facebook’s internal tools were also affected.

“We are sorry for the inconvenience caused by today’s outage across our platforms. We’ve been working as hard as we can to restore access, & our systems are now back up & running.

“The underlying cause of this outage also impacted many of the internal tools & systems we use in our day-to-day operations, complicating our attempts to quickly diagnose & resolve the problem.” —Santosh Janardhan

How Did Facebook Disappear?

On Mon., Cloudflare Engineering Director Celso Martinho & edge network technical lead Tom Strickx published a more detailed account of what happened, including BGP’s role in keeping Facebook’s content flowing to the masses.

“It’s a mechanism to exchange routing information between autonomous systems (AS) on the Internet,” they wrote.

“The big routers that make the Internet work have huge, constantly updated lists of the possible routes that can be used to deliver every network packet to their final destinations. Without BGP, the Internet routers wouldn’t know what to do, & the Internet wouldn’t work.”
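To make the idea of a routing table concrete, here is a minimal Python sketch of how a router might choose where to send a packet: among all the prefixes in its table that contain the destination address, it picks the most specific one (longest-prefix match). The table below is hypothetical; the prefixes come from the RFC 5737 documentation ranges and the AS numbers from the RFC 5398 documentation range, so none of them are Facebook’s or anyone else’s real routes. Real BGP routers also weigh path attributes before installing a route, which this sketch leaves out.

```python
import ipaddress
from typing import Dict, Optional

# Hypothetical routing table: destination prefix -> next-hop autonomous system.
# Prefixes come from the RFC 5737 documentation ranges and ASNs from the
# RFC 5398 documentation range; they are illustrative, not real Internet routes.
ROUTING_TABLE: Dict[ipaddress.IPv4Network, str] = {
    ipaddress.ip_network("203.0.113.0/24"): "AS64501",
    ipaddress.ip_network("203.0.113.128/25"): "AS64502",
    ipaddress.ip_network("198.51.100.0/24"): "AS64503",
}


def next_hop(destination: str) -> Optional[str]:
    """Pick the route with the longest (most specific) prefix containing the address."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in ROUTING_TABLE if addr in net]
    if not matches:
        return None  # no route: the router has nowhere to send the packet
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTING_TABLE[best]


for dst in ["203.0.113.200", "203.0.113.5", "198.51.100.7", "192.0.2.1"]:
    print(dst, "->", next_hop(dst))
```

Here 203.0.113.200 falls inside both the /24 and the /25, and the more specific /25 wins, so it goes to AS64502. 192.0.2.1 matches nothing, so the lookup returns None: the “wouldn’t know what to do” case the Cloudflare engineers describe.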

Network of Networks

They described the Internet as, literally, a network of networks, bound together by BGP. “BGP allows 1 network (say Facebook) to advertise its presence to other networks that form the Internet,” the Cloudflare experts wrote. During the outage, Facebook wasn’t advertising its presence, meaning that ISPs & other networks couldn’t find Facebook’s network.
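The effect of a network no longer advertising its presence can be sketched with a toy model in Python: one network announces its prefixes to a peer, the peer records them, and when the network withdraws them, the peer simply has no route left. This is a deliberate simplification under stated assumptions. The `Router` and `Network` classes, the AS number and the prefix are all hypothetical, and real BGP exchanges UPDATE messages across many hops rather than calling methods on a single peer.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Router:
    """A toy view of one BGP speaker's routing table (RIB): prefix -> origin AS."""
    name: str
    rib: Dict[ipaddress.IPv4Network, str] = field(default_factory=dict)

    def on_announce(self, prefix: str, origin_as: str) -> None:
        self.rib[ipaddress.ip_network(prefix)] = origin_as

    def on_withdraw(self, prefix: str) -> None:
        self.rib.pop(ipaddress.ip_network(prefix), None)

    def can_reach(self, destination: str) -> bool:
        addr = ipaddress.ip_address(destination)
        return any(addr in net for net in self.rib)


@dataclass
class Network:
    """A toy autonomous system that advertises its own prefixes to its peers."""
    asn: str
    prefixes: List[str]
    peers: List[Router] = field(default_factory=list)

    def announce(self) -> None:
        for peer in self.peers:
            for prefix in self.prefixes:
                peer.on_announce(prefix, self.asn)

    def withdraw(self) -> None:
        for peer in self.peers:
            for prefix in self.prefixes:
                peer.on_withdraw(prefix)


isp = Router("example-isp")
content_net = Network("AS64510", ["203.0.113.0/24"], peers=[isp])

content_net.announce()
print(isp.can_reach("203.0.113.10"))  # True: the prefix is in the ISP's table
content_net.withdraw()
print(isp.can_reach("203.0.113.10"))  # False: the network has "disappeared"
```

Note that nothing about the withdrawn network’s servers changed in this model; only the peer’s knowledge of how to reach them did.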

As a result, both Facebook’s BGP routes & its domain name system (DNS) records effectively disappeared from the Internet. DNS is the service that translates domain names such as facebook.com into IP addresses (& vice versa), so that browsers & apps know where to connect.

On Mon., Facebook’s DNS servers were unavailable, meaning that DNS resolvers couldn’t respond to queries asking for the IP address of facebook.com, Cloudflare outlined.
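The link between the two failures can be sketched as a toy resolver in Python: to answer a query, a resolver must reach one of the zone’s authoritative nameservers, and if the prefix those nameservers live in has been withdrawn, the lookup fails even though the records themselves still exist. The zone, nameserver addresses and records below are hypothetical, the zone lookup is simplified to the last two labels of the name, and the sketch conflates “no such record” with “no answer” by returning None for both.

```python
import ipaddress
from typing import Dict, Iterable, List, Optional

# Hypothetical data for an illustrative zone; all names and addresses are made up.
NAMESERVERS: Dict[str, List[str]] = {"example.com": ["203.0.113.53", "203.0.113.54"]}
RECORDS: Dict[str, str] = {"www.example.com": "198.51.100.80"}


def reachable(address: str, routed_prefixes: Iterable[str]) -> bool:
    """True if some currently announced prefix covers the address."""
    addr = ipaddress.ip_address(address)
    return any(addr in ipaddress.ip_network(p) for p in routed_prefixes)


def resolve(name: str, routed_prefixes: Iterable[str]) -> Optional[str]:
    """Toy resolver: ask the zone's authoritative servers, if any can be reached."""
    routed = list(routed_prefixes)
    zone = ".".join(name.split(".")[-2:])  # simplification: last two labels
    for ns in NAMESERVERS.get(zone, []):
        if reachable(ns, routed):
            return RECORDS.get(name)  # None here would mean "no such record"
    return None  # no nameserver reachable: a real resolver would typically return SERVFAIL


normal = ["203.0.113.0/24", "198.51.100.0/24"]
outage = ["198.51.100.0/24"]  # the nameservers' prefix has been withdrawn
print(resolve("www.example.com", normal))  # 198.51.100.80
print(resolve("www.example.com", outage))  # None
```

In the outage case the web server’s own prefix is still routed, yet nobody can find its address, which mirrors why withdrawn routes to DNS servers were enough to take the whole service offline.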

‘A Bit Creaky’

John Bambenek, Principal Threat Hunter at IT/security operations firm Netenrich, explained on Mon. that the core protocols that make up the internet are getting ‘a bit creaky’ at this point. Created in the 1970s & 1980s, they “were not designed with the scale of the Internet as it exists today,” he commented.

“They also can be very susceptible to human error where small changes can create catastrophic outages, which we see every year or so,” Bambenek continued.

“In some ways, this problem will get worse as these protocols are taken for granted, and those who helped develop & implement them are beginning to reach retirement age.”

Data Breach Conspiracy Theories

As Vice reported, conspiracy theories about the outage being related to a data breach managed to spread even without Facebook & all of its hoax-disseminating messaging apps.

One of the most popular theories about the outage concerned a supposed attack that led to 1.5b Facebook records being sold on the RaidForums criminal forum.

X2Emails

The conspiracy sprang from a Sept. 22 post from a supposed company called X2Emails that advertised “a database which hold more than 1.5b Database of Facebook these database scraped this year & 100% emails are included & phone as well.”

The conspiracy theory was easy to dismiss. For one thing, the post’s author stated that the data was scraped, meaning that it wasn’t compromised data coming from a threat actor with internal access.

Scraped Data

Jake Williams, Co-Founder & CTO at incident response firm BreachQuest, explained Mon. that “This wouldn’t be the 1st time we’ve seen scammers with scraped data try to capitalise on an outage or related news about an organisation to cash in.”

That’s not to downplay the damage that can be done with scraped data: in July, just days after a data-scraping operation aimed at LinkedIn was discovered, evidence surfaced on a popular hacker forum that the vast trove of lifted data was being collated & refined to identify specific targets.
