Author(s): Martin Garner
On Tuesday, Cloudflare, a cloud computing services company, suffered a major outage that resulted in millions of websites becoming inaccessible. Cloudflare provides computing and web security services to businesses around the globe. The outage caused sites to show a 502 Bad Gateway error, which means that an Internet server has received an invalid response from another server it's trying to contact.
The episode underscores how companies have overwhelmingly come to rely on Cloudflare. The company's platform effectively acts as a buffer between a website and the end user to block attacks that could bring a site down by overloading it. A bug in Cloudflare's firewall made the system think it was under attack, so it pulled computing power from other company products to shore up defences, as it was programmed to do. But the system took so many resources that it caused outages, as it starved other Cloudflare products that help online businesses deliver their webpages to people around the world.
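The failure mode described above, one subsystem taking so much shared capacity that the others starve, can be illustrated with a small sketch. This is not Cloudflare's actual mechanism: the function name `split_capacity`, the abstract "units" of computing power and the reserved floor are all assumptions made for illustration. The idea is only that a hard reservation for other services limits how much a defensive system can claim, even when it believes it is under attack.

```python
from typing import Tuple


def split_capacity(
    total_units: int, firewall_demand: int, reserved_units: int
) -> Tuple[int, int]:
    """Return (firewall_units, other_units) for a shared pool.

    Hypothetical model: `reserved_units` is a floor that the firewall
    may never take from the other services, whatever it demands.
    """
    if not 0 <= reserved_units <= total_units:
        raise ValueError("reserved_units must be between 0 and total_units")
    # The firewall gets what it asks for, but never more than the
    # capacity left over once the reservation is honoured.
    firewall_units = max(0, min(firewall_demand, total_units - reserved_units))
    return firewall_units, total_units - firewall_units


# A false alarm makes the firewall demand far more than the pool holds.
under_false_alarm = split_capacity(total_units=100, firewall_demand=500, reserved_units=30)
# Normal operation: the firewall asks for a modest share.
normal_load = split_capacity(total_units=100, firewall_demand=20, reserved_units=30)
```

With a reservation in place, the false alarm still slows other services, but it cannot take them to zero.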
This latest issue at Cloudflare comes just a week after a different round of global outages hit the company's network. That incident took down a host of popular websites and apps including chat service Discord. Cloudflare pinned the blame on network issues with a fundamental Internet routing system called the Border Gateway Protocol (BGP). Every time a user loads a website or sends an e-mail, BGP is responsible for optimizing the route that the data takes across the sprawling, intertwined networks that make up the Internet. And when it goes wrong, the whole Internet feels the pain.
BGP was conceived in 1989 and has largely remained unchanged since 1994. Although it has scaled surprisingly well over the years, there's no denying that the Internet is very different now from how it was 25 years ago. Another fundamental protocol called the Domain Name System (DNS) has also had trust issues recently. Simply put, BGP is the Internet's navigational system, and DNS is its address book. BGP and DNS hijacking has become a major security problem around the world, as some bad actors are intent on inflicting damage.
Online users worldwide have had to deal with several failures over the past month, including a Google Cloud outage that knocked out essential smart home services and left users unable to unlock their doors, and an Instagram wobble that infuriated social media lovers. The problems haven't been confined to the US: European telecom services were affected by a BGP attack in June 2019.
Key parts of the Internet have become increasingly centralized and interdependent. Although there are several levels of redundancy checks, failure of one system can disrupt services globally. The global Internet superhighway is still quite fragile, a concerning thought given the shift toward the Internet of things, in which even the most common objects are part of our networks.
The concept of graceful degradation is used in designing machinery and systems in major industries, such as critical infrastructure, automotive and aerospace. It means that a system can continue to work, possibly with reduced functionality, if a component develops a fault. It's why cars have two independent braking circuits. This concept is used in the design of some websites, but isn't widespread enough in the plumbing underneath. As projects in the Internet of things expand, this approach will become essential — it will simply not be good enough for an oil refinery or smart city to stop completely if there's a fault.
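As a minimal sketch of graceful degradation in software, assume a reading that normally comes from a primary source but can fall back to the last known value when that source develops a fault. The names `read_with_fallback`, `healthy_sensor` and `faulty_sensor` are hypothetical and chosen only for illustration. The caller is told when the answer is degraded, so it can keep working with reduced functionality rather than stopping completely.

```python
from typing import Callable, Dict, Tuple


def read_with_fallback(
    key: str,
    primary: Callable[[str], str],
    last_known: Dict[str, str],
) -> Tuple[str, bool]:
    """Return (value, degraded); degraded is True if the primary failed."""
    try:
        value = primary(key)
    except Exception:
        # The primary component is faulty: serve the last known value,
        # if there is one, instead of failing outright.
        if key in last_known:
            return last_known[key], True
        raise  # nothing to fall back on, so surface the fault
    last_known[key] = value  # remember the fresh value for future faults
    return value, False


def healthy_sensor(key: str) -> str:
    """Hypothetical primary source that is working normally."""
    return "21.5C"


def faulty_sensor(key: str) -> str:
    """Hypothetical primary source that has developed a fault."""
    raise ConnectionError("upstream unreachable")


cache: Dict[str, str] = {}
first = read_with_fallback("room-1", healthy_sensor, cache)
second = read_with_fallback("room-1", faulty_sensor, cache)
```

Here the second call returns the stale reading together with a degraded flag, the software counterpart of a car's second braking circuit.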
It will become crucial for the Internet to be more decentralized, to enable local networks to carry out local tasks, and to allow more players to play a role in safeguarding it.