The CrowdStrike outage: a tale of trust & tradeoffs
The internet is surprising: fragile yet hard to fix
“CrowdStrike failure: Kiwis wake after night of chaos following global IT outage.” So reported the NZ Herald Saturday morning. Details of the underlying cause of the outage — characterised by some as the biggest ever — are slowly emerging.1 At this stage the outage appears to have been caused by an error by well-intentioned “good” actors, rather than a malicious act by “bad” actors.
To summarise this post: Microsoft, in conjunction with security software providers such as CrowdStrike, appears to have built a fast-response system to permit a robust response to emerging global cybersecurity attacks. It was this system, however, that backfired. An error in a small software component caused the type and scale of outage that the system was designed to avert.
There are tradeoffs in designing a fast-response system. High levels of trust are required between all participants.
The detailed story
Falling back on my previous life as an IT nerd …
Cybersecurity firm CrowdStrike shipped an update to its Falcon Sensor software at 4:09pm on 19 July (NZST). That update contained a bug (i.e. a software error with undesirable consequences), which caused Windows computers with Falcon installed to stop working — typically displaying a “blue screen of death”. CrowdStrike issued a corrected update just 78 minutes later, but the adversely affected computers still required manual intervention, in some cases technically complex, to restore them to normal operation.
Computers not running Windows, and those running Windows without Falcon installed, were not affected. Not directly anyway. But in this internet-connected world, pretty much every computer relies on services provided by other computers, even for seemingly basic functionality. This outage reminds us that these “other” computers are mission-critical to the functioning of crucial parts of the 21st Century economy — payment systems, ambulances, airlines, hospitals, supermarkets etc.
For the most part, those running mission-critical computers know how important those computers are, and take steps to protect them from known threats. More than half of Fortune 500 companies use Falcon to keep their computers safe from bad actors’ malware and cyberattacks, according to CrowdStrike. But security software adds complexity, and complexity increases the possibility of error. And threats, it seems, can come from anywhere, including from good actors.
Who you gonna trust?
An operating system (OS) is specialised software that manages a computer’s hardware and provides common services for the programs that run on it. An OS is a fiendishly complex piece of software.
Most of the computers in the world today run an OS derived from one of two originals: Unix, dating back to 1969, and Windows NT, dating back to 1993. Both were designed with security in mind, containing elaborate mechanisms to isolate programs from one another (so a bug in one does not affect others), and to isolate the data and programs belonging to one user from other users (so a bad actor can only damage their own stuff). A superuser (named administrator in Windows, root in Unix) has unfettered rights — only they can (for example) make changes to the OS itself, and add or remove other users. Those other users implicitly trust the superuser to do good — or at least nothing bad. The superuser, in turn, is implicitly trusting the OS authors to have designed and implemented an OS that is robust against likely threats.
Both Unix and Windows pre-date the internet as we now know it. Trust, in a hyper-connected world, involves many more parties. Security is multi-layered, on the principle that an error or malicious action can be isolated and mitigated, unable to affect more-privileged layers where it might do more damage. But this model requires a most-privileged layer, known as the kernel.2 Analogous to the superuser described above, kernel software has unfettered access to what goes on in less-privileged layers. There is, however, no higher layer that detects and handles errors, security breaches and malicious actions by kernel software. Errors in kernel software go unhandled (though they may have undesirable effects) unless the erroneous code breaches rules built into the computer hardware itself. In the latter case, the breach is reported as a blue screen of death (Windows) or a kernel panic (Unix).
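The layering logic can be sketched in a toy model. Nothing below is real operating system code — all names are invented for illustration. The point it demonstrates: the same bug is survivable in a less-privileged layer, because the layer above catches it, but fatal in the most-privileged layer, because nothing sits above the kernel.

```python
class MachineCrash(Exception):
    """Stands in for a blue screen of death (Windows) or kernel panic (Unix)."""

def run_user_program(program):
    """User-mode faults are caught by the more-privileged layer above:
    the offending program dies, the system keeps running."""
    try:
        program()
        return "ok"
    except Exception:
        return "program terminated; system keeps running"

def run_kernel_code(code):
    """Kernel-mode faults have no supervisor layer left to contain them:
    the whole machine goes down."""
    try:
        code()
        return "ok"
    except Exception as err:
        raise MachineCrash(f"kernel fault: {err}")

buggy = lambda: 1 / 0  # the same bug, placed in two different layers

print(run_user_program(buggy))    # contained: system survives
try:
    run_kernel_code(buggy)
except MachineCrash as crash:
    print(crash)                   # fatal: the toy "blue screen"
```

The asymmetry is the whole story: identical code, entirely different blast radius depending on which layer it runs in.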
You may have guessed where this is heading. For security software to be comprehensive, it needs to see everything that is happening on the computer. And the only “place” with that visibility is the kernel. The way that happens in practice is that the security software provider writes a small software component called a driver, which gets loaded into, and becomes part of, the kernel each time the computer is “booted” (i.e. started). The CrowdStrike outage was caused by a buggy driver.
Drivers can potentially wreak havoc, so Windows is designed to check the cryptographic signature of every driver it loads. Only drivers that have been “signed” by software producers with digital keys issued by Microsoft are loaded. This is a security measure designed to prevent bad actors from writing malicious drivers that become part of the Windows kernel.
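The shape of that check can be sketched as follows. Windows actually uses public-key (Authenticode) signatures with certificates chaining back to Microsoft; this simplified sketch substitutes a standard-library HMAC with an invented key, purely to keep the example self-contained.

```python
import hashlib
import hmac

# Hypothetical stand-in for the key material Microsoft issues to a vendor.
# Real driver signing uses public-key cryptography, not a shared secret.
SIGNING_KEY = b"key-issued-by-microsoft"

def sign_driver(driver_bytes: bytes) -> bytes:
    """What a vendor's build pipeline does with its issued key."""
    return hmac.new(SIGNING_KEY, driver_bytes, hashlib.sha256).digest()

def kernel_will_load(driver_bytes: bytes, signature: bytes) -> bool:
    """What the OS checks before letting a driver into the kernel."""
    expected = hmac.new(SIGNING_KEY, driver_bytes, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)

driver = b"... imagined driver image ..."
good_sig = sign_driver(driver)

print(kernel_will_load(driver, good_sig))                 # signed: loads
print(kernel_will_load(driver + b"tampered", good_sig))   # altered: rejected
```

Note what the check does and does not prove: a valid signature establishes *who* produced the driver, not that the driver is free of bugs. A faulty driver from a trusted producer sails through.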
So, those companies running Falcon on Windows are trusting both CrowdStrike and Microsoft, and trusting them to work closely together. Moreover, the rest of us, reliant on the invisible infrastructure of the internet, are (unwittingly) trusting them too.
A fast response is good … until it isn’t
Software updates on Windows happen via a semi-automated process that downloads new programs and drivers and, if necessary, prompts the user to restart their computer. That can’t explain how a buggy driver, even if properly signed by CrowdStrike, managed to spread so far in just 78 minutes.
During a reboot, the Windows kernel appears to check for updates to some — let’s call them “security-critical” — drivers and download and install them. CrowdStrike’s drivers must be included on this security-critical list.
This is where the post gets speculative. CrowdStrike must have some mechanism to instruct the computers running Falcon to request a Windows restart, and must have invoked it, presumably unintentionally, spreading its buggy driver across the world.
Such an arrangement makes sense if Microsoft, working in conjunction with security software providers, wanted a way to quickly patch the world’s computers in response to a serious emerging security threat. After all, if bad actors can deploy or activate their malware in minutes, then the good actors need a way to fight back.
But a fast response is not necessarily a considered response. The fast response mechanism worked, in a narrow technical sense, as this outage has shown. But it contained no check to prevent it from spreading faulty software.
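One common safeguard of exactly this kind is a staged (“canary”) rollout, sketched below with invented numbers and function names — nothing here reflects CrowdStrike’s or Microsoft’s actual deployment machinery.

```python
def staged_rollout(fleet_size, update_is_faulty, stages=(0.01, 0.10, 1.00)):
    """Push an update to a growing fraction of the fleet, halting as soon
    as an earlier stage reports failures. Stage fractions are illustrative."""
    updated = 0
    for fraction in stages:
        updated = int(fleet_size * fraction)  # machines updated so far
        if update_is_faulty:
            # Canary machines crash and report back: stop the rollout here.
            return updated, "halted"
    return updated, "completed"

# A faulty update is stopped after the 1% canary stage...
print(staged_rollout(1_000_000, update_is_faulty=True))    # (10000, 'halted')
# ...while a good update reaches the whole fleet.
print(staged_rollout(1_000_000, update_is_faulty=False))   # (1000000, 'completed')
```

The tradeoff is visible in the sketch: every stage adds delay while the early cohort is observed, and that delay is precisely what a fast-response mechanism, built to counter attacks that unfold in minutes, is designed to eliminate.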
More to the point, at least in an economics blog, it appears there are unavoidable tradeoffs in fast responses to security threats from bad actors. Yes, you can get a fix out there quickly. And that would be an unalloyed good thing — unless the quick fix makes the problem worse.
In unfortunate circumstances such as these, the fast response mechanism can be a security threat in itself. Yes, it is potentially useful. But it comes at a cost.
By Dave Heatley
1. This post is based on publicly available information. Details are still emerging, so it’s quite possible that parts of this post will quickly become outdated. The main points of this post should still hold; if they do not then I will post a corrected version. Sources I consulted for this post include:
2. Most computers built in the last decade or so have at least some software running at even higher privilege levels than the OS kernel, for example software that loads the OS and checks whether it has been tampered with before the OS is allowed to run. Such software is outside the scope of this post.