CrowdStrike and the Fallout – What Happened?
On July 19th, 2024, the world came to a stop when the CrowdStrike Falcon sensor incident halted computer systems around the globe. It is safe to say there were probably systems impacted in every country, because the Falcon sensor is so widely deployed on Microsoft Windows machines, and therefore millions of systems were affected. Though Microsoft estimated that only about 8.5 million Windows devices (less than 1% of all Windows machines) were out of service, this had a huge impact across the world and brought to light what could happen if a larger percentage of Microsoft products or systems were impacted.
I asked ChatGPT how it would describe the incident, and this is what I received back (not bad):
Let's imagine your computer is like a robot and this robot has sensors that help it understand what's going on around it. These sensors could be like its eyes, ears, or hands, allowing it to see, hear, or touch things.
Now, the robot's brain needs to know what the sensors are sensing, right? This is where the "kernel" comes in. The kernel is like a translator and controller for the brain. It helps the brain understand the signals from the sensors and tell them what to do.
Here's a simple way to think about it:
Sensor: This is like the robot's eyes, ears, or hands. It gathers information from the world, like seeing a ball, hearing a sound, or feeling a touch.
Kernel: This is like the translator for the robot's brain. It takes the information from the sensors and helps the brain understand it. It also helps the brain send instructions back to the sensors.
So, if the robot sees a ball (the sensor detects something), the kernel helps the brain understand that it's a ball and tells the robot's hand to pick it up. Without the kernel, the brain wouldn't know what the sensors are sensing, and the robot wouldn't work properly.
In a computer, sensors could be things like a keyboard, mouse, or camera, and the kernel helps the computer's brain (the CPU) understand and control these devices.
Now, let's say the helper (kernel driver) suddenly stops working or falls asleep. What happens? The brain can't talk to the parts anymore! So:
The robot might stop moving: Just like the computer stops working properly.
Some parts might not work at all: Maybe the wheels stop turning, or the arms can't lift anything.
The robot might freeze: Like how a computer screen can freeze and not respond to any clicks or keys.
When this happens to a computer, we say the system "goes down," meaning it can't do its job until the helper (kernel driver) is fixed or restarted.
OK, so why was this impact SO significant, and why didn't the fix automatically "resolve" the problem with a reboot?
Computer systems operate at different levels. The CrowdStrike kernel driver impacted the entire system rather than just an application. If this had been an ordinary CrowdStrike application, the failure would have been much more contained. Think about it like this: if a faulty update runs in kernel mode, it impacts the entire system (in this case, the servers). If a faulty update runs in application mode, it impacts only that application. Because the failure happened in kernel mode, the entire system went down, which resulted in blue screens. Many affected machines crashed again as soon as they restarted, so a reboot alone often did not help. Instead, each machine needed a manual fix, which can be very challenging for servers that sit in data centers around the country and the world.
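To make the "contained versus system-wide" idea a bit more concrete, here is a small Python sketch. It is only an analogy and does not show how Windows or the Falcon sensor actually work: a parent script stands in for the system, and a child process stands in for an application running in its own isolated space. The names run_isolated and FAULTY_UPDATE are made up for this illustration.

```python
import subprocess
import sys

# Hypothetical stand-in for a bad update: code that crashes as soon as it runs.
FAULTY_UPDATE = "raise RuntimeError('bad content update')"


def run_isolated(code: str) -> int:
    """Run a snippet in its own process, loosely like an application in user mode.

    A crash inside the child ends only the child. The caller just sees a
    non-zero exit code and keeps running.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,  # keep the child's traceback off our console
        text=True,
    )
    return result.returncode


exit_code = run_isolated(FAULTY_UPDATE)

# We only reach this line because the crash stayed inside the child process.
host_still_running = True
print(f"child exit code: {exit_code}, host still running: {host_still_running}")

# By contrast, running the same faulty code inside this process, for example
# with exec(FAULTY_UPDATE), would raise here and stop the whole script.
# That is the closer analogy to a fault in kernel mode.
```

The point of the sketch is the boundary: when the faulty code runs behind a process boundary, the damage stops at that boundary. Kernel-mode code has no such boundary around it, so one fault can take the whole machine down.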
It's safe to say a lot was learned from this incident, which has been called the greatest IT failure in history. Of course, history will continue to be the judge as more information is shared and as more debates occur around how this could have been prevented, how quality assurance can be improved, how GDPR is applied and adhered to in these instances, and more.