Compiled by | Zheng Liyuan
Produced by | CSDN (ID: CSDNnews)
It has been six days since the Windows BSOD incident.
Over those six days the incident has sparked heated discussion on technology sites at home and abroad, and the name of the "culprit", CrowdStrike, has come up again and again, accompanied by doubts and condemnation:
- The system failure caused by CrowdStrike grounded thousands of flights, paralyzed hospitals, and knocked out payment systems; experts are calling it the biggest IT failure in history.
- According to Parametrix Insurance, the global technology outage triggered by CrowdStrike's faulty update has left U.S. Fortune 500 companies (excluding Microsoft) facing an estimated $5.4 billion in losses, with total global economic losses likely around $15 billion.
Against this backdrop, CrowdStrike's stock price plummeted more than 20% within the week. As an apology for causing the outage, CrowdStrike reportedly sent its partners $10 Uber Eats gift cards: "To express our apologies, your next cup of coffee or late-night snack is on us!" However, some recipients said that when they tried to redeem the card, the page reported that it "has been canceled by the issuing party and is no longer valid."
Beyond the reports focusing on CrowdStrike itself, another topic has stirred up plenty of discussion in developer circles: "If CrowdStrike had been written in Rust, would 8.5 million PCs around the world have avoided the blue screen?"
I was surprised to find that everything that has happened over the last few days was caused by something as simple as a bad dereference. The industry has been using C++ for decades, and all the tools, linters, sanitizers, tests, and peer review were not enough to prevent this. So I wonder: would things have been very different with Rust?
Not only that: Mark Russinovich, CTO of Microsoft Azure, re-shared his 2022 tweet after the incident: "Speaking of languages, it's time to stop starting any new projects in C/C++ and to use Rust where a non-GC language is required. For the sake of security and reliability, the industry should declare these languages obsolete."
Seeing many Rust enthusiasts start to declare "Yes, Rust is the only answer", Julio Merino, a veteran software engineer who is himself a Rust fan, offered a more sober analysis and reached a different conclusion: "Even Rust couldn't have saved CrowdStrike from this outage."
The following is the translation:
I'm a big fan of Rust and agree that you shouldn't keep using memory-unsafe programming languages like C++, but I'll say this: claims that Rust would have avoided last Friday's massive global IT outage are exaggerated and do Rust's reputation a disservice.
If CrowdStrike's code had been written in Rust, the likelihood of failure would indeed have been lower, but that doesn't address the root cause. So it annoyed me to see so many people claim that Rust was the one answer to this incident. Statements like that don't promote Rust adoption; they invite backlash: C++ experts understand the real root cause of the incident, and hearing such misleading claims is off-putting and only deepens the fragmentation of the systems programming world.
So why can't Rust solve this problem? I'm going to try to answer this question and dig a little deeper into what caused the failure.
Failure analysis
Here's the "post-mortem" from CrowdStrike's official statement:
On July 19, 2024 at 04:09 UTC, as part of ongoing operations, CrowdStrike released a sensor configuration update to Windows systems. Sensor configuration updates are an ongoing part of the Falcon platform's protection mechanisms. This configuration update triggered a logic error that resulted in a crash and blue screen (BSOD) on affected systems.
The sensor configuration update that caused the crash was remediated on July 19, 2024 at 05:27 UTC.
To translate the above into plain language:
1. CrowdStrike pushed a configuration update.
2. The update triggered a latent bug in the "Falcon platform".
3. That bug in Falcon caused Windows to crash.
The first two points are not surprising: configuration changes are routine for any online system, and it is not uncommon for such updates to trigger bugs in the code. In fact, most outages are caused by human-initiated configuration changes.
Obviously, we should ask why this bug exists and how to fix it to improve the stability of the product. But let's not forget the third point: why can this bug bring down an entire machine? More importantly, why did this bug bring down so many systems around the world?
Memory error
Let's start with the first question: what is the nature of the bug in Falcon?
It's simple: there is a logic error in the parser for "Channel Files" (i.e., configuration files) that tries to access an invalid memory location when it encounters certain invalid input. The specifics don't matter much: it could be a null pointer dereference, a general protection fault, and so on. The point is this: the crash was caused by an invalid memory access.
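To make the contrast concrete, here is a minimal sketch (purely illustrative; this is not CrowdStrike's actual parser, and the field layout is invented) of how a Rust parser surfaces invalid input as a recoverable value instead of a wild memory access:

```rust
// Hypothetical reader for a 4-byte little-endian field in a "Channel File".
// `get` returns None for an out-of-range slice instead of touching invalid
// memory, so the caller is forced to handle the bad-input case explicitly.
fn read_offset(channel_file: &[u8], index: usize) -> Option<u32> {
    let bytes = channel_file.get(index..index + 4)?;
    Some(u32::from_le_bytes(bytes.try_into().ok()?))
}

fn main() {
    let corrupted = vec![0u8; 2]; // stand-in for a truncated or garbage Channel File
    match read_offset(&corrupted, 0) {
        Some(v) => println!("offset = {v}"),
        None => eprintln!("invalid channel file: refusing to load it"), // handled, no crash
    }
}
```

With a raw pointer read, the same mistake would compile cleanly and only surface at run time as an invalid memory access.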
At this point, some Rust fanatics might jump up and say: "Look, sure enough! If the code had been written in Rust, this bug wouldn't exist!" And I can't deny it: this particular bug really wouldn't have happened with Rust.
But so what? Even if this class of bug is avoided, the next time a bug comes along that Rust can't prevent, the system will still go down. Ignoring the nature of Falcon and focusing only on the memory error is missing the forest for the trees.
So, what exactly is Falcon?
Kernel panic
In my opinion, Falcon is a kind of "malware... just malware for the good guys": in other words, an endpoint security system. Falcon is typically installed on corporate machines so that security teams can detect and neutralize threats in real time (while also monitoring employee behavior). There is some value in this: most cyberattacks start by breaking into corporate machines through social engineering.
A product like this must have control over the machine: it must be able to intercept all of the user's file and network operations to scan their contents, and it must be tamper-proof so that a "savvy" corporate user who reads some dubious "fix your WiFi" instructions online to avoid filing an IT ticket can't simply disable it.
How do you build a product like Falcon? The easiest way, and the one Windows encourages, is to write a kernel module. And indeed, Falcon is a kernel module, so it runs in kernel space. This means that any error in Falcon's code can corrupt the running kernel and crash the entire system.
And I do mean "any error". It isn't only memory errors that can crash the kernel; in fact, it doesn't even take a "crash" to render the machine unusable: a deadlock can stop the kernel from making progress, a logic error in a system call handler can prevent user space from ever opening a file again, an unbounded recursive algorithm can exhaust the kernel stack... There are so many ways to destabilize a kernel that I'd say even Rust can't fully prevent this kind of incident.
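As a tiny illustration (a user-space stand-in, not kernel code), the following is 100% safe Rust with no `unsafe` and no memory error, yet it never completes: `std::sync::Mutex` is not re-entrant, so the second `lock()` on the same thread either deadlocks or panics, and inside a kernel either outcome would leave the machine unusable.

```rust
use std::sync::Mutex;

fn main() {
    let state = Mutex::new(0u32);

    // First lock succeeds and is held until `_first` is dropped at end of scope.
    let _first = state.lock().unwrap();

    // Second lock on the same thread: std::sync::Mutex is not re-entrant, so
    // this call deadlocks or panics. Memory safety has nothing to say about it.
    let _second = state.lock().unwrap();

    println!("never reached");
}
```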
Rust's memory safety only addresses one class of crash. Beyond that, the Rust ecosystem's focus on correctness does help minimize other kinds of logic errors. But... as much as we all want to be perfect, we also have to accept that mistakes happen. Asserting that Rust is the one answer to this problem is just as irresponsible as insisting on sticking with C++.
Keep in mind that there are far more C++ developers working in kernel space than there are Rust developers who understand kernel internals. Most of those C++ developers can see how absurd this claim is, which only increases the hostility between the two communities and completely undermines the goal of getting people to move to a safer language. Rust developers know that Rust really can improve the status quo, but C++ developers won't accept it when the arguments they hear don't resonate with them.
From kernel space to user space
Some have also said that none of this would have happened if Falcon hadn't been running in the kernel. Well, that's a slightly better argument, but... on its own it doesn't necessarily solve the problem either.
As I mentioned earlier, Falcon needs to be as tamper-proof as possible, prevent malware from interfering with it, and prevent compromised users from trying to disable it. If malware or humans can do this easily, then this product is useless.
Windows could perfectly well refuse to allow Falcon-style kernel modules and instead have the kernel expose a set of APIs that user-space applications could use to provide the same functionality. And did you know that Microsoft actually tried to push Windows in this direction, but antivirus vendors threatened to sue on antitrust grounds and the whole plan came to nothing? As a result, we are stuck tolerating a less secure system because antivirus companies need to sell their annoying products.
But let's set that aside. Even if Falcon ran in user space and talked to the kernel through a controlled API... would that be enough to prevent a system-wide failure? Note that these APIs would also need to be tamper-proof. Imagine that you want the user-space driver to validate every binary before the kernel executes it; in other words, the kernel needs an answer from the user-space driver on every execution. If the driver has a problem, the system can no longer run any programs.
But if you make the kernel's communication with the driver optional, so that the kernel can tolerate a crashed driver, you open the door for malware to crash the driver first and then attack the system.
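The dilemma can be made concrete with a toy model (the API and names below are hypothetical, not anything Windows actually exposes): the kernel asks a user-space security driver for a verdict before running a binary, and it has to pick a policy for the case where the driver never answers.

```rust
// Verdict returned by a hypothetical user-space security driver.
enum DriverVerdict {
    Allow,
    Deny,
}

// Models the kernel asking the driver; None means the driver crashed or timed out.
fn ask_driver(_path: &str, driver_alive: bool) -> Option<DriverVerdict> {
    if driver_alive {
        Some(DriverVerdict::Allow)
    } else {
        None
    }
}

// The kernel's policy decision: fail-closed bricks the machine when the driver
// dies; fail-open lets malware crash the driver and then run whatever it wants.
fn may_execute(path: &str, driver_alive: bool, fail_open: bool) -> bool {
    match ask_driver(path, driver_alive) {
        Some(DriverVerdict::Allow) => true,
        Some(DriverVerdict::Deny) => false,
        None => fail_open,
    }
}

fn main() {
    println!("driver up:                {}", may_execute("C:\\app.exe", true, false));
    println!("driver down, fail-closed: {}", may_execute("C:\\app.exe", false, false));
    println!("driver down, fail-open:   {}", may_execute("C:\\app.exe", false, true));
}
```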
Therefore, simply "migrating to userspace" is clearly not the solution.
Flaws in the deployment process
If we have to accept that bugs exist, that memory bugs aren't the only cause of system crashes, and that moving drivers to user space isn't a good solution... is there really nothing to be done? Is there truly no way to prevent this kind of incident?
All of the measures above can (and should) be taken to reduce the probability of failure, but we have to accept that this particular bug was just one specific trigger; a different trigger could have had similar consequences. The root cause of this global outage lies in the release process for configuration changes.
According to SRE 101 (or DevOps, whatever you want to call it), configuration changes must be made in stages, deployed in a slow and controlled manner, and validated at each step. These changes should be validated on a small scale before being pushed globally, and each push should be incremental.
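Sketched as code (the ring names and health check below are invented for illustration; this is not CrowdStrike's pipeline), a staged rollout simply means that each ring only receives the update after the previous ring reports healthy:

```rust
struct Ring {
    name: &'static str,
    fraction_of_fleet: f64,
}

// Hypothetical push + health check; a real pipeline would wait on crash and telemetry data.
fn push_and_verify(ring: &Ring) -> bool {
    println!(
        "pushing to {} ({:.1}% of fleet) and watching health signals...",
        ring.name,
        ring.fraction_of_fleet * 100.0
    );
    true // pretend every ring passes in this sketch
}

fn main() {
    let rings = [
        Ring { name: "internal canary", fraction_of_fleet: 0.001 },
        Ring { name: "early adopters",  fraction_of_fleet: 0.01 },
        Ring { name: "region 1",        fraction_of_fleet: 0.10 },
        Ring { name: "global",          fraction_of_fleet: 1.00 },
    ];

    for ring in &rings {
        if !push_and_verify(ring) {
            // A bad Channel File stops here instead of reaching every machine on Earth.
            eprintln!("health check failed at '{}': halting rollout", ring.name);
            return;
        }
    }
    println!("rollout complete");
}
```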
Given how critical Falcon is and how large the impact of a bug can be, I find it hard to believe that CrowdStrike did no validation of its deployments at all. But judging from how this latest update rolled out, they apparently did no such testing or canary deployment (trying the change on a small group of users before rolling it out to the whole fleet), which is an astonishing oversight.
So CrowdStrike's deployment practices were the real culprit in this incident: the outage was a process problem, not a code or technology problem, and switching to Rust would not have helped.
CrowdStrike Publishes Preliminary Review Report, Summary: "Tests and Processes Are Inadequate"
As Julio Merino noted, CrowdStrike has since published a preliminary post-incident review on its official website, laying out the overall timeline of the incident:
- The chain of events began on February 28, when CrowdStrike developed and distributed a Falcon sensor update designed to detect an emerging attack technique that abuses Windows named pipes; the sensor update passed routine testing before release.
- On March 5, the update was stress-tested and validated for production use. The same day, CrowdStrike distributed a Rapid Response content update to customers using the new malicious named-pipe detection.
- Between April 8 and April 24, CrowdStrike pushed three more Rapid Response updates using this new template, all of which it says "performed as expected in production."
- On July 19, CrowdStrike released two more Rapid Response updates using the March sensor template, but this time one of them contained data in the wrong format. The validation system CrowdStrike uses to check whether content updates will behave as expected itself had a problem, so it failed to catch the error in the configuration file before it was pushed to everyone. As a result, a faulty update that should have been stopped went out and blue-screened 8.5 million PCs worldwide.
After wading through CrowdStrike's lengthy preliminary review, some netizens summed it up pointedly: "All of that just to say that our tests and processes weren't good enough and we accidentally shipped garbage"; "Impressive word count, but it boils down to bugs in the test code and not enough tests"; "Sorry, the only test we ran on this update was an automated test, and it didn't really pass."
So the cause of this incident is definitely not something that switching to Rust would fix; the root cause lies in inadequate testing and deployment processes. In its post-incident summary, CrowdStrike also said it will strengthen software testing before releasing updates and roll updates out gradually. Its remediation plan is roughly divided into three parts:
1. Software resiliency and testing
(1) Improve Rapid Response Content testing by using the following test types: local developer testing, content update and rollback testing, stress testing, fuzz testing and fault injection, stability testing, and content interface testing (a minimal sketch of a fuzz-style test follows this list);
(2) Add additional validation checks to the Rapid Response Content validator;
(3) Enhance the existing error handling in the content interpreter.
2. Rapid Response Content deployment
(1) Adopt a staggered deployment strategy for Rapid Response Content, starting with a canary deployment and gradually rolling updates out to larger portions of the install base;
(2) Improve monitoring of sensor and system performance, collecting feedback during Rapid Response Content deployment to guide the phased rollout;
(3) Give customers greater control over the delivery of Rapid Response Content updates by allowing them to choose when and where these updates are deployed;
(4) Provide content update details via release notes that customers can subscribe to.
3. Third-party verification
(1) Conduct multiple independent third-party security code reviews;
(2) Conduct independent reviews of the end-to-end quality process, from development through deployment.
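As a minimal, dependency-free sketch of the "fuzz testing" item above (reusing the hypothetical `read_offset` parser from earlier; a real setup would use a coverage-guided fuzzer such as cargo-fuzz), the idea is to feed the parser large numbers of pseudo-random inputs and require that it never crashes, only returns `None`:

```rust
// Hypothetical parser under test (same shape as the earlier sketch).
fn read_offset(channel_file: &[u8], index: usize) -> Option<u32> {
    let bytes = channel_file.get(index..index + 4)?;
    Some(u32::from_le_bytes(bytes.try_into().ok()?))
}

// Tiny xorshift64 generator so the example needs no external crates.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn main() {
    let mut seed: u64 = 0x5eed_cafe;
    for case in 0..10_000u32 {
        let len = (xorshift(&mut seed) % 64) as usize;
        let data: Vec<u8> = (0..len).map(|_| xorshift(&mut seed) as u8).collect();
        let index = (xorshift(&mut seed) % 64) as usize;

        // The only acceptable outcomes are Some(_) or None; any panic fails the sweep.
        let _ = read_offset(&data, index);

        if case % 2_000 == 0 {
            println!("{case} cases executed without a crash");
        }
    }
    println!("fuzz-style sweep finished: no crashes");
}
```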
Reference Links:
https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/
https://blogsystem5.substack.com/p/crowdstrike-and-rust