
8.5 million PCs worldwide hit by blue screens: the culprit was a code logic error, and the security giant is being asked whether the software was tested at all before the update was pushed?!

Compiled by | Su Mi

Produced by | CSDN (ID: CSDNnews)

Last Friday, a "Microsoft blue screen" incident swept the world. Microsoft's preliminary estimate is that about 8.5 million Windows PCs were affected. Although that is less than 1% of the world's Windows machines, the investigation has shown that the so-called "Microsoft blue screen" incident was actually a disaster caused by a software update released by the independent cybersecurity company CrowdStrike.

Banks, airlines, retailers and other businesses were hit, and some netizens reported more personal consequences:

A friend of mine was stuck in the hospital all day. Their computer system malfunctioned, causing delays in treatment. Medical delays can kill people.

Yet this technology crisis of epic proportions actually had precursors; few people paid attention to them before it broke out.

A logic error is at the root of the problem

Soon after the BSOD incident, many engineers dug into the problem layer by layer and traced the crash directly to csagent.sys, a file belonging to the CrowdStrike software.

Against this background, Patrick Wardle, founder of the Objective-See Foundation, looked into why csagent.sys crashed the system and tweeted that the faulting instruction was:

mov r9d, [r8]           

Here R8 is an unmapped address. It is taken from an array of pointers (held in RAX), and the entry at index RDX (0x14 * 0x8) holds an invalid memory address.
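To make the address arithmetic concrete: with 8-byte pointers, entry 0x14 of the table sits 0x14 * 0x8 = 0xA0 bytes past the table base. The sketch below models that lookup in Python. It is not CrowdStrike's code: `read_u32`, `UnmappedAddress`, and the idea of representing memory as a dictionary of mapped addresses are all hypothetical, chosen only to show the kind of validation that turns a bad pointer into a handled error instead of a crash.

```python
from typing import Dict, List

POINTER_SIZE = 0x8  # bytes per pointer on x86-64


class UnmappedAddress(Exception):
    """Raised when a pointer refers to memory that is not mapped (hypothetical)."""


def entry_offset(index: int) -> int:
    """Byte offset of a pointer-table entry from the table base."""
    return index * POINTER_SIZE


def read_u32(memory: Dict[int, int], pointer_table: List[int], index: int) -> int:
    """Model of `mov r9d, [r8]`, with the checks added up front.

    `memory` maps each mapped address to the 32-bit value stored there;
    any address missing from it is treated as unmapped.
    """
    if not 0 <= index < len(pointer_table):
        raise IndexError(f"index {index:#x} is outside the pointer table")
    address = pointer_table[index]  # the value that would be loaded into R8
    if address not in memory:
        raise UnmappedAddress(f"{address:#x} is not mapped")
    return memory[address]  # the value that would be loaded into R9D
```

In kernel mode there is no exception to catch after the fact, which is why a driver has to validate pointers read from external data before dereferencing them.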

Other "drivers" (for example, "C-00000291-... 32.sys") appear to be obfuscated data...

Certain data or files (e.g., log entries, network activity logs, etc.) are systematically collected (ingested) by CSAgent.sys and cross-referenced (x-ref'd) with other data to help the CrowdStrike Falcon sensor identify and respond to potential security threats.

... So perhaps invalid (configuration/signature) data triggered a fault in CSAgent.sys.

Debugging would make this easier to confirm.

Source: https://x.com/patrickwardle/status/1814363400253698219

Another security researcher, Kevin Beaumont, found that "the .sys files causing the problem are channel update files, and they cause the top-level CS driver to crash because they are invalidly formatted."

CrowdStrike has since investigated and issued an urgent statement disclosing the root cause.

In the statement, CrowdStrike said: "On July 19, 2024 at 04:09 UTC, as part of ongoing operations, CrowdStrike released a sensor configuration update to Windows systems. Sensor configuration updates are an ongoing part of the protection mechanisms of the Falcon platform. This configuration update triggered a logic error resulting in a system crash and blue screen (BSOD) on affected systems."

The configuration files mentioned here are called "Channel Files" and are part of the behavioral protection mechanisms used by the Falcon sensor. Updates to Channel Files are a normal part of the sensor's operation and occur several times a day in response to new tactics, techniques, and procedures discovered by CrowdStrike. This is not a new process; the architecture has been in place since Falcon was created.

That raises the question of why there were no problems before and why this one was so severe. CrowdStrike disclosed the technical details, stating that on Windows, Channel Files reside in the following directory:

C:\Windows\System32\drivers\CrowdStrike\           

The file names start with "C-". Each Channel File is assigned a number that serves as a unique identifier. The Channel File affected in this incident is 291; its file name starts with "C-00000291-" and ends with the .sys extension.
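The naming convention can be checked mechanically. The following is a minimal sketch, assuming the identifier is the zero-padded decimal number that follows "C-", as the "C-00000291-" prefix suggests; the remainder of the name is not interpreted. The function names and the sample file names used with them are hypothetical.

```python
import re
from typing import Optional

# Assumption: "C-", then the channel number in decimal digits, then "-",
# then anything, then ".sys". Only the channel number is extracted.
CHANNEL_FILE_RE = re.compile(r"^C-(\d+)-.*\.sys$", re.IGNORECASE)


def channel_file_number(filename: str) -> Optional[int]:
    """Return the channel number encoded in a Channel File name, or None."""
    match = CHANNEL_FILE_RE.match(filename)
    return int(match.group(1)) if match else None


def is_channel_file_291(filename: str) -> bool:
    """True if the name follows the convention and its channel number is 291."""
    return channel_file_number(filename) == 291
```

For example, `is_channel_file_291("C-00000291-example.sys")` is true, while a driver such as `csagent.sys` does not match the pattern at all.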

While Channel Files end with the SYS extension, they are not kernel drivers.

Channel File 291 controls how Falcon evaluates named pipe execution on Windows systems. Named pipes are used for normal interprocess or intersystem communication in Windows.

The update at 04:09 UTC was designed to target newly observed malicious named pipes used by common C2 frameworks in cyberattacks. The configuration update triggered a logic error that resulted in an operating system crash.

This logic error may affect customers running Falcon sensor for Windows version 7.11 and later that were online between Friday, July 19, 2024 04:09 UTC and Friday, July 19, 2024 05:27 UTC.

In addition, systems running Falcon sensor for Windows 7.11 and later that downloaded the updated configuration between 04:09 UTC and 05:27 UTC were susceptible to a system crash.
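The exposure criteria above amount to a version check plus a time-window check. A rough sketch follows, assuming versions are compared on their first two numeric components and that both ends of the window are inclusive; neither detail is specified in the announcement, and `may_be_affected` is a hypothetical helper, not a CrowdStrike tool.

```python
from datetime import datetime, timezone
from typing import Tuple

# Window quoted in the announcement (UTC).
WINDOW_START = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)
WINDOW_END = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)
MIN_AFFECTED_VERSION = (7, 11)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn "7.11" (or "7.11.x") into (7, 11) for numeric comparison."""
    return tuple(int(part) for part in text.split(".")[:2])


def may_be_affected(sensor_version: str, online_at: datetime) -> bool:
    """True if a Windows host matches the published exposure criteria.

    `online_at` must be timezone-aware; comparing a naive datetime with
    the UTC window bounds raises TypeError.
    """
    in_window = WINDOW_START <= online_at <= WINDOW_END
    return parse_version(sensor_version) >= MIN_AFFECTED_VERSION and in_window
```

Comparing versions as tuples of integers rather than as strings matters here: as strings, "7.9" would sort after "7.11".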

CrowdStrike further stated: "The logic error has been corrected by updating the content in Channel File 291. No changes will be made to Channel File 291 beyond the updated logic. Falcon is still evaluating and protecting against the abuse of named pipes. This is not related to null bytes contained in Channel File 291 or any other Channel File."

The same technical lead behind the mass shutdowns and restarts of Windows XP 14 years ago?

CrowdStrike and its co-founder and CEO, George Kurtz, promptly apologized for the impact of the company's software problem:

I would like to sincerely apologize to all of you for today's disruption. Everyone at CrowdStrike understands the gravity and impact of the situation. We quickly identified the issue and deployed a fix, which allows us to devote ourselves fully to restoring our customers' systems as our top priority.

The outage was caused by a flaw found in the Falcon content update for Windows hosts. Mac and Linux hosts are not affected. This is not a cyber attack.

We are working closely with affected customers and partners to ensure that all systems are restored so that you can deliver the services your customers rely on.

However, many users evidently did not accept the apology; after all, the cost of a large-scale global outage is incalculable.

As the incident unfolded, CrowdStrike's market value and George Kurtz's net worth shrank accordingly, according to Forbes. As of 3:30 p.m. ET on Friday, Kurtz's net worth had fallen by about $300 million, from more than $3.2 billion on Thursday to about $2.9 billion, and CrowdStrike's stock price had dropped 11% from Thursday's close.

Some have also pointed the finger at George Kurtz himself. Tech analyst Anshel Sag noted that this is not the first time Kurtz has played a major role in a historic IT disaster.

On April 21, 2010, the antivirus company McAfee released an update for the software used by its enterprise customers. The update removed a critical Windows file and caused millions of Windows XP computers around the world to crash and then shut down or restart repeatedly.

Much like CrowdStrike's bug this time, the McAfee problem also had to be fixed manually.

According to Anshel Sag, George Kurtz was McAfee's CTO at the time. A few months later, Intel acquired McAfee. Not long afterwards, Kurtz left the company; in 2011 he founded CrowdStrike, now a cybersecurity giant, and has served as its CEO ever since.

"For those who don't remember, in 2010, McAfee caused a massive failure of Windows XP that brought down the Internet," Sag wrote on X, "and the man who was McAfee's CTO at the time is now the CEO of CrowdStrike."

With a little more attention, could this "massive blue screen" have been avoided?

Once or twice might be a coincidence, but when it keeps happening, one cannot help wondering how safe this security company's software delivery really is.

In the Hacker News comment section, a user named JackC wrote:

Back on April 19 of this year, CrowdStrike did something similar to our production Linux fleet, and I've been very unhappy about it.

In short, we're a civic tech lab, and we run a bunch of different websites built at different times on different infrastructure. We run the CrowdStrike software provided by our enterprise. CrowdStrike pushed an update on that Friday evening that was incompatible with the latest Debian stable release. At first we didn't notice: we patched Debian as usual and everything ran smoothly for a week, and then one day all the servers across our websites and cloud hosts crashed hard at the same time and refused to boot.

When we attached one of the disks to a new machine and checked the logs, CrowdStrike looked like the culprit, so we removed it manually and the machine booted. When we tried to reinstall it, the machine immediately crashed again.

Next, we filed a support ticket and got on a call with CrowdStrike's engineers.

It took CrowdStrike a day to respond, and then they asked for more evidence (beyond the above) that it was their fault. They acknowledged the bug a day later and spent a few weeks on a root cause analysis, eventually finding that their test matrix did not include our specific environment (a Debian stable version that was, in theory, supported).

In our own postmortem, we found no real way to prevent the same thing from happening again: "we push software to your machines at any time, whether it's urgent or not, without testing it" seems to be at the heart of this company's business model, especially for small IT teams within large enterprises. And that is exactly what they sell to their corporate customers.

https://news.ycombinator.com/item?id=41005936

Coincidentally, in May of this year, users reported a similar issue on the Rocky Linux community forum: after upgrading to Rocky Linux 9.4, their servers crashed with a kernel error. CrowdStrike support acknowledged the issue, which highlighted a lack of testing and of attention to compatibility across operating systems.

Lessons

CrowdStrike said that the sensor configuration update that caused the crashes was fixed on Friday, July 19, 2024 at 05:27 UTC, adding: "We understand how this issue occurred and are conducting a thorough root cause analysis to determine how this logic flaw occurred. This work will continue. We are committed to identifying any foundational or workflow improvements we can make to strengthen our processes. We will update our findings in the root cause analysis as the investigation progresses." Nevertheless, foreign media report that many industries are still struggling to recover, and the impact is expected to last for several weeks.

No one could have predicted that a global software crash could come down to the negligence of a single company. Caught up in the wave of the Internet, what can we as enterprises and developers do to make such incidents less likely?

CSDN put this question to a representative of UnionTech Software, a leading Chinese operating system vendor, who said:

Looking back at the whole incident, the most direct cause is that CrowdStrike pushed a faulty configuration to users without adequate testing. For a software system as a whole, however, robustness cannot depend on any single component.

Three aspects of this case are worth improving:

  • The first is operating system and software upgrades. Updates that business-critical production systems depend on cannot be pushed as casually as in the CrowdStrike incident. Updates should first be rigorously tested and then released as a grayscale (staged) rollout in small batches, moving on to the next batch of users only after the previous batch shows no problems. In addition, enterprises should configure appropriate policies through centralized management systems such as domain management platforms, to avoid uncontrollable failures caused by software updating itself over the network.
  • The second is operating system recoverability. Windows provides restore points, safe mode and similar features, but they are inconvenient, require expert operation, and make for a long recovery process. In particular, if ordinary users perform file operations in safe mode, such as deleting the problematic files by following instructions found online, a small slip can damage the operating system further.
  • The last is the contested question of system boundaries. As the infrastructure of the digital age, the operating system must have its stability and compatibility guaranteed; code that runs on it should be rigorously tested and verified, and all software should interact with the operating system through standard interfaces. Security software in particular often holds high privileges in order to enforce security policies, so it is all the more important that it respects those boundaries and is tested thoroughly, rather than relying on invasive techniques that hook system functions.
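The grayscale rollout described in the first point can be sketched as a loop that deploys one batch at a time and stops as soon as a batch reports a problem. This is only an illustration under simple assumptions: `deploy` and `is_healthy` are hypothetical callbacks supplied by the caller, and a real system would also wait for a soak period and roll back the failed batch.

```python
from typing import Callable, List, Sequence


def split_into_batches(hosts: Sequence[str], batch_sizes: Sequence[int]) -> List[List[str]]:
    """Split hosts into successive batches; a final batch takes whatever is left."""
    batches: List[List[str]] = []
    start = 0
    for size in batch_sizes:
        if start >= len(hosts):
            break
        batches.append(list(hosts[start:start + size]))
        start += size
    if start < len(hosts):
        batches.append(list(hosts[start:]))
    return batches


def staged_rollout(
    hosts: Sequence[str],
    batch_sizes: Sequence[int],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy batch by batch and halt after the first batch with an unhealthy host.

    Returns the hosts that received the update, so the caller knows the blast radius.
    """
    updated: List[str] = []
    for batch in split_into_batches(hosts, batch_sizes):
        for host in batch:
            deploy(host)
            updated.append(host)
        if not all(is_healthy(host) for host in batch):
            break  # do not push to later batches
    return updated
```

With batch sizes such as `[1, 3]` on ten hosts, a fault that shows up on the third host stops the rollout after four machines instead of ten, which is the whole point of releasing in small steps.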

References:

https://www.crowdstrike.com/blog/to-our-customers-and-partners/

https://forums.rockylinux.org/t/crowdstrike-freezing-rockylinux-after-9-4-upgrade/14041

https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/

Disclaimer: This article is edited and organized by CSDN, unauthorized reprinting is prohibited!
