Cybersecurity . Aug 2024

Resolving Microsoft CrowdStrike Integration Issues

Introduction

In our highly connected world, even the big players in cybersecurity like Microsoft and CrowdStrike can hit a bump in the road. When they face an outage, it's a big deal because it shows that even top-notch systems have their weak spots. This reminds us all just how crucial strong cybersecurity is for keeping our data and services safe from sophisticated threats. Let's break down what happened with the Microsoft CrowdStrike outage, how it impacted the world, and what steps were taken to fix it. By understanding these details, we can better grasp the challenges of managing cybersecurity in our digital age.

What might be considered the largest IT outage in history was triggered by a botched software update from security vendor CrowdStrike, affecting millions of Windows systems around the world. Insurers estimate the outage will cost U.S. Fortune 500 companies $5.4 billion. The outage occurred July 19, 2024, with millions of Windows systems failing and showing the infamous blue screen of death (BSOD).

CrowdStrike -- the company at the core of the outage -- is an endpoint security vendor whose primary technology is the Falcon platform, which helps protect systems against potential threats in a bid to minimize cybersecurity risks. In many respects, the outage was a real manifestation of fears that computing users had at the end of the last century with the Y2K bug. With Y2K, the fear was that a bug in software systems would trigger widespread technology failures. While the CrowdStrike failure was not Y2K, it was a software issue that did, in fact, trigger massive disruption on a scale that has not been seen before.

What Happened: Understanding the Outage

Overview of the Incident

The Microsoft CrowdStrike outage was a major event that kicked off early on a Friday. The trouble started with a software update from CrowdStrike, targeting their Falcon sensor security software on Microsoft Windows. This update caused widespread “blue screens of death,” those infamous error screens on Windows.

Details of the Affected Updates

CrowdStrike’s update was supposed to enhance the Falcon sensor’s ability to detect new cyber threats. Instead, this routine sensor configuration update triggered a logic error in the sensor. It rolled out just after midnight Eastern time (04:09 UTC) on Friday and led to system crashes.

Immediate Impacts Detected

The effects were severe and widespread, hitting various sectors globally. Critical services like air travel faced massive disruptions, with thousands of flights canceled and delays piling up. The healthcare sector was also hit hard, with some surgeries postponed and emergency services experiencing outages. This incident highlighted how essential cybersecurity software is to our modern digital infrastructure.

Responses from CrowdStrike and Microsoft

Statements from CrowdStrike and Microsoft Executives

CrowdStrike’s CEO apologized for the disruption and assured customers that the company had identified and fixed the issue and was focused on restoring their systems. Microsoft deployed experts to work with affected customers and collaborated with other cloud providers to mitigate the impact.

Technical Steps Taken to Resolve the Issue

CrowdStrike pinpointed the problematic update and reverted changes to stabilize systems. Microsoft provided manual remediation documentation and scripts and updated the Azure Status Dashboard to keep customers informed. Both companies mobilized full resources to address the issue quickly.

Customer Communication and Support Efforts

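Both companies kept customers informed while the fix rolled out: CrowdStrike’s CEO apologized publicly, and Microsoft posted updates to the Azure Status Dashboard alongside its remediation guidance. Although this outage was not a data breach, incidents that do involve personal or sensitive information can also result in regulatory fines, legal liabilities, and damage to an organization’s compliance standing with industry standards and regulations.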

What services were affected

Microsoft estimated that approximately 8.5 million Windows devices were directly affected by the CrowdStrike logic error flaw. That's less than 1% of Microsoft's global Windows install base. But, despite the small percentage of the overall Windows install base, the systems affected were those running critical operations. Services affected include the following.

Airlines and airports

The outage grounded flights worldwide, causing significant delays and the cancellation of more than 10,000 flights. In the United States, affected airlines included Delta, United and American Airlines. These airlines were forced to cancel hundreds of flights until systems were restored. Globally, multiple airlines and airports were affected, including KLM, Porter Airlines, Toronto Pearson International Airport, Zurich Airport and Amsterdam Schiphol Airport.

Public transit

Public transit in multiple cities was affected, including Chicago, Cincinnati, Minneapolis, New York City and Washington, D.C.

Healthcare

Hospitals and healthcare clinics around the world faced significant disruptions in appointment systems, leading to delays and cancellations. Some states also reported 911 emergency services being affected, including Alaska, Indiana and New Hampshire.

Financial services

Online banking systems and financial institutions around the world were affected by the outage. Multiple payment platforms were directly affected, and there were individuals who did not get their paychecks when expected.

Media and broadcasting

Multiple media and broadcast outlets around the world, including British broadcaster Sky News, were taken off the air by the outage.

Why Apple and Linux were not affected

CrowdStrike's software doesn't just run on Microsoft Windows; it also runs on Apple's macOS and the Linux OS.

But the July outage only affected Microsoft Windows. The root cause of the outage was a faulty sensor configuration update that specifically affected Windows systems. The channel file 291 update was never issued to macOS or Linux systems as the update deals with named pipe execution that only occurs on the Microsoft Windows OS.

The way that the Falcon sensor integrates as a Windows kernel process is also not the same in macOS or Linux. Those OSes have different integration points to limit potential risk. However, there was a reported incident in June from Linux vendor Red Hat, where the Falcon sensor -- running as an eBPF program in Linux -- triggered a kernel panic. In Linux, a kernel panic is a type of crash, though typically not as dramatic as BSOD. That issue was resolved without Red Hat reporting any major incidents.

How long will it take businesses to recover from this outage

CrowdStrike itself was able to identify and deploy a fix for the issue in 79 minutes, but the recovery process for businesses is complex and time-consuming. Among the issues is that, once the problematic update was installed, the underlying Windows OS would trigger a BSOD, rendering the system inoperable through the normal boot process.

Some businesses were able to apply the fix within a few days. However, the process was not straightforward for all, particularly those with extensive IT infrastructure and encrypted drives. The use of the Microsoft Windows BitLocker encryption technology by some organizations made it significantly more time-consuming to recover as BitLocker recovery keys were required.
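For context, the manual workaround published at the time came down to booting an affected machine into Safe Mode or the Windows Recovery Environment and deleting the faulty channel file, whose name matches `C-00000291*.sys` in the sensor's driver directory. On BitLocker-protected machines, the recovery key had to be entered before that directory could be reached at all. The Python sketch below illustrates only the file-matching step. The function name and the dry-run default are illustrative assumptions; in practice the step was carried out by hand or with vendor-supplied scripts, not with this code.

```python
from pathlib import Path

# File-name pattern of the faulty channel file, per the public remediation guidance.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def find_faulty_channel_files(driver_dir, delete=False):
    """Return channel files matching the pattern; delete them only if asked.

    `driver_dir` is the sensor's driver directory (on an affected machine,
    typically C:\\Windows\\System32\\drivers\\CrowdStrike).
    This is a hypothetical helper, not a vendor tool.
    """
    matches = sorted(Path(driver_dir).glob(CHANNEL_FILE_PATTERN))
    if delete:
        for path in matches:
            path.unlink()  # remove the faulty file so the next boot can succeed
    return matches
```

Calling the function without `delete=True` only lists the matching files, which gives an administrator a chance to review them before anything is removed.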

It is estimated that it could potentially take months for some organizations to entirely recover all affected systems from the outage.

Hackers take advantage of outage

While the outage was not due to a cyberattack, threat actors have taken advantage of the incident.

According to a blog post from CrowdStrike, the security vendor has received reports of the following malicious activity:
  • Phishing emails sent to customers posing as CrowdStrike support.
  • Fake phone calls impersonating CrowdStrike staff.
  • Selling scripts claiming to automate recovery from the botched update.
  • Posing as independent researchers saying the outage was due to a cyberattack and offering remediation insights.

The U.S. Cybersecurity and Infrastructure Security Agency (CISA) urges individuals and organizations to only follow instructions from legitimate sources and avoid opening suspicious emails and links.
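
One way to put the advice about legitimate sources into practice is to check links against an allowlist of official domains before anyone follows them. The Python sketch below is a minimal illustration of that idea. The allowlist contents and the function name are assumptions, and each organization would maintain its own list.

```python
from urllib.parse import urlsplit

# Assumed allowlist for illustration; an organization would maintain its own.
TRUSTED_DOMAINS = {"crowdstrike.com", "microsoft.com", "cisa.gov"}


def is_trusted_link(url, trusted=TRUSTED_DOMAINS):
    """Return True only for https links whose host is a trusted domain
    or a subdomain of one."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower().rstrip(".")
    if parts.scheme != "https" or not host:
        return False
    # Exact match or a true subdomain; a lookalike such as
    # "crowdstrike.com.example.net" fails both checks.
    return any(host == d or host.endswith("." + d) for d in trusted)
```

Matching on the full hostname, instead of searching for a brand name anywhere in the URL, is what keeps lookalike domains from passing the check.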

How can businesses be better prepared for tech outages

The CrowdStrike Windows outage highlighted the vulnerabilities of modern society's heavy reliance on technology. While system backups and automated processes are essential, having manual procedures in place can significantly enhance business continuity during tech outages.

Here are a few key takeaways for bolstering your disaster recovery (DR) plans:

  • Practice Regular DR Drills and Update/Review Plans Continuously: Run simulations of possible outage scenarios to test your response strategies and find any weaknesses, and regularly review your DR plans to adjust to new threats.
  • Back Up Essential Data: Regularly back up all crucial data and store it in multiple locations.
  • Have a Failover Plan: Determine your fallback plan to get back to your production environment.

There are a few things businesses can do to be better prepared for tech outages, including the following.

Test all updates before deploying to production

It has been a best practice for years to allow automated updates to ensure systems are always up to date. However, the CrowdStrike issue laid bare the underlying risk with that approach. For mission-critical systems, testing updates before deployment or having some form of staging environment before pushing updates to production might help to mitigate some risk.
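
The staging idea can be sketched as a ring-based rollout, where an update reaches a small test group first and only moves on if that group stays healthy. The Python sketch below is an illustration under stated assumptions: the ring layout, the `deploy` and `is_healthy` callables and the failure threshold are all hypothetical and are not part of any CrowdStrike or Microsoft tooling.

```python
def staged_rollout(rings, deploy, is_healthy, max_failure_rate=0.0):
    """Deploy ring by ring; stop at the first ring whose failure rate is too high.

    rings: ordered list of (ring_name, [hosts]), least critical first.
    deploy(host): applies the update to one host (hypothetical callable).
    is_healthy(host): True if the host still works after the update.
    Returns (completed_ring_names, halted_ring_name_or_None).
    """
    completed = []
    for name, hosts in rings:
        for host in hosts:
            deploy(host)
        failures = sum(1 for host in hosts if not is_healthy(host))
        if hosts and failures / len(hosts) > max_failure_rate:
            # Halt here: later, more critical rings never receive the update.
            return completed, name
        completed.append(name)
    return completed, None
```

With the default threshold of zero, a single failed host in the first ring is enough to stop the update before it reaches production systems.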

Develop and document manual workarounds

Manual workarounds ensure critical business processes can continue even when technology fails. This approach was common before the digital age and, in the event of an outage, can serve as a fallback. Documenting and practicing manual procedures can help mitigate the effect of outages, ensuring businesses can still operate and serve their customers.

Perform disaster recovery and business continuity planning

Outages happen for any number of different reasons. Having extensive disaster recovery and business continuity practices and plans in place is critical. Part of that effort should include the use of redundant systems and infrastructure to minimize downtime and ensure critical functions can switch to backup systems when needed.
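
Switching to backup systems can be sketched as picking the first healthy entry from a priority-ordered list of primary and standby systems. The Python sketch below is a minimal illustration; the system names and the `is_healthy` callable are hypothetical, and a real failover setup would also need to handle data replication and failback.

```python
def pick_active_system(systems, is_healthy):
    """Return the first healthy system in priority order, or None if all are down.

    systems: names ordered from primary to last-resort standby.
    is_healthy(name): hypothetical health check returning True or False.
    """
    for system in systems:
        if is_healthy(system):
            return system
    # Nothing is healthy: this is the point to fall back to manual procedures.
    return None
```

A `None` result is the signal to switch to the documented manual workarounds, which is why those procedures belong in the same plan as the redundant systems.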

Conclusion

In conclusion, the unexpected collision between Microsoft Windows and a CrowdStrike update has given us plenty of fodder for both concern and comedy. While the update’s intent was undoubtedly rooted in robust cybersecurity, the outcome provided an ironic twist that’s hard to ignore. This incident serves as a stark reminder that even the most trusted names in the industry are not immune to mistakes. Let’s hope they learn from this slip-up and come back stronger, turning this unwelcome surprise into a sobering lesson in vigilance and better security practices. In the meantime, we can’t help but appreciate the inadvertent humor in their stumble, a reminder that in the world of cybersecurity, even the giants can trip.
