TURNING DISASTERS INTO LESSONS:

LEARNING FROM THE CROWDSTRIKE OUTAGE

Tony Lau, Client Partner | Fred Turner, VP Technical Operations, CISO | Edward Philip, Technical Director

Published: July 26, 2024 in Blog

The recent CrowdStrike outage sent ripples across global businesses, serving as a powerful reminder of the vulnerabilities inherent in even the most robust systems. This outage underscores the importance of not only building redundancy into systems but also ensuring that this redundancy encompasses servers, networks, as-a-service solutions and diverse security technology stacks. By doing so, organizations can mitigate the risks associated with single points of failure, reduce their overall risk profile and enhance their security posture.

The CrowdStrike Outage: A Critical Learning Opportunity

CrowdStrike, known for its advanced threat detection and response capabilities, experienced an unexpected outage that affected numerous businesses relying on its services. While the cause of the outage remains under investigation, it has highlighted a significant issue: the dangers of over-reliance on a single security provider. Even the most reliable systems can fail, and when they do, the impact can be widespread and severe.

Furthermore, as organizations transition from legacy, custom-built applications to Software as a Service (SaaS) solutions or adopt Managed Services, Platform as a Service (PaaS), and Infrastructure as a Service (IaaS) models, they often overlook the extent of their dependence on third-party vendors. This shift can lead to an incomplete understanding of their Application Risk Profile, leaving them vulnerable to unforeseen issues.
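
One way to make that dependence visible is to inventory which third-party vendors each application relies on and flag any vendor that a large share of the portfolio has in common. The sketch below is a minimal illustration only: the application names, vendor names and the 50% threshold are hypothetical, and a real inventory would come from an APM tool rather than a hard-coded dictionary.

```python
from collections import defaultdict

# Hypothetical inventory: application -> third-party vendors it depends on.
APP_VENDORS = {
    "billing": ["VendorA", "VendorB"],
    "crm": ["VendorA"],
    "hr-portal": ["VendorA", "VendorC"],
    "analytics": ["VendorB"],
}


def vendor_concentration(app_vendors):
    """Map each vendor to the sorted list of applications that depend on it."""
    dependents = defaultdict(list)
    for app, vendors in app_vendors.items():
        for vendor in set(vendors):  # ignore duplicate entries per application
            dependents[vendor].append(app)
    return {vendor: sorted(apps) for vendor, apps in dependents.items()}


def concentrated_vendors(app_vendors, threshold=0.5):
    """Return vendors that more than `threshold` of all applications rely on."""
    total = len(app_vendors)
    if total == 0:
        return []
    by_vendor = vendor_concentration(app_vendors)
    return sorted(
        vendor for vendor, apps in by_vendor.items() if len(apps) / total > threshold
    )


if __name__ == "__main__":
    for vendor in concentrated_vendors(APP_VENDORS):
        print(vendor, "is shared by", vendor_concentration(APP_VENDORS)[vendor])
```

A vendor that appears on this list is not necessarily a problem, but it is a dependency whose failure would touch most of the portfolio at once and so deserves an explicit entry in the risk profile.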

CIOs and IT Executives must manage a large and ever-growing number of applications as part of their portfolio. Application Portfolio Management is more critical now than ever.

In today’s fast-paced digital landscape, managing application risk is a top priority for organizations. Effective tools for Application Portfolio Management (APM) and Application Lifecycle Management (ALM) play a crucial role in this process, providing comprehensive solutions to identify, assess, and mitigate risks throughout the application lifecycle. Here’s how these tools help organizations manage their application risk profile:

Centralized Risk Data

APM and ALM tools centralize risk data, eliminating the need for scattered spreadsheets and siloed information. This centralized approach ensures that all relevant stakeholders have access to up-to-date risk information, enhancing visibility and collaboration across the organization.

Streamlined Risk Assessments

These tools automate risk assessment processes, making it easier to model the probability and evaluate the impact of risks. Automated workflows and predefined surveys help gather necessary data efficiently, reducing the time and effort required for risk assessments and ensuring that the risk register is regularly updated.
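
As a rough illustration of what modeling probability and impact can look like, the sketch below scores each entry in a risk register as probability multiplied by impact and sorts the register by that score. The 1 to 5 scales, the rating thresholds and the class and function names are assumptions made for this example; they are not taken from any particular APM or ALM product.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    application: str
    description: str
    probability: int  # hypothetical scale: 1 (rare) to 5 (almost certain)
    impact: int       # hypothetical scale: 1 (negligible) to 5 (severe)

    @property
    def score(self) -> int:
        """Simple probability x impact score, ranging from 1 to 25."""
        return self.probability * self.impact


def rating(score: int) -> str:
    """Bucket a score into a rating; the cut-offs are illustrative only."""
    if score >= 15:
        return "high"
    if score >= 8:
        return "medium"
    return "low"


def prioritized(register):
    """Return the register ordered from highest to lowest score."""
    return sorted(register, key=lambda risk: risk.score, reverse=True)


if __name__ == "__main__":
    register = [
        Risk("crm", "Single SaaS provider, no fallback", 2, 5),
        Risk("billing", "Endpoint agent auto-updates in production", 4, 5),
        Risk("intranet", "Unpatched plugin", 3, 2),
    ]
    for risk in prioritized(register):
        print(risk.application, risk.score, rating(risk.score))
```

Keeping the scoring rule this simple makes it easy to explain to risk owners; organizations that need finer distinctions can swap in their own scales without changing the overall shape.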

Enhanced Risk Visibility and Reporting

APM and ALM tools come with robust reporting and visualization capabilities. Pre-populated reports and dashboards provide clear insights into risk levels, control effectiveness, and mitigation plans. These tools enable IT leaders and risk managers to quickly identify areas of concern and prioritize risk mitigation efforts effectively.

Proactive Risk Management

By continuously monitoring applications and their associated risks, APM and ALM tools support proactive risk management. Alerts and notifications are sent to relevant stakeholders when there are changes in risk levels or control effectiveness, ensuring timely responses to emerging threats.

Improved Compliance and Control

APM and ALM tools help organizations align with regulatory requirements and internal control frameworks. They provide a structured approach to documenting and assessing controls, identifying gaps, and ensuring that compliance efforts are consistent and effective across the application portfolio.

Facilitated Collaboration

These tools foster collaboration between IT, security, and compliance teams by providing a unified view of risk data. Surveys and feedback mechanisms enable engagement with risk owners and stakeholders, ensuring that everyone is aligned on risk priorities and mitigation strategies.

Metrics and KPIs Tracking

APM and ALM tools track various metrics and key performance indicators (KPIs) related to application risk. This includes the number of applications at risk, control implementation trends, and the status of risk remediation efforts. These insights help organizations measure the effectiveness of their risk management initiatives and make informed decisions.

System Redundancy should be a primary imperative when making IT decisions.

Redundancy is a foundational principle in system design, aimed at ensuring availability despite failures. However, redundancy must extend beyond just having additional application components, backup servers or alternate network paths. It should include different security technology stacks to prevent a single point of failure in the security infrastructure. Adopting this principle provides other benefits as well:

  1. Uninterrupted Service Availability: Redundant servers and networks ensure that if one server or network path fails, another can take over without any noticeable disruption to services. This is crucial for maintaining operational continuity and ultimately customer trust.
  2. Load Balancing and Performance Optimization: Redundant systems can distribute workloads more evenly, optimizing performance and preventing bottlenecks. This is particularly important during peak usage times or unexpected traffic spikes.
  3. Disaster Recovery and Business Continuity: In the event of a catastrophic failure or cyberattack, having redundant servers and networks allows for faster recovery and minimizes downtime. This is essential for maintaining business operations and reducing financial losses.
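
To put a rough number on the availability benefit, the sketch below uses the standard formula for components running in parallel: if each replica is available a fraction a of the time and failures are independent, n replicas are all down together only (1 - a)^n of the time. The function names are hypothetical, and the independence assumption is the important caveat. Replicas that all run the same security agent or take the same update can fail together, which is exactly the situation that diverse stacks are meant to avoid.

```python
HOURS_PER_YEAR = 365 * 24


def combined_availability(component_availability: float, replicas: int) -> float:
    """Availability of `replicas` parallel components, assuming independent failures."""
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - component_availability) ** replicas


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per (non-leap) year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for n in (1, 2, 3):
        a = combined_availability(0.99, n)
        print(f"{n} replica(s): {a:.6f} available, "
              f"about {annual_downtime_hours(a):.2f} hours down per year")
```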

The Role of Diverse Security Technology Stacks

In addition to redundant servers and networks, employing diverse security stacks is vital for comprehensive protection. The benefits include:

  1. Comprehensive Threat Coverage: Different security solutions offer various strengths and capabilities. By leveraging a mix of security stacks, organizations can cover a broader spectrum of threats and vulnerabilities, reducing the likelihood of a successful attack.
  2. Resilience Against Exploits: A diverse security approach ensures that if one solution fails to detect or mitigate an exploit, another may catch it. This layered defense mechanism enhances overall security.
  3. Avoiding Single Points of Failure: A single security provider can create a critical single point of vulnerability. Diversifying the security stack mitigates this risk, ensuring that the failure of one solution does not compromise the entire system.

Implementing Reliable Systems with Diverse Security Stacks

To build a resilient infrastructure, organizations should consider the following key steps:

  1. Conduct an Infrastructure Assessment: Evaluate the current infrastructure to identify critical components and potential points of failure. This includes servers, networks, as-a-service offerings and security technology solutions.
  2. Strategic Redundancy Planning: Develop a redundancy plan that includes duplicate servers and network paths in addition to different security technology stacks. Ensure that these redundancies are strategically placed to maximize coverage and minimize risk.
  3. Vendor Diversity: Engage a mix of security vendors, each known for their unique strengths. This could include a combination of established providers and innovative newcomers, ensuring a balance of reliability and cutting-edge technology.
  4. Integration and Interoperability: Ensure the different systems and security solutions can work together seamlessly. Conduct rigorous testing to identify any integration issues or coverage gaps.
  5. Controlled Reliability Testing and Certification: Create scenarios and test the environment to exercise potential areas of degradation. The goal is to determine whether, in an imperfect state, your infrastructure, services, and applications continue to maintain the minimum acceptable level of service that your customers demand. Modeling how your system performs in a degraded state provides feedback and highlights investment opportunities in the non-functional aspects of your system.
  6. Ongoing Monitoring and Improvement: Regularly monitor the performance of the redundant systems and security stacks. Stay informed about emerging threats and continuously update and improve the infrastructure to address new challenges.
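
The controlled reliability testing step above can be sketched in miniature as follows. The example models a service that tries its backends in order, then takes each backend down in turn and records which single failures leave the service unable to respond. Every class and function name here is hypothetical, and a real exercise would inject faults into actual infrastructure in a controlled environment rather than into in-memory objects.

```python
class BackendDown(Exception):
    """Raised when a backend (or every backend) cannot serve a request."""


class Backend:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise BackendDown(self.name)
        return f"{self.name} handled {request}"


class FailoverService:
    """Tries each backend in order and returns the first successful response."""

    def __init__(self, backends):
        self.backends = list(backends)

    def handle(self, request):
        for backend in self.backends:
            try:
                return backend.handle(request)
            except BackendDown:
                continue  # degrade gracefully: move on to the next backend
        raise BackendDown("all backends are down")


def single_points_of_failure(make_backends, request="ping"):
    """Fail each backend in turn; return the names whose loss breaks the service."""
    breaking = []
    for name in [backend.name for backend in make_backends()]:
        backends = make_backends()  # fresh, fully healthy set for each scenario
        for backend in backends:
            if backend.name == name:
                backend.healthy = False
        try:
            FailoverService(backends).handle(request)
        except BackendDown:
            breaking.append(name)
    return breaking


if __name__ == "__main__":
    redundant = lambda: [Backend("primary"), Backend("secondary")]
    print(single_points_of_failure(redundant))
```

An empty result means no single backend failure takes the service down in this model. Any name that does appear is a candidate for the redundancy plan.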

Wrapping it up:

The CrowdStrike outage provides a crucial lesson in the importance of prioritizing Application Portfolio Management activities, as well as ensuring systems are designed to be resilient. By implementing redundant servers and networks with diverse security technology stacks, organizations can significantly enhance their operational continuity and security posture. And by undertaking APM and ALM activities, risks can be identified and monitored throughout the lifecycle of every application in an organization’s IT portfolio. In an era where cyber threats are increasingly sophisticated and pervasive, redundancy, diversity, and application risk profile management are not just best practices—they are essential components of a robust defense strategy.

Want to ensure your organization’s IT infrastructure is resilient? Reach out to us for a rapid assessment.