topicnews · September 11, 2024

Lessons from CrowdStrike, two months after the disaster

Companies and their customers expect IT services to run continuously. Although no system is completely error-free, failures or downtime should be measured in seconds or, at most, minutes.

An outage lasting several days or weeks is almost unheard of, and downtime lasting more than a week is not only unacceptable, but could bankrupt even the largest companies.

The recent CrowdStrike outage is a perfect example of this: Delta Air Lines not only had to cancel around 7,000 flights within five days, but is also facing an investigation by the US Department of Transportation because of the disruptions.

It is estimated that the airline’s losses amount to around $500 million, not including the costs of regulatory and legal action that the company will have to deal with as a direct result of the outage. Delta was not the only company affected. Banks and hospitals also had to deal with the consequences of what some say was the world’s largest IT outage.

According to Microsoft, 8.5 million Windows computers worldwide crashed due to a bug in a CrowdStrike update, and it took the company 10 days to fully fix the problem. No wonder the security software company is facing multiple lawsuits, one of which was brought by its own shareholders, who accused CrowdStrike of making “false and misleading” claims about its software testing.

Delta CEO Ed Bastian has publicly accused CrowdStrike and Microsoft of failing to provide “exceptional service.” Both technology companies responded by declaring that they will defend themselves “aggressively” and “vigorously” if further legal action is taken. Microsoft has tried to shift responsibility to Delta Air Lines, saying its preliminary review suggests that Delta, unlike its competitors, appears to have failed to modernize its IT infrastructure.

Microsoft should stay in its lane

When we use cloud services, we trust those providers to test thoroughly before making changes to their infrastructure. If they don’t, a CrowdStrike scenario inevitably occurs. Microsoft trusted CrowdStrike so much that it accepted updates pushed by CrowdStrike directly into its Azure production infrastructure. Even though CrowdStrike was responsible for the bug, Microsoft should have had processes in place to deploy updates to “canary servers” first, a small set of machines that receive a change and are monitored before it is allowed into production.

The same should be true for any IT service. If you outsource critical services to external vendors, you are exposed to the quality of their processes. If you choose to do everything in-house, you retain control over each phase of the rollout to production. Of course, many organizations that kept things in-house suffered anyway, because they did no “canary server” testing of their own.
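To make the “canary server” idea concrete, here is a minimal sketch of a staged rollout in Python. It is an illustration only, not a description of how Microsoft or CrowdStrike deploy updates: the `Host` class, the `staged_rollout` function, the version strings and the simulated health check are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Host:
    """A hypothetical server in the fleet (illustrative only)."""
    name: str
    version: str = "1.0"
    healthy: bool = True


def staged_rollout(
    hosts: List[Host],
    new_version: str,
    apply_update: Callable[[Host, str], None],
    is_healthy: Callable[[Host], bool],
    canary_count: int = 1,
) -> bool:
    """Update a few canary hosts first; touch the rest only if the canaries stay healthy.

    Returns True if the whole fleet was updated, False if the canaries
    failed and were rolled back to their previous versions.
    """
    canaries, rest = hosts[:canary_count], hosts[canary_count:]
    previous = {h.name: h.version for h in canaries}

    for host in canaries:
        apply_update(host, new_version)

    if not all(is_healthy(h) for h in canaries):
        # Roll back only the canaries; the rest of the fleet was never touched.
        for host in canaries:
            apply_update(host, previous[host.name])
        return False

    for host in rest:
        apply_update(host, new_version)
    return True


# Simulated environment: an assumption standing in for real deployment tooling.
BAD_VERSIONS = {"2.0-bad"}


def apply_update(host: Host, version: str) -> None:
    host.version = version
    host.healthy = version not in BAD_VERSIONS


def is_healthy(host: Host) -> bool:
    return host.healthy


fleet = [Host(f"server-{i}") for i in range(5)]
rolled_out = staged_rollout(fleet, "2.0-bad", apply_update, is_healthy)
print("rolled out:", rolled_out, [h.version for h in fleet])
```

In this sketch a faulty update reaches only one server before it is stopped, which is the whole point of canary testing: the blast radius of a bad release is the canary group, not the entire production fleet.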

While Microsoft likes to blame CrowdStrike, the reality is that the software giant has integrated Office 365 into every type of business functionality imaginable, including mission-critical and customer-facing operations like billing services and call centers. A situation like the CrowdStrike outage shows how short-sighted it can be for companies that need more specialized and reliable solutions to rely entirely on Microsoft products.

The author, MIP Holdings CEO Richard Firth

For years, companies have increasingly believed Microsoft’s PR that the software giant can provide them with everything they need. But that has led to companies putting all their eggs in one basket. Not only does this increase the risk of something going wrong, it also makes problems harder to solve when the fix depends on software developers in a different time zone who may not understand the urgency or magnitude of an outage.

There is no doubt that Microsoft excels in certain areas, but there is a reason for the existence of software companies like MIP, and that reason is the ability to design and develop solutions tailored to the specific needs of organizations. Using specialized solutions not only ensures that companies can provide uninterrupted service to their customers, but also that security and other risks are minimized.

Skills matter

Unfortunately, Microsoft’s success is partly due to the fact that there are few software development companies with the skills and capacity to deliver specialized solutions to companies like Delta Air Lines. In some cases, the lack of entrepreneurial skills in building IT platforms only manifests itself in the ubiquity of out-of-the-box solutions that require high investments to make them work properly. In other cases, however, this lack leads to difficulties in business processes, which directly impacts the performance of companies.

If more people had the development skills needed to create bespoke solutions, and the ability to integrate them effectively into common programs like those from Microsoft, companies would have access to a wider variety of tools. This would not only give companies facing technical challenges better options, but would also ensure that the technologies they use are selected in a way that minimizes risk.

For example, a microservices architecture can limit the impact of an outage like CrowdStrike’s within an affected organization, allowing the company to continue operating while the problem is remediated. Microservices also address Microsoft’s complaint that Delta Air Lines had failed to modernize its IT environment, because they allow services to be organized by business capabilities rather than infrastructure.

If the CrowdStrike outage proved anything, it’s that software development skills are more important than ever. In today’s technology-driven world, everyone should have a programming or software development background, if only to understand CrowdStrike’s explanation of what caused the outage and how the company plans to ensure such a scenario never happens again.

Perhaps the most important lesson here is: you can’t just outsource everything and expect everything to go perfectly. Ultimately, you remain responsible for your business operations, and when you choose to trust someone else to do something for you, you may be outsourcing some of the work, but not really the responsibility. You should still be careful. And if you take the risk of outsourcing, don’t cry when the risk materializes.
