Using Chaos Theory to Drive App Security—By Design

Here’s a lesson we can learn from Netflix: Agility and lean DevOps development are only the first steps to accelerating speed to market. To be successful, you’ve also got to be reliable, resilient and secure.

Engineering software to be secure, especially in the cloud, is one of the biggest challenges facing government agencies. It’s not that success is so elusive, but rather that modern methods pose particular challenges to the government’s conventional ways of doing business.

Agile processes and delivery (i.e., iterative improvements using a series of continuous, rapid releases) are now widely recognized as a better way to develop software. Agility allows for evolving requirements and unanticipated changes, rather than forcing government customers to lock down all possible requirements ahead of time.

The lesson from Netflix, though, is that agility only takes you so far. In the cloud, systems must also be resilient in the face of outages and surprise service interruptions. Netflix executives realized that if their systems weren’t engineered for resilience from Day 1, it was just a matter of time before disaster struck. The problem was finding a way to test for resilience.

Today’s consumers take using the cloud and streaming video for granted. But just a few years ago, when Netflix was transitioning from its own physical infrastructure to the cloud, company leaders worried that users would flee the service if streaming video proved unreliable. They needed a way to ensure the system was resilient enough that users could still access their content, even if portions of the cloud infrastructure failed.

The solution: “chaos engineering.” By randomly shutting down parts of the infrastructure, Netflix engineers found they could discover weaknesses and engineer solutions before disaster struck. “Since no single component can guarantee 100% uptime (and even the most expensive hardware eventually fails), we have to design a cloud architecture where individual components can fail without affecting the availability of the entire system,” the company wrote on its blog in 2011. “In effect, we have to be stronger than our weakest link.”
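To make the idea concrete, here is a minimal sketch of a chaos experiment, written as a pure simulation rather than anything resembling Netflix's actual tooling. The replica names, the `chaos_experiment` function and the rule that the service stays up while at least one replica is healthy are all assumptions made for illustration.

```python
import random


def service_available(instances):
    """Assumed availability rule: the service is up if any replica is healthy."""
    return any(instances.values())


def run_chaos_round(instances, rng):
    """Terminate one randomly chosen healthy replica and return its name."""
    healthy = [name for name, up in instances.items() if up]
    if not healthy:
        return None
    victim = rng.choice(healthy)
    instances[victim] = False
    return victim


def chaos_experiment(replica_count, rounds, seed=0):
    """Count how many random terminations the simulated service survives."""
    rng = random.Random(seed)  # seeded so a run can be repeated exactly
    # Hypothetical replica names; a real experiment would target live instances.
    instances = {f"replica-{i}": True for i in range(replica_count)}
    survived = 0
    for _ in range(rounds):
        run_chaos_round(instances, rng)
        if not service_available(instances):
            break  # the outage became visible to users
        survived += 1
    return survived


if __name__ == "__main__":
    for replicas in (1, 3, 5):
        print(replicas, "replicas survived", chaos_experiment(replicas, rounds=2), "rounds")
```

In this toy model a single replica survives no terminations at all, while three replicas survive two. That is the weakest-link point in miniature: redundancy has to be designed in before the failure happens.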

Today, using chaos engineering to ensure system resilience is widely practiced by the likes of Amazon, Facebook and Microsoft, among many others. But why stop there?

Why not apply chaos engineering to security, as well? Aaron Rinehart, chief enterprise security architect at UnitedHealth Group, raised this issue in a recent post, writing that “Security Chaos Testing has shifted our thinking about the root cause of many of our notable security incidents and data breaches.” Instead of following a conventional “set it and forget it” security model, he now sees the need to “execute on continuous instrumentation and validation of security capabilities.” Indeed, noting the “usefulness” of Security Chaos Engineering, the influential technology consultancy Thoughtworks recently elevated the concept to the next stage on its Technology Radar.

At General Dynamics Information Technology, we believe that same usefulness can greatly benefit our government customers.

Most government agencies use static application scanning at several stages to test and ensure systems are free of vulnerabilities and secure from cyberattacks. But static scanning tests your application code in isolation from your actual installation. It can’t anticipate the dynamic behavior of the application itself.

This is a critical issue, because as any software security expert knows, security vulnerabilities are often built on top of each other. An attacker trying to penetrate a system will look for a feature that, taken alone, is not a security vulnerability but that, through careful manipulation, may ultimately provide a means to break into the system.

This kind of activity, as well as other environment-related issues, can’t be caught through static testing, because it depends on the dynamic environment in which the system operates.

Because Security Chaos Engineering tests the application in its production environment, it can identify problems that other kinds of testing will miss, such as issues with load balancers, network configurations and communications with other systems.

By testing one hypothesis after another, engineers can probe applications while the system is running. They can mimic attackers operating from the outside by launching attacks that exploit timing delays and seams between systems, weaknesses that would otherwise be almost impossible to identify.
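A hypothesis-driven security experiment of this kind might look something like the sketch below. It is a simulation only: the `Environment` class, the port allow-list and the `monitor` control are hypothetical stand-ins, not a real firewall or monitoring API. The hypothesis under test is that an unauthorized open port will be detected.

```python
from dataclasses import dataclass, field

# Assumed policy for this sketch: only HTTPS should be reachable.
ALLOWED_PORTS = {443}


@dataclass
class Environment:
    """Hypothetical stand-in for the firewall state of a running environment."""
    open_ports: set = field(default_factory=lambda: {443})
    alerts: list = field(default_factory=list)


def monitor(env):
    """Hypothetical detection control: record any port outside the allow-list."""
    for port in sorted(env.open_ports - ALLOWED_PORTS):
        env.alerts.append(port)


def run_experiment(env, injected_port, control=monitor):
    """Inject a misconfiguration and report whether the control noticed it."""
    env.open_ports.add(injected_port)  # inject the failure
    try:
        control(env)  # give the control a chance to observe it
        detected = injected_port in env.alerts
    finally:
        env.open_ports.discard(injected_port)  # always roll back the change
    return detected


if __name__ == "__main__":
    print("working control detected it:", run_experiment(Environment(), 22))
    # A control that silently does nothing disproves the hypothesis.
    print("broken control detected it:",
          run_experiment(Environment(), 22, control=lambda env: None))
```

The rollback in the `finally` clause matters as much as the injection. An experiment run against a live system has to leave it in the state it found it, whether or not the control fired.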

Static testing is ideal for finding known vulnerabilities within the application, and closing those security gaps is essential. But to stay ahead of ever-more-agile attackers, we also need Security Chaos Engineering, which can help us identify vulnerabilities in the environment the application is running in and highlight weaknesses that weren’t known or anticipated. As a result, we can close loopholes before attackers ever discover them.

At GDIT, we embrace the concept of continuous security and believe it is the future of secure software development, working at the speed of DevOps. Leveraging the Chaos Engineering principles pioneered by Netflix for resilience and by Aaron Rinehart for security adds an important building block for end-to-end security and can help ensure that government applications in the cloud will be secure, reliable and nimble enough to adapt to changing needs and threats.