Chaos Engineering Panel

ICSE 2016
Wednesday May 18, 2016, 4:00-5:30pm

Panel members:
Aaron Blohowiak - Netflix (moderator)
Lorin Hochstein - Netflix
Ian Van Hoven - Yahoo
Heather Nakama - Microsoft

Summary of Chaos Engineering Panel

Chaos Engineering is a set of testing approaches for finding bugs in the interactions between software components. Chaos techniques automatically exercise some of a system's failure recovery functionality during testing, with the goal of improving software reliability.

Chaos tests trigger specific system failures. Two kinds of failure tests are common: forcing selected internal service requests to fail within a microservices-based system, and adding extra latency to service requests and responses. These Chaos tests are commonly run selectively on a production system, but they could instead be run on a clone of the production environment or in traditional system testing.
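The two kinds of failure tests described above can be illustrated with a minimal sketch. It is not the panelists' tooling: the names ChaosWrapper, InjectedFailure, and fetch_with_fallback are hypothetical, and the wrapped callable stands in for an internal service request. The wrapper adds a fixed delay and fails a configurable fraction of calls, so the caller's recovery path is actually exercised.

```python
import random
import time


class InjectedFailure(Exception):
    """Raised in place of a real service error when a fault is injected."""


class ChaosWrapper:
    """Wraps a callable that stands in for an internal service request.

    Hypothetical helper for illustration only.
    """

    def __init__(self, call, failure_rate=0.0, extra_latency_s=0.0,
                 rng=None, sleep=time.sleep):
        self.call = call
        self.failure_rate = failure_rate        # fraction of calls forced to fail
        self.extra_latency_s = extra_latency_s  # delay added before every call
        self.rng = rng if rng is not None else random.Random()
        self.sleep = sleep                      # injectable so tests need not wait

    def __call__(self, *args, **kwargs):
        if self.extra_latency_s > 0:
            self.sleep(self.extra_latency_s)
        # random() returns a value in [0.0, 1.0), so a rate of 1.0 always fails
        # and a rate of 0.0 never fails.
        if self.rng.random() < self.failure_rate:
            raise InjectedFailure("injected failure")
        return self.call(*args, **kwargs)


def fetch_with_fallback(service, key, default):
    """Caller-side recovery logic that the chaos test is meant to exercise."""
    try:
        return service(key)
    except InjectedFailure:
        return default
```

Setting failure_rate to a small value and wrapping only selected calls is one way to keep such an experiment selective. A real system would more likely catch timeouts or connection errors than a dedicated exception type.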

Chaos tests and experiments are typically run in an automated build and test environment, usually as part of a Continuous Integration or Continuous Delivery process.

Some key parts of the definition of Chaos Engineering from the panelists:

Chaos testing doesn't replace traditional integration testing, security testing, and other regular testing processes. It supplements the standard tests.

The panelists didn't have "hard numbers" on the utility of Chaos testing, but they did some internal tracking of defects and field failures -- and they managed to get buy-in from their developers on the value of this form of testing.

There were some good ideas on how to control the "scope" of the testing, so that it wouldn't have as much impact on real-world customers:

What is needed to adopt Chaos Engineering?

Can we use Chaos Engineering in life-critical systems? We probably cannot do Chaos testing in a production environment for telecom, air traffic control, heart monitors, and so on. But in most non-critical systems that are using an agile development approach and DevOps techniques, Chaos techniques can help improve performance and reliability in the field.

How much work is involved in setting up Chaos tests and experiments? It depends on the environment and the application infrastructure. Some teams create special libraries with testing hooks; others set up special proxy servers to inject selected errors and latency. The most important thing is to have adequate tools for monitoring system behavior without being overwhelmed by error messages.
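As a rough illustration of monitoring without being overwhelmed by error messages, the sketch below collapses a stream of error events into a short ranked summary. The function summarize_errors, the event format, and the service names are all assumptions made for this example; the panel did not describe any specific monitoring tool.

```python
from collections import Counter


def summarize_errors(events, top_n=3):
    """Collapse (service, error_kind) events into the most frequent pairs."""
    counts = Counter(events)
    return counts.most_common(top_n)


# Hypothetical events collected while a chaos experiment was running.
events = [
    ("checkout", "timeout"),
    ("checkout", "timeout"),
    ("search", "connection_refused"),
    ("checkout", "timeout"),
    ("search", "connection_refused"),
    ("profile", "timeout"),
]

summary = summarize_errors(events, top_n=2)
for (service, kind), count in summary:
    print(f"{service}: {kind} x{count}")
```

Grouping by service and error kind keeps the report short even when a single injected fault produces many downstream error messages.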

For more information on Chaos Engineering, visit the Principles of Chaos website: principlesofchaos.org.


Last modified: May 26, 2016