Chaos Engineering Panel
ICSE 2016
Wednesday May 18, 2016, 4:00-5:30pm
Panel members:
Aaron Blohowiak - Netflix (moderator)
Lorin Hochstein - Netflix
Ian Van Hoven - Yahoo
Heather Nakama - Microsoft
Summary of Chaos Engineering Panel
Chaos Engineering is a set of testing approaches for finding
bugs in the interactions between software components.
Chaos techniques try to automatically exercise some of a
system's failure recovery functionality in testing, with the
goal of improving software reliability.
Chaos tests trigger specific system failures.
Two kinds of failure tests are common:
forcing selected internal service requests to fail
within a microservices-based system, and
adding extra latency into service requests and responses.
It is common for these Chaos tests to be run selectively on a production
system, but they could instead be run on a clone of the production
environment or in traditional system testing.
Chaos tests and experiments are typically run in an automated
build and test environment, usually as part of a
Continuous Integration or Continuous Delivery process.
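To make the two common fault types concrete, here is a minimal Python
sketch of a fault-injection wrapper around a remote service call; the
wrapper, the failure rate, and the latency value are illustrative
assumptions, not any panelist's actual tooling:

    import random
    import time

    # Illustrative knobs; real chaos tools make these configurable per test.
    FAIL_RATE = 0.01        # fraction of calls forced to fail
    ADDED_LATENCY_S = 0.4   # extra latency, in seconds, added to each call

    class InjectedFailure(Exception):
        """Stands in for a real downstream error during a chaos test."""

    def call_with_chaos(service_call, *args, **kwargs):
        """Wrap a remote service call, optionally failing or slowing it."""
        if random.random() < FAIL_RATE:
            raise InjectedFailure("chaos test: forced request failure")
        time.sleep(ADDED_LATENCY_S)          # the "extra latency" fault type
        return service_call(*args, **kwargs)
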
Some key parts of the definition of Chaos Engineering from the
panelists:
- Chaos Engineering is a way of doing experimentation.
We want to understand the behavior of the Netflix system by doing
experimentation directly on the system -- as it is running in
production.
- In Azure Search, we don't run our Chaos Engineering on actual
production services. We run it on services that are in production, but
that don't contain any customer data. We want to observe exactly how
our services will melt down when things go really wrong.
- Chaos Engineering tries to introduce the irregular occurrences more
regularly.
Chaos testing doesn't replace traditional integration testing, security
testing, and other regular testing processes. It supplements the
standard tests.
- We typically just run it [Chaos testing] in the Continuous Delivery
pipeline. As your software is making its way from "git" out to the
world, we will throw a bunch of TCP disconnects and latency at various
paths in your code... and make sure the fail-safes are run.
- In a production environment, there are so many unknowns. Trying to test
everything exhaustively is just a waste of time. Doing it this way
(Chaos) has been far more efficient for us than writing really granular
dedicated integration tests.
- We are focusing on Chaos here, but you are also seeing "canary
deployments", where you push out new code by deploying it to a small
number of servers -- only a fraction of the traffic gets it -- and you
compare to see that it behaves perfectly compared to the old one.
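Following up on the delivery-pipeline quote above, a minimal Python
sketch of the kind of fail-safe check such a pipeline stage exercises
might look like the following; fetch_recommendations, the mock client,
and the cached default list are hypothetical stand-ins, not Netflix's
actual tooling:

    import unittest
    from unittest import mock

    # Hypothetical application code under test: it falls back to a cached
    # default when the downstream service connection is dropped.
    DEFAULT_RECOMMENDATIONS = ["popular-1", "popular-2"]

    def fetch_recommendations(client):
        try:
            return client.get("/recommendations")
        except ConnectionError:
            return DEFAULT_RECOMMENDATIONS   # the fail-safe path under test

    class ChaosPipelineTest(unittest.TestCase):
        def test_fallback_runs_when_dependency_disconnects(self):
            client = mock.Mock()
            # Simulate the TCP disconnect the pipeline would inject.
            client.get.side_effect = ConnectionError("injected disconnect")
            self.assertEqual(fetch_recommendations(client),
                             DEFAULT_RECOMMENDATIONS)

    if __name__ == "__main__":
        unittest.main()
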
The panelists didn't have "hard numbers" on the utility of Chaos testing,
but they did some internal tracking of defects and field failures -- and
they managed to get buy-in from their developers on the value of
this form of testing.
- I said if we found these bugs earlier, wouldn't that be better than
spending 3 hours of 5 people's time at 3:00am while customers are down?
Wouldn't
it be nicer to figure out how to make it only one hour and one person
the next time it fails?
- I showed numerical evidence that this gives better service
quality. I don't have the numbers here to show you, but it has been a
huge tool to work with my team.
There were some good ideas on how to control the "scope" of the testing,
so the tests wouldn't have as much impact on real-world customers:
- One thing we do in Chaos experiments: we control how many users are
subjected to an experiment. Our goal is to make that test group as
small as we can. We are trying to reduce the "blast radius" by
inflicting the minimum amount of harm to customers when we do these
experiments.
- One of the things we do is to inject failures between two services.
In a remote procedure call, we can do either "latency" or
"failure". For example, "If service A calls service B, for users that
match these ids, then fail that request (return an error right away) --
or add 400 milliseconds of latency."
- [We designed a Chaos system that is] "opt in", so if you want to run it
on your system, you install a library and proxy your requests through
it. It has access to every element of that request -- it can terminate
a connection, inject latency, and so on.
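A minimal Python sketch of this kind of targeted, per-user fault
injection follows; the user ids, the fault configuration, and the helper
names are illustrative assumptions rather than the panelists' actual
tooling:

    import time

    # Hypothetical per-request fault injection with a limited "blast radius":
    # only requests from a small, explicitly listed test group are affected.
    CHAOS_USER_IDS = {"user-123", "user-456"}      # the controlled test group
    FAULT = {"type": "latency", "latency_s": 0.4}  # or {"type": "failure"}

    class InjectedFailure(Exception):
        pass

    def maybe_inject_fault(user_id):
        """Apply the configured fault only to users in the test group."""
        if user_id not in CHAOS_USER_IDS:
            return                                 # everyone else is untouched
        if FAULT["type"] == "failure":
            raise InjectedFailure("chaos experiment: failing test user request")
        time.sleep(FAULT["latency_s"])             # add latency for that user

    def call_service_b(user_id, request_fn):
        """Wrap a call from service A to service B, as in the example above."""
        maybe_inject_fault(user_id)
        return request_fn(user_id)

Keeping the test group as small as possible, as the panelists describe,
means a failed resiliency assumption shows up as a few degraded requests
rather than a widespread outage.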
What is needed to adopt Chaos Engineering?
- The first thing you need to do to adopt Chaos Engineering: observe
the behavior of your system.
- It is important to have a business metric for system behavior.
[e.g., the level of customer service, how many videos can be streamed,
etc.]
- Another thing you need: you need to be able to design your system
for resiliency. If the system can't withstand failures, there is no
sense doing Chaos testing. [The system has redundancy, automated
failover, and other resiliency functionality.]
- The last thing you need: monitoring and alerts -- so that when
something bad is happening to your system, people get notified, so you
can minimize the outage.
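Putting these requirements together, a chaos experiment is often wrapped
in guard rails that watch the business metric and alert or abort when it
degrades. The following Python sketch assumes hypothetical start_fault,
stop_fault, read_business_metric, and alert hooks supplied by the
surrounding system; it is a sketch of the idea, not a specific tool:

    import time

    # Illustrative thresholds; a real experiment would tune these per metric.
    METRIC_THRESHOLD = 0.99        # e.g. fraction of successful video starts
    CHECK_INTERVAL_S = 30
    EXPERIMENT_DURATION_S = 600

    def run_experiment(start_fault, stop_fault, read_business_metric, alert):
        """Run one chaos experiment, guarded by the business metric."""
        start_fault()
        try:
            deadline = time.time() + EXPERIMENT_DURATION_S
            while time.time() < deadline:
                metric = read_business_metric()
                if metric < METRIC_THRESHOLD:
                    alert("aborting chaos experiment: metric fell to %.3f"
                          % metric)
                    return False   # the resiliency assumption did not hold
                time.sleep(CHECK_INTERVAL_S)
            return True            # the system withstood the injected failure
        finally:
            stop_fault()           # always remove the fault, even on abort
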
Can we use Chaos Engineering in life-critical systems? We probably
cannot do Chaos testing in a production environment for telecom,
air traffic control, heart monitors, and so on.
But in most non-critical systems that are using an agile development
approach and DevOps techniques, Chaos techniques can help improve
performance and reliability in the field.
- Most of the companies that I have worked at, the users want to see
new features and functionality, and they are a little bit more
forgiving about periodic downtime -- than they would be for a hospital
system or an electric company.
- We want to make availability just good enough -- so
availability is not one of the top 10 things people at Netflix complain
about.
- [A case study from Fidelity Investments...] Their system used three
mainframes -- one live and two standbys. They loaded one of the standbys
with shadow traffic: real user data that had been anonymized. So although
they didn't do true "production", they got as close as they could in
order to see if their system could handle an increased number of trades.
It's not a safety-critical system, but it
is a system where the cost of bad things happening is very high. And
they used a lot of effort to make their testing as close to production
as they could.
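A rough Python sketch of that kind of shadow-traffic replay is shown
below; the log format, the standby URL, and the replay function are
assumptions made for illustration, not details from the Fidelity case
study:

    import json
    import urllib.request

    # Assumed setup: anonymized production requests have been written to a
    # log file, one JSON record per line, and a standby system is reachable
    # at STANDBY_URL.
    STANDBY_URL = "http://standby.example.internal"

    def replay_shadow_traffic(log_path):
        """Replay recorded, anonymized traffic against the standby system."""
        errors = 0
        with open(log_path) as log:
            for line in log:
                record = json.loads(line)  # e.g. {"path": "/trade", "body": {}}
                data = json.dumps(record["body"]).encode()
                request = urllib.request.Request(
                    STANDBY_URL + record["path"], data=data,
                    headers={"Content-Type": "application/json"})
                try:
                    urllib.request.urlopen(request, timeout=5)
                except Exception:
                    errors += 1            # standby failures are the signal
        return errors
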
How much work is involved in setting up Chaos tests and experiments?
It depends on the environment and the application infrastructure.
Some folks create special libraries with testing hooks; other people
set up special proxy servers to add in selected errors and latency.
The most important thing is to have adequate tools for monitoring
system behavior without being overwhelmed by error messages.
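As one example of the proxy approach mentioned above, the following
Python sketch forwards requests to an upstream service while adding
latency and occasionally returning an error; the upstream address,
failure rate, and latency are illustrative assumptions, not any specific
team's configuration:

    import random
    import time
    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer

    UPSTREAM = "http://localhost:8080"   # assumed address of the real service
    FAIL_RATE = 0.05          # fraction of requests answered with an error
    ADDED_LATENCY_S = 0.4     # extra latency added to every forwarded request

    class ChaosProxyHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            time.sleep(ADDED_LATENCY_S)                    # inject latency
            if random.random() < FAIL_RATE:
                self.send_error(503, "injected failure")   # inject an error
                return
            # Otherwise relay the real upstream response unchanged.
            with urllib.request.urlopen(UPSTREAM + self.path) as upstream:
                body = upstream.read()
                self.send_response(upstream.status)
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8081), ChaosProxyHandler).serve_forever()
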
For more information on Chaos Engineering, visit the
Principles of Chaos website:
principlesofchaos.org.
Last modified: May 26, 2016