Workshop: Fostering Software Reliability in an Increasingly Hostile World

OOPSLA 2005, San Diego
October 16, 2005

Workshop Final Report

The main issues raised at the workshop fell into two categories:  Economics and Communication.  This is not surprising:  reliability puts a lot of stress on organizational structures, and those stresses often show up as problems with the flow of money and information.




Reliability:  Does it perform correctly?
Availability:  Percentage of time in service.
Robustness:  Combines reliability and availability -- availability in the presence of troubles.
Performance:  Ability of the system to do certain required work in a certain period of time.

Reliability units:  MTBF (mean time between failures)
Availability units:  a function of MTBF and MTTR (mean time to repair)
Performance units:
Some items related to performance
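
To make the MTBF/MTTR relationship concrete, here is a minimal sketch using the standard steady-state formula, availability = MTBF / (MTBF + MTTR).  The function names and the sample figures are illustrative assumptions, not numbers from the workshop.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is in service."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(avail: float, hours_per_year: float = 8760.0) -> float:
    """Expected hours out of service per year at the given availability."""
    return (1.0 - avail) * hours_per_year


# Illustrative numbers only: one failure per 1000 hours, one hour to repair.
a = availability(1000.0, 1.0)
print(f"availability = {a:.5f}")
print(f"downtime per year = {annual_downtime_hours(a):.2f} hours")
```

The same availability can be reached by failing less often (higher MTBF) or by recovering faster (lower MTTR); which one is cheaper is an economics question.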

Two questions to ask when deciding how critical reliability is:

1. Can it kill?
2. Can we be sued?

When Fostering Software Reliability -- which is more important?

There is a relationship between process & reliability
One recommended solution:  Put reliability issues in the hands of the Architect


There must be a deliberate investment in reliability:
This makes reliability work relatively costly -- but no more than safety and security analysis

But there is also an organizational dimension:
How can we defend development managers who want to do the right thing?
Sources of ideas for reliability assessment:
John Musa, Software Reliability Engineering
See also:  Failure Mode and Effects Analysis (FMEA, used in avionics)

Simplicity can have a positive impact on reliability issues
"Fewer components" can lead to higher reliability
But it also helps to have people who do the best possible development work:


Greg Utas, Robust Communications Software
Robert Binder, Testing Object-Oriented Systems
Gary Klein, Sources of Power
Malcolm Gladwell, Blink: The Power of Thinking Without Thinking

Training and education about reliability

Code reading and code reviews (good way to train new staff)
Any kind of "collaborative" effort (design reviews, team design decisions)
Use tools to do static defect checking
(Klocwork, Coverity, CodeInspector, Fortify)
Look for coupling -- it increases the likelihood of reliability problems

Reliability of Services vs. Reliability of Boxes

Service reliability:  this is what end users care about.
Most developers think primarily about box reliability (the reliability of a processor, component, or subsystem)
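
One way to see the gap between box and service reliability is a rough sketch in which the service needs every box in a chain to be up.  The chain, the figures, and the independence assumption below are illustrative only; real box failures are often correlated.

```python
from math import prod


def service_availability(box_availabilities):
    """Availability of a service that needs every box in the chain to be up.

    Assumes box failures are independent, which is often optimistic.
    """
    return prod(box_availabilities)


# Hypothetical chain: load balancer, application server, database, network link.
boxes = [0.999, 0.999, 0.999, 0.999]
print(f"each box: 0.999, service: {service_availability(boxes):.4f}")
```

Under these assumptions the service is always less available than its weakest box, and every box added to the chain lowers the figure further.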

Diversity and Redundancy help (sometimes)
Be careful in the analysis of the reliability of systems with redundancy and diversity.

Use Bayesian statistics -- remember that failures aren't always independent
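
A small sketch of why the independence assumption matters for a redundant pair.  It uses the product rule P(A and B) = P(A) * P(B | A); the probabilities are made-up values chosen only to show the size of the effect.

```python
def pair_failure_probability(p_fail: float, p_fail_given_other_failed: float) -> float:
    """Probability that both units of a redundant pair are down: P(A) * P(B | A)."""
    return p_fail * p_fail_given_other_failed


p = 0.01  # hypothetical probability that a single unit is down

# Independent failures: P(B | A) is simply P(B).
independent = pair_failure_probability(p, p)

# Correlated failures (shared power, shared software defect, ...): P(B | A) is much higher.
correlated = pair_failure_probability(p, 0.5)

print(f"independent: {independent:.6f}")
print(f"correlated:  {correlated:.6f}")
print(f"the independence assumption underestimates by {correlated / independent:.0f}x")
```

Identical software on both units is a common source of such correlation, which is one reason diversity is listed alongside redundancy.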

Be concerned with "cascade failure" -- the first failure increases the probability of further failures
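
A deliberately simplified sketch of a cascade:  identical servers share a fixed load evenly, and a server drops out as soon as its share exceeds its capacity.  The model and the numbers are assumptions for illustration, not a description of any system discussed at the workshop.

```python
def surviving_servers(total_load: float, capacity_per_server: float, servers: int) -> int:
    """Count servers still up after one fails and the rest share the load evenly.

    A server fails as soon as its share of the load exceeds its capacity.
    """
    alive = servers - 1  # the initial, triggering failure
    while alive > 0 and total_load / alive > capacity_per_server:
        alive -= 1  # an overloaded server drops out, raising the load on the others
    return alive


# Hypothetical figures: each server can carry 100 units of load.
print(surviving_servers(total_load=900, capacity_per_server=100, servers=10))
print(surviving_servers(total_load=950, capacity_per_server=100, servers=10))
print(surviving_servers(total_load=950, capacity_per_server=100, servers=12))
```

In this model a single failure is either absorbed or takes everything down; the difference is whether the remaining servers had any margin left.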

When the system is in a "no safety margin" state, how do we get to a safe state?

Reliability and Outsourcing/Offshoring

When you outsource the development of a system, subsystem, or component, how do you make sure you get what you paid for?

One company's approach:  mandate that the offshore organization develop a suite of unit tests that covers all of the functional requirements.
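
A mandate like this can be checked mechanically if each delivered test declares which requirements it exercises.  The sketch below assumes such a mapping exists; the requirement IDs and test names are hypothetical.

```python
def uncovered_requirements(requirements, test_map):
    """Return the requirement IDs that no delivered test claims to cover.

    requirements: iterable of requirement IDs taken from the spec
    test_map: dict mapping test name -> set of requirement IDs it exercises
    """
    covered = set()
    for req_ids in test_map.values():
        covered.update(req_ids)
    return sorted(set(requirements) - covered)


# Hypothetical IDs and test names, for illustration only.
requirements = ["REQ-1", "REQ-2", "REQ-3"]
delivered_tests = {
    "test_login_accepts_valid_user": {"REQ-1"},
    "test_login_rejects_bad_password": {"REQ-1", "REQ-2"},
}
print(uncovered_requirements(requirements, delivered_tests))
```

A check like this only shows that a test claims to cover a requirement, not that the test is a good one, so it complements review of the tests rather than replacing it.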

There are some cultural issues with outsourcing:
To outsource successfully, you must be able to write testable specs.

Important for success:  a face-to-face relationship and the ability to discuss reliability expectations.

Another idea:  Use graphical representations for specs (UML, etc.)  -- this can improve understanding.

Last modified:  Oct. 17, 2005