Workshop: Fostering Software Reliability in an Increasingly Hostile World
OOPSLA 2005, San Diego
October 16, 2005
Workshop Final Report
The main issues raised at the workshop fell into two categories:
Economics and Communication. This is not surprising: reliability puts
a lot of stress on organizational structures, and these stresses often
show up as problems with the flow of money and information.
Economics
- decide what to invest in reliability analysis and modeling
- design in the right amount of Diversity and Redundancy
- do correct analysis of reliability of Services
- extra communications work (and cost) for Outsourcing
Communication
- communication faces cultural issues
- need to communicate context of system to all
- need good written specs, and need to supplement with face-to-face
meetings
- training via code reviews for new staff
- give reliability issues to Architects, and give Architects real
authority
Definitions
Reliability: Does it perform correctly?
Availability: Percentage of time in service.
Robustness: Combines reliability and availability -- availability in
the presence of troubles.
Performance: Ability of the system to do certain required work in a
certain period of time.
Reliability units: MTBF (mean time between failures)
Availability: It's a function of MTBF and MTTR (mean time to repair)
Performance units:
- requests per second
- response time
- number of concurrent users
Some items related to performance
- Scalability - performance can grow
- Capacity - a measure of performance per box
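The definitions above say availability is a function of MTBF and MTTR but do not spell the function out. A minimal sketch, assuming the usual steady-state formula availability = MTBF / (MTBF + MTTR), with MTBF taken as the mean up time between failures. The function name and the sample numbers are illustrative, not from the workshop.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is in service.

    Assumes mtbf_hours is the mean up time between failures and
    mttr_hours is the mean time needed to restore service.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers only: one failure per 1000 hours, one hour to repair.
a = availability(1000.0, 1.0)
print(f"availability = {a:.5f}")
```

The formula shows why repair time matters as much as failure rate: halving MTTR removes about as much downtime as doubling MTBF.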
Two Questions to address critical reliability:
1. Can it kill?
2. Can we be sued?
When Fostering Software Reliability -- which is more important?
- process
- experience of developers
There is a relationship between process & reliability
- a good process needs to expose information that needs to be shared
- real risks, potential failures
- some processes don't lead to more reliable software
One recommended solution: Put reliability issues in the hands of the
Architect
- not an outside expert, a real member of the team
- with real authority to make design decisions
- avoid problems with multiple managers having competing control
Economics
There must be a deliberate investment in reliability:
- it's a cost-benefit tradeoff
- lots of effort in risk analysis
- lots of effort in unit tests and (especially) acceptance tests
- need to define testable reliability characteristics in the
requirements
This makes reliability work relatively costly -- but no more than
safety and security analysis
But there is also an organizational dimension:
- reliability might not be a high priority for the development manager
- even if the executive managers say "five nines", the development
manager might still have to push on schedule and resources
- who wants to extend the development schedule to add range checks?
(or other reliability functionality)
How can we defend development managers who want to do the right thing?
- need to talk to managers about risks, and quantify those risks
- so they can go back to executives and say "here is the cost of
bad reliability"
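One way to put a number on "the cost of bad reliability" is to convert an availability target into hours of downtime per year and multiply by an hourly cost. The sketch below assumes a 365-day year; the function names and the cost-per-hour figure are hypothetical placeholders, not figures from the workshop.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Expected hours out of service per year at a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


def annual_downtime_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime, given an assumed cost per hour."""
    return annual_downtime_hours(availability) * cost_per_hour


COST_PER_HOUR = 10_000.0  # hypothetical figure, for illustration only

for label, a in [("three nines", 0.999), ("five nines", 0.99999)]:
    hours = annual_downtime_hours(a)
    cost = annual_downtime_cost(a, COST_PER_HOUR)
    print(f"{label}: {hours * 60:.1f} minutes/year, about {cost:,.0f} per year")
```

With these assumptions, "five nines" allows roughly five minutes of downtime per year, while "three nines" allows almost nine hours. The gap between the two cost figures is the amount a development manager can weigh against extra schedule and resources.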
Sources of ideas for reliability assessment:
John Musa, Software Reliability Engineering
see also http://members.aol.com/JohnDMusa/ARTweb.htm
also Failure Mode and Effects Analysis (FMEA, used in avionics)
Simplicity can have a positive impact on reliability issues
"Fewer components" can lead to higher reliability
- "As simple as possible, but no simpler."
But it also helps to have people who do the best possible development
work:
- Attention to Detail -- more important than a specific process
Books
Greg Utas, Robust Communications Software
Robert Binder, Testing Object-Oriented Systems
Gary Klein, Sources of Power
Malcolm Gladwell, Blink: The Power of Thinking Without Thinking
Training and education about reliability
Code reading and code reviews (good way to train new staff)
Any kind of "collaborative" efforts (design reviews, team design
decisions)
Use tools to do static defect checking
(Klocwork, Coverity, CodeInspector, Fortify)
Look for coupling -- it increases the likelihood of reliability problems
Reliability of Services vs. Reliability of Boxes
Service reliability: this is what end users care about.
Most developers think mostly about box reliability (reliability of
processor, component, subsystem)
Diversity and Redundancy help (sometimes)
- Redundancy -- multiple copies of a processor, object, etc. (can
be used for failover)
- Diversity -- different components that can provide the same
service
Be careful in the analysis of the reliability of systems with
redundancy and diversity.
Use Bayesian statistics -- remember that failures aren't always
independent
Be concerned with "cascade failure": the first failure increases the
probability of more failures
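The warning about independence can be made concrete with conditional probability: the chance that both units of a redundant pair fail is P(A fails) x P(B fails given A failed), which equals P(A) x P(B) only when the failures are independent. The sketch below is a plain conditional-probability illustration, not a full Bayesian analysis; the function names and all probabilities are hypothetical.

```python
def _check_probability(p: float) -> None:
    if not 0.0 <= p <= 1.0:
        raise ValueError("probabilities must be between 0 and 1")


def both_fail_independent(p_a: float, p_b: float) -> float:
    """P(both fail) if the two failures are assumed independent."""
    _check_probability(p_a)
    _check_probability(p_b)
    return p_a * p_b


def both_fail_dependent(p_a: float, p_b_given_a: float) -> float:
    """P(both fail) when A's failure changes B's failure probability."""
    _check_probability(p_a)
    _check_probability(p_b_given_a)
    return p_a * p_b_given_a


# Hypothetical numbers: each unit fails 1% of the time on its own, but
# once A has failed (shared power, shifted load), B fails 20% of the time.
naive = both_fail_independent(0.01, 0.01)
realistic = both_fail_dependent(0.01, 0.20)
print(f"assuming independence: {naive:.6f}")
print(f"with cascade effect:   {realistic:.6f}")
print(f"underestimate factor:  {realistic / naive:.1f}")
```

Under these made-up numbers, the independence assumption understates the chance of losing both units by a factor of about twenty, which is the kind of error the cascade-failure warning is about.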
When the system is in a "no safety margin" state, how do we get to a
safe state?
Reliability and Outsourcing/Offshoring
When you outsource the development of a system, subsystem, or
component, how do you make sure you get what you paid for?
One company has done this: Mandate that the offshore organization
develop a series of unit tests -- tests that cover all of the
functional requirements.
There are some cultural issues with outsourcing:
- supplier of a component might not really understand the
requirements and the context
- one idea: get the supplier to explain their understanding
- in some cultures (India, USA), there is a reluctance to ask
detailed questions
- other cultures (China) will always ask questions to get complete
understanding
To outsource successfully, you must be able to write testable specs.
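As a rough illustration of what "testable" can mean, a requirement such as a range check can be written as a table of inputs and expected outcomes that both customer and supplier can run. Everything in this sketch is hypothetical: the set_speed function, the 0 to 120 limit, and the case table stand in for a real spec.

```python
MAX_SPEED_KMH = 120  # hypothetical limit from a hypothetical requirement


def set_speed(kmh):
    """Hypothetical supplier-implemented function with a range check."""
    if isinstance(kmh, bool) or not isinstance(kmh, int):
        raise ValueError("speed must be an integer")
    if not 0 <= kmh <= MAX_SPEED_KMH:
        raise ValueError("speed out of range")
    return kmh


# The spec as data: (input value, should the function accept it?)
ACCEPTANCE_CASES = [
    (0, True),        # lower boundary
    (120, True),      # upper boundary
    (-1, False),      # just below range
    (121, False),     # just above range
    ("fast", False),  # wrong type
]


def failed_cases(func, cases):
    """Return the inputs for which func disagrees with the spec table."""
    failures = []
    for value, should_accept in cases:
        try:
            func(value)
            accepted = True
        except ValueError:
            accepted = False
        if accepted != should_accept:
            failures.append(value)
    return failures


print("failed cases:", failed_cases(set_speed, ACCEPTANCE_CASES))
```

Keeping the cases as data separates the spec from the checker, so the same table can be handed to a supplier and rerun unchanged when the component is delivered.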
Important for success: face-to-face relationship -- ability to
discuss reliability expectations.
Another idea: Use graphical representations for specs (UML,
etc.) -- this can improve understanding.
Last modified: Oct. 17, 2005