Fostering Software Robustness in an Increasingly Hostile World

OOPSLA 2005 Conference, San Diego, CA
Wednesday October 19, 2005

Panel members:
Brian Berenbach, Siemens
Djenana Campara, Klocwork
Richard Gabriel, Sun
Ricardo Lopez, Qualcomm
Dave Thomas, Bedarra
Greg Utas, Pentennea
Steve Fraser (moderator), Qualcomm

Panel session notes by Dennis Mancl (mancl@lucent.com).

[Note: This is not an exact transcript of the panel discussion -- not even close. These are my notes about each question and response, with a lot of paraphrasing. Each question is marked with a "Q", and each response is marked with the first name of the panelist. Some of the questions are summarized to save space. Each response summarizes the content of what the panelist said, frequently capturing the key words they used.]

[This writeup is Copyright (c) 2005 Dennis Mancl. Permission is granted for anyone to reproduce or excerpt these notes for any non-commercial purpose, as long as credit is given to the original author.]

Opening statements

Djenana Campara. Enterprises and developers are in a new era of development. Hackers and poor software quality are everywhere. There is 20% annual growth in attacks, and 35% of attacks are due to poor software quality. There are 6 to 7 defects per 1000 lines of code, so a million-line application probably has 6000-7000 defects, and even if only 1% of those defects are security vulnerabilities, that is 60-70 vulnerabilities. We need a systematic approach to quality problems. And we should be aware that there are terrorists urging a holy war in cyberspace -- attacking vulnerable financial systems.
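[A quick back-of-envelope check of these figures, sketched in Python. The 6.5 defects per KLOC is my assumption -- the midpoint of the quoted 6 to 7 -- and the 1% vulnerability rate is taken from the remark above.]

```python
# Back-of-envelope check of the defect and vulnerability estimate above.
# Assumes 6.5 defects per 1000 lines (midpoint of the quoted 6-7)
# and that 1% of defects are security vulnerabilities.
defects_per_kloc = 6.5
lines_of_code = 1_000_000

defects = defects_per_kloc * lines_of_code / 1000  # about 6500 defects
vulnerabilities = defects * 0.01                   # about 65 vulnerabilities

print(f"{defects:.0f} defects, {vulnerabilities:.0f} potential vulnerabilities")
```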

Ricardo Lopez. "Hostile" is in more than one dimension: security and also economic hostility. You want to make your software better than the competition, with more features and better performance. This is hostile to quality.

Since software is more ubiquitous, it means a simple failure can have grave consequences. It has been called the Promethean crisis: Bring fire to society, and maybe we can't stand it. There are lots of opportunities to foster robustness. Changing the way we think can help -- the people who are doing development should be assets rather than expenses. Also, when someone fails, it's an opportunity for learning. Embrace limits -- and find a way to go beyond them.

Greg Utas. A definition of robustness: the system stays in service in the face of difficulties, such as hardware failure, software problems, and human mistakes.

Telecom systems are what I have worked on, and they face challenges: to stay in service even with bad pointers and memory leaks. The reliability needs are challenges to startups as well as to established telecom firms.

There are lots of common practices from hard real-time that are just too costly for telecom's soft real-time, and many startups are doomed before they discover this.

There is an organizational issue: nobody thinks they are in the software business. Telecom companies think their products are boxes and solutions, even though most of the professionals working in the companies are software staff. Software becomes the handmaiden (a poor handmaiden) to product management.

Also, there is no career path for software people. You need to attract good software people and get them to stick around.

Time to market is a big pressure. Why should we waste money on robustness until we know if a product will be successful? But then it is too late to get robustness into the architecture.

Brian Berenbach. I'm in a very big company that is in a lot of critical businesses (telecom, medical systems, etc.). There are 65,000 software developers in Siemens.

There are three kinds of hostilities: Organization hostility, Arrogance, and Complexity. These conspire to cause problems. You need personal integrity to overcome organizational hostility.

I have three stories: good, bad, and ugly. First one -- I was involved as an architect in the design of gas-cooled nuclear plants. The physical design was going to fail catastrophically, because of big temperature differences between the internal system and the external cooling water. I spoke up, got the project cancelled, and was fired. This was the "good" one, because I did get the project cancelled.

The bad one: I had another project in another company where I spoke up about software quality problems -- and got fired without the project being cancelled.

Ugly... I took a look at the current User Guide and Reference Manual books for the Rational Unified Process. The words "Quality" and "Metrics" are not in the index. It's someone else's problem.

I sometimes refer to the RUP as "the Necronomicon of software development".

Dave Thomas. The biggest problem with quality is the Vice President of Quality. There is usually lots of process, and I think CMM is a fraudulent activity. Motorola had to do a 10X project to speed things up after they installed Six Sigma.

I think that we should use all of the extra power of multi-core machines and other new hardware for adding reliability. For example, use extra bits for buffer overrun protection. Also, I think KLOC kills more than anything else -- better to keep things smaller.

Working with commercial-grade technologies that have no specs and no test suites makes it extremely difficult to do your work. You can't build on top of things underneath you that can change. It has been a mistake to push class libraries instead of interfaces.

The need to encode domain knowledge is increasing exponentially, so we need to put more power into the hands of end users. [This statement echoed the keynote address by Mary Beth Rosson about end user programmability.]

Dick Gabriel. I heard hostility analogized to terrorism. Although people were killed in New York in 2001, and there were design problems with the building that caused some deaths, the city as a whole survived.

Software is built out of things that are not real. It can't act like other building materials. It can only be robust in ways totally different from material things.

Q1. Are there ways we might introduce more bugs if we change the base code to add reliability?

Greg. There is something called verification inertia. Once something works, people are reluctant to change it. We should have continuous change to rejuvenate the software, and we need automated tests to be sure we have changed it correctly.

Djenana. I have an anecdote. A VP of R&D for one company said "I would like to understand why new hires are stupider and stupider." The answer is "It's your code." (The complexity your code has accumulated over time erodes quality and becomes a bigger and bigger obstacle to new people on your team.) You can think of these as "hairballs" in the system. They increase the cyclic dependencies in the system, so when we find 1 bug, there may be 5 more attached.

Engineers are forced to understand too much, which makes them look stupid (even though they are pretty sharp). We need a way to bring visibility to these problems, something like static analysis technology. [Note: Djenana is CTO of Klocwork, one of the top static code analysis tools vendors.]

Brian. Software is like making a painting. An artist "signs" his/her work (with pride). There is no way I'm going to let something go out the door with flaws -- of course, I might get pressure from VPs.

Q2. (Carl Alphonse, University of Buffalo) What should we do? What should educators do?

Dick. What we can do -- start working on it. Look to biological systems and how they do robustness: feedback loops and gradients. Genetic programming is an approach -- some parts of the system aren't designed by people because they are too big and complex. And we can try to separate the "computer like" parts from the robust parts.

Ricardo. I have a story. Richard Feynman talked about a paradigm shift when he was asked "how do you see the future evolution of spacecraft." He said "We are going to have to learn how to grow them." Think about a combination of process, systems, and tests, as well as genetic approaches -- putting stuff out in a hostile environment.

Brian. At a university level, there are some ideas from Lutz at Iowa State. The majority of defects can be traced back to requirements. And students don't get training in requirements engineering. [Brian ran a full-day tutorial on requirements engineering at OOPSLA this year.]

Another important thing for university teaching: Make sure that Quality is in the index of your textbook.

Dave. Teaching configuration management isn't exciting. Teaching quality isn't popular either. I think that "cooperative education" -- students working in real organizations -- makes them appreciate the value more than classroom work does.

Test driven development is good for young developers.

At Carleton, we have combined degrees in business and computer science, engineering and computer science, and others. Most computer scientists don't know about any domains (ledgers, math for electrical engineering, physics, and so on).

Q3. My boss told me "it's a throwaway application". I ignored him, and the software I wrote is still living. Do you have some robustness examples?

Ricardo. Robustness starts at software architecture. Writing code without architecture is like a biological system with no checks.

We are aiming for elegance. I think it's difficult for C++ to be elegant.

Note that "testing architectures" is very different from testing code.

Dick. How do you build a high quality house? You hire a good construction company and architect, who overdesign the house and use good quality materials. We can't say the same thing about software today.

Brian. When your boss says "quick and dirty", ask him/her to "put it in writing".

You can also point out to your boss the ROI of doing a good job. It's good to speak in business terms to your management, not in computer terms.

Q4. The panel paints a picture of software becoming a safety critical component. In other industries, such as automobile manufacturing and airlines, government regulation is very important. Will regulation become important for software?

Ricardo. Five hundred years ago, anyone could build a bridge. But after a lot of bridges fell down, we started licensing engineers.

Note that software failures are more complex. It's often a chain of events, a chain of little failures, that leads to a system failure. It's because everything is more integrated.

Brian. We already have FDA, FAA, and the National Safety Administration. Some regulation is a good thing, but it won't stop flaws. It does give us (developers) a hammer over our managers.

Djenana. I believe in regulations and accountability. I've had customers that two years ago turned us away, but now ask us to come over and help. When I ask what has changed, they say Sarbanes-Oxley was the difference. [The Sarbanes-Oxley Act is a US law passed in 2002 that created stiffer auditing requirements for publicly-held US corporations.]

Greg. I don't believe in regulation. It's one size fits all, which has unintended consequences. I think contracts and legal action are preferable to imposed solutions. Some customers want more robustness than others.

Brian. Lawsuits are after the fact. You need to have something that is more proactive.

Q5. Architectures rot and institutions rot. Is there anything better than regulations for fostering robustness?

Djenana. Big organizations (large consumers of software products) can use their economic leverage -- they can delay deploying a product until the developing company builds some trust in the product.

By the way, current laws do prevent us from doing binary-level testing of third party products without a licensing agreement -- because the law considers this to be equivalent to reverse engineering.

Dave. Developers are not enthralled with internal quality groups. They are skeptical of "I'm from SQA and I'm here to help".

Continuous integration is a practice that can help. For example, it can expose the flaws in an organization that has implemented Six Sigma, by instrumenting things in the process and making data available: everyone will know what things in the code really work, and it makes everyone aware of bad code smells.

Q6. (Dennis Mancl, Lucent Technologies) What is the impact on reliability of outsourcing and global development?

Brian. The problems come from a lack of experience. We need to overcome the problems that we have with system specs, which we used to be able to solve by talking to someone around the corner.

We need some communication techniques and we need to be proactive.

There are some important cultural differences to be aware of. When I've worked with development teams in India, they wouldn't ask questions when they didn't understand something. We needed to ask *them* questions to check their understanding. We had a couple wrecks and disasters before we learned to get it right.

Dick. The Gerald Sussman talk (invited talk on using computing to express ideas more precisely) pointed out that the mathematical notation of physicists is sometimes ambiguous, and when you convert the expressions to a more expanded form, they lose clarity. All human communication has problems.

Djenana. Our experience is that if you have "garbage out", you will have "garbage squared back in".

Old code is resistant to change.

If you have a system that needs 24 hour attention to keep it going, management will be tempted to look to do it cheaper elsewhere. But this will cause problems. If the home team that designed it is struggling, it will be worse for a team in another country, with less knowledge of the design.

Greg. Outsourcing piece parts is less successful than outsourcing a whole product with total responsibility. (Multi-site development precludes face-to-face interaction.)

Dave. You should consider having people who are colocated in the same mental space, not necessarily the same physical space. For example, open source projects succeed with geographic distribution. One point about outsourcing -- outsourced projects will use the same lousy tools and have the same lousy management.

Q7. I'm wondering if the world would be better if we made software supportable.

Greg. The question is rhetorical. One issue is that many organizations fail to have a succession plan. You can plan to have expertise in various areas and a buddy system for handoff.

Ricardo. Exploratory software doesn't have to worry about this. But otherwise, yes, you need supportability.

Q8. (Rik Smoody) The discussion of organic systems and hairballs makes me remember that in biology 99+% of species are extinct. So when should we decide to put a piece of software in a sterile unchanging environment and start working on its replacement?

Dick. A crucial component in living things is cell death. Cells are programmed to have an external tag that can be used to find old cells to clean up. The cell death process is set up to not create toxins.

So I think we could use programs designed to kill hairballs.

Dave. There is lots of experimental data to support Gerry Weinberg's rule of 3. Most folks know that release 3 triggers the start of the next product. When you get to release 5 or 6, you are in the hairball.

Final statements.

Brian. I heard that the problem of complexity is something that no one has solved.

Personal integrity is important. Everyone has to look at themselves in the mirror and be proud to put their name on their product, which will make quality improve.

Greg. Reengineering is a challenge to do.

Djenana. Three messages: accountability works, management needs to be more proactive, and you need to manage software as assets.

Dave. An old technique is a competitive biological approach. Use 2 or 3 teams to build multiple models and compare.

Ricardo. Processes aren't organic today. I don't find that cultural differences are that big -- they don't make a big difference in quality. I see some good ideas for now, the near future, and the far future. Now - agility, accountability, and integrity. Near future - specifications and architecture. Far future - organic software.

Dick. I feel we have made huge systems successfully. I am happy.