Fostering Software Reliability in an Increasingly Hostile World

Greg Utas - Pentennea Inc.

Introduction. I believe that standards specifying the following capabilities would help to foster the production of reliable software:
1. handling signals as C++ exceptions
2. cooperative scheduling
3. proportional scheduling

These capabilities are currently absent from both the POSIX standard and the carrier-grade Linux initiative. However, they are important to large, scalable servers, which are typically soft real-time systems that must provide near-continuous availability.

Signal Handling in C++. POSIX discusses signals at some length. But because POSIX is language-independent, it does not address signal handling in C++. Consequently, this capability would probably have to be specified in a C++ standard.

The general requirement is the ability for signal handlers to integrate signals with C++ by mapping them to exceptions. This is possible in some environments but depends on the combination of compiler and operating system. The issues include

Cooperative Scheduling. POSIX limits itself to preemptive scheduling, whether by priority or by round robin. This was the outcome of adding a hard real-time requirement (priority scheduling) to timesharing policies (round robin with preemption). However, cooperative scheduling improves the reliability and capacity of soft real-time systems by allowing threads, instead of the scheduler, to decide when task switching can occur. This reduces the amount of task switching. More importantly, by allowing transactions to run without preemption, it eliminates the error-prone practice of having to identify and protect critical regions at a fine-grained level.

The general requirement is the ability for a thread to lock, which suppresses preemption until the thread "cooperates" by unlocking, allowing other threads to run. This raises at least two issues:

Proportional Scheduling. This scheduling discipline is also important in soft real-time systems, in which one thread should rarely have absolute priority over another. For example, suppose that a system gives payload work absolute priority over administrative work. If the system then receives more work than it can handle, payload work consumes all of the CPU time and starves administrative work, so its operators are apt to reboot the system when it fails to respond to their console commands.

Here, the general requirement is the ability to assign a faction, rather than a priority, to each thread. CPU time is then apportioned among the factions by giving each one a subset of the timeslices that recur over a broader time interval. This eliminates task starvation by allowing payload work to receive most of the CPU time while still allotting some time to administrative work. The details include

Conclusion. The most desirable outcome would be to specify the above capabilities in a way that allows various run-time environments to implement them natively. An alternative would be to specify a lower-level set of capabilities that wrapper classes could use to implement the higher-level capabilities portably across environments.