reliable drivers for monolithics #linux

5.17: summary:
. many of the crashes we endure on rock-solid linux
are due to device driver errors .
. here are my edits of several papers that offer hope
by transitioning drivers to an event-based architecture .

4.23: news.cyb/sec/reliable drivers for monolithics:
Leonid Ryzhyk(Ph.D. 2009)`
On the Construction of Reliable Device Drivers
Previous research on driver reliability
has concentrated on detecting and mitigating
defects in existing drivers
using static analysis or runtime isolation.
In contrast,
this dissertation presents an approach to
reducing the number of defects
through an improved  device driver architecture
 and development process.
This implementation allows our reliable drivers
to coexist with conventional Linux drivers,
providing a gradual migration path .
4.24: Dingo:
Taming Device Drivers:
Device drivers are notorious for being
a major source of failure in operating systems.
In analysing a sample of real defects in Linux drivers,
we found that a large proportion (39%) of bugs are due to
two key shortcomings in the device-driver architecture
enforced by current operating  systems:
poorly-defined communication protocols
between drivers and the OS,
which confuse developers and lead to  protocol violations,
and a multithreaded model of computation
that leads to numerous race conditions and deadlocks.
We claim that a better device driver architecture can
help reduce the occurrence of these faults,
and present our Dingo framework as constructive proof.
Dingo provides a formal, state-machine based, language
for describing driver protocols,
which avoids confusion and ambiguity,
and helps driver writers implement correct behaviour.
It also enforces an event-driven model of computation,
which eliminates most concurrency-related faults.
Our implementation of the Dingo architecture in Linux
offers these improvements,
while introducing negligible performance overhead.
It allows Dingo  and native Linux drivers to coexist,
providing a gradual migration path
to more reliable device drivers.

driver protocols in Dingo are specified using a
state-machine-based formal language
called Tingu.[dingo cub]
Dingo addresses concurrency issues by defining
an event driven model of computation for drivers,
where the driver interacts with the system
through a series of atomic operations.
This model avoids the synchronisation complexity
of a multithreaded model,
and eliminates many concurrency bugs.

Any non-trivial device driver is handling I/O requests,
processing interrupts, and dealing with
power management and hot-plugging events.
Most operating systems are designed to run these activities
in separate threads that invoke the driver in parallel.
This multithreaded model of computation
requires the driver to protect itself from race conditions
using a variety of synchronisation primitives.

In addition, a driver in the kernel environment
has to keep track of the synchronisation context
in which it is invoked. For instance,
a driver running in the context of an interrupt handler
is not allowed to perform any potentially blocking operations.
Concurrency management accounts for
19% of the total number of bugs.

In Figure 2 we see that the rate of concurrency bugs
is higher in USB drivers (26.5%)
and IEEE 1394 drivers (23.5%)
than in PCI drivers (9%).
USB and IEEE 1394 buses support hot-plugging,
seen by the driver as a device disconnect event .
Disconnect happens asynchronously to
all other activities,
causing race conditions in all USB and IEEE 1394
drivers covered by our study.
In addition, since these buses are not memory mapped,
communication with the device is based on
asynchronous messages,
which adds another degree of concurrency to the driver logic.

Statistics for different types of concurrency faults
are shown in Table 3.
From this we see that concurrency faults are
mostly introduced in situations where a sporadic event,
such as a hot-unplug notification
or a configuration request,
occurs while the driver is handling a stream of data requests.

An event-driven architecture for drivers
The concurrency problems highlighted earlier
are not unique to device drivers.
In a multithreaded environment,
concurrent activities interleave at the instruction level,
leading to non-determinism and state explosion.
As a result, many programmers are generally
ineffective in dealing with threads,
which makes multithreading the leading source of
bugs in a variety of applications,
including OS kernels.

An alternative to multithreaded concurrency
is event-driven concurrency.
In the event model,
a program executes as a series of event-handlers
triggered by events from the environment.
Events are strictly serialised,
thus replacing instruction-level interleaving
with event-level interleaving.
Serialisation guarantees that the state of the program
observed at the start of an event
can be modified only by the current event handler.
This simplifies reasoning about the program behaviour
and reduces the potential for race conditions and deadlocks.

Comparison of threads versus events has been
the subject of lasting debate in the systems community
[Lauer 1978, Adya 2002, von Behren 2003].
One point of consensus in this debate
is that different applications may favour different models.
We argue that in the context of device drivers
the event model is preferred;
becuase, it eliminates most concurrency-related bugs
and can be implemented in a way that
neither complicates driver development
nor incurs a performance penalty.

. modern device drivers are already partially event-driven
as asynchronous handling of I/O requests
improves performance by parallelizing
I/O and computation.
. asynchronous handling by the driver
means the I/O request is added to a queue
and immediately returns control to the caller.
Later, it receives a completion notification from the device
and invokes a completion callback provided by the OS.

. splitting long-running operations into
request and completion steps
is a typical pattern for event-driven systems.
Thus, while current drivers do not fully exploit
the advantages of the event-driven model,
this style of programming is already familiar
to driver developers .

Event-driven architecture of Dingo
. because message handlers are handled quickly
(all they do is put an item in a queue)
the system can afford to let them work uninterrupted;
[an uninterrupted operation is said to be atomic
since it can't be broken down into steps]

The atomicity guarantee means that
no new message can be delivered to a driver
while a previous message handler is running.
This prohibits simultaneous invocations of a handler
by different threads,
as well as recursive calls from the same thread.
Note that the atomicity constraint does not
prevent the driver from being invoked from
different threads or different CPU cores,
as long as all invocations are serialised.

. by handling requests atomically
there can be concurrent message handlers
without the need for
thread-based synchronisation primitives
(spinlocks, mutexes, wait queues).
. the driver need only synchronise
those tasks that span multiple messages.
The number of cases where such synchronisation is required
is much smaller than in multithreaded drivers .

The event-driven architecture also simplifies
the use of I/O memory barriers.
In paticular,
barriers that order accesses to I/O memory from different CPUs
can be moved from the driver into the framework.
On architectures that
require barriers to order I/O memory accesses on a single CPU,
the programmer is still responsible for
correctly placing such barriers.

. since message handlers are not allowed to block,
there is no way for the driver to
freeze its execution waiting for an event,
such as a timeout or a hardware interrupt.
the driver has to complete the current message handler
and later resume execution in the context of a new message.

Splitting a single operation into a chain of completions
leads to complex and unmaintainable code
—the effect known as stack ripping [Adya 2002].
The cause of the problem is that
by splitting a function into multiple fragments
we lose compiler support for automatic stack management
and mechanisms that rely on it,
such as control structures and local variables.

Fortunately, event-driven programming can be
combined with automatic stack management.
One way to achieve this
has been demonstrated by Tame [Krohn 2007],
a C preprocessor providing event-driven programs
with a sequential style of programming.
We have implemented a similar approach in Dingo.
Our preprocessor provides several macros
that enable driver control logic to be
expressed in the natural sequential way,
avoiding stack ripping.

To illustrate this solution,
consider the following excerpt from a Linux driver,
which polls a device status register until
its value satisfies a certain condition.
The msleep function causes the current thread to
block for ten milliseconds before retrying the read:
do { msleep (10);
/* read status register */
... } while (/*condition*/);
Corresponding code in a Dingo driver looks like this:
do { CALL (timeout (10, &notif), notif);
/* read status register */
... } while (/*condition*/);
The CALL construct, when expanded by the preprocessor,
calls the timeout function to start a timer,
then saves the state of the current function,
including arguments, local variables
and the instruction pointer in a continuation structure,
registers this structure as the handler for the notif event
and returns from the current message handler.
When the timer triggers the notif event after 10 milliseconds,
the continuation fires,
restores the state of the function
and continues from the next statement following CALL.
The key to this solution is that
the state of the execution stack
is stored in continuation structures
and is automatically reconstructed by CALL statements.
The programmer can rely on automatic stack management
and can use local variables and C control structures
in the usual way.
Other constructs supported by the preprocessor include EMIT,
which triggers an event, and AWAIT,
which suspends the program
until one of a set of events occurs.

Selectively reintroducing multithreading:
One limitation of the event-driven model
is that it prevents the program from exploiting
multiprocessor parallelism.
This is not an issue for the vast majority of drivers,
which are I/O-bound rather than CPU-bound.
. a careful implementation of the event model
enables event-based drivers to achieve
the same I/O throughput and latency
as multithreaded drivers.
there do exist devices whose performance
could potentially suffer from request serialisation,
(eg, 10Gb Ethernet or InfiniBand controllers).
Although in our experiments
event-based drivers perform well even for these devices,
we would like to allow driver developers to
use multithreading when absolutely necessary.

We observe that high-performance devices
are designed to minimise contention
and avoid synchronisation in the data path.
As a result, the synchronisation complexity in their drivers
is concentrated in the control path,
whereas the data path is free of synchronisation operations.
Based on this observation,
we introduce a hybrid model in Dingo,
where all control messages are serialised
with respect to each other and to data messages.
while multiple data messages
are allowed to be handled concurrently.

The driver implementer can choose whether
the driver should run in the fully serialised mode
or in the hybrid mode.
Drivers running in the hybrid mode
benefit from the advantages of the event-driven model
without experiencing any added overhead of serialisation.
The distinction between data and control messages
is drawn by the protocol designer
who labels messages that can be sent or received concurrently
with a special keyword.
We have implemented both modes for the InfiniBand
driver described in Section 8.
Our original implementation was fully serialised.
We found that no changes to the driver were needed to
run it in the hybrid mode,
since the data path of the driver
did not require any synchronisation

Leonid Ryzhyk, Ihor Kuz, and Gernot Heiser.
Formalising device driver interfaces.
In 4th PLOS, Stevenson, Washington, USA, Oct 2007.
. The lack of well-defined protocols
for interaction with the operating system
is a common source of defects in device drivers.
. we investigate the use of a formal language
to define these protocols unambiguously, and compactly,
so as to be readily understood by software engineers,
conveying all important requirements
for driver behaviour .

It is intended to close the communication gap
between OS and driver developers
and enable more reliable device drivers.

 Failing to comply with a protocol leads to
subtle errors that are often manifested in corner cases,
when the driver handles an uncommon combination of events,
e.g., receiving a hot-unplug notification
while processing a power management request.
Such errors are particularly difficult to detect during
testing and code revision.
We believe that this type of defect
is caused in part by the lack of a precise specification
of how a correct driver should behave.
Typically, driver interfaces are defined in terms of
a set of functions that the driver must implement
and a set of callbacks that the driver can invoke.
but without specifying constraints on
ordering, timing and arguments of invocations .
. inferring these constraints leads to errors.

As part of an effort to develop a
highly-reliable I/O framework,
we are investigating the use of a formal language to
specify device driver protocols.
The requirements for such a language are that it
(1) is sufficiently expressive to capture
important constraints of the protocols,
(2) is readily understood by driver developers
(3) provides mechanisms for modular
specification of complex protocols.

We found that existing languages for
software protocol specification
do not satisfy these requirements.
As a result we have developed a new specification language
called Driver Protocol State Machines (DPSM).
DPSM is a visual language,
which uses a subset of the Statecharts [4] syntax
and enhances it with several new constructs, namely
protocol dependencies, subprotocols
and protocol variables.
DPSM specifications are readily understandable
and can be used as central design documents
that guide the development of both device drivers
and the I/O framework itself.
The language has well-defined semantics,
which enables its use for automated
static and runtime analysis of driver behaviour.
The semantics are expressed by means of
translation from DPSM specifications
to terms in the CSP process algebra [5].
The design of DPSM is driven by
our experience specifying and implementing
real driver interfaces.
We introduce a construct to the language only if
it has proved necessary to model the behaviour
of several types of drivers
and can not be easily expressed with other constructs.
Note that DPSM protocols are device-class specific,
i.e. a protocol describes common functionality of
a class of devices, such as Ethernet controllers or USB hubs,
rather than a particular device model.
A protocol is defined by the OS designer
when support for a new device class is introduced to the OS.
Therefore, DPSM cannot be applied directly
to legacy systems that have been developed without
a rigorous model of driver behaviour in mind.
Such systems could gradually move to DPSM-defined protocols
as they evolve;
existing drivers would continue to use old protocols,
while newly-developed drivers would use DPSM protocols.
This approach is not uncommon.

In this paper we improve upon our earlier work [9]
by developing a complete DPSM syntax
and enhancing the language with dynamic constructs,
namely protocol variables and subprotocols.
In addition, we demonstrate the use of
DPSM on a realistic example:
a USB hub driver protocol.

The DPSM language has been developed in tandem with
Dingo — a user-level driver framework
that we are building for the L4/Iguana [6] embedded OS.
A driver in Dingo is represented as an object whose
functionality is accessed through ports (Figure 1).
A port is a typed bidirectional
message-based communication point
between the driver and the environment.
For instance, a driver for a USB hub has three ports:
the lc port of type Lifecycle for
messages related to driver lifecycle control,
the hub port of type USBHub
that exports the hub functionality to the OS,
and the usb port of type USBInterface
through which the driver communicates
with the USB bus driver.
In Dingo, drivers are single-threaded and non-blocking.
Operations that involve waiting for an external event,
e.g., a hardware interrupt,
are split into two non-blocking phases:
request and completion.
This design decision is motivated by reliability concerns.
Multithreading in driver framework APIs has been shown to
constitute a major source of driver bugs [2],
while the benefits of multithreaded drivers are minimal, if any,
since the performance of most drivers
is bounded by device rather than CPU speed .

The software engineering community
has developed a number of software protocol specification languages,
most noticeably UML Protocol State Machines (PSM) [8].
Like DPSM, PSM is based on the Statecharts visual syntax,
which allows modelling complex protocols
using hierarchy and concurrency
and makes PSM specifications easy to understand.
However, neither PSM nor other languages we have investigated
allow the expression of protocol dependencies,
protocol variables and subprotocols.
In the operating systems community,
the SLAM [1] and Singularity [3] projects have used
state-machine based formalisms
to specify driver interfaces.
Both formalisms provide a form of subprotocol spawning
but do not support protocol dependencies
nor infinite-state variables,
such as sets and queues
(Singularity does not support protocol variables at all).
As a result, they only partially capture
the requirements of driver protocols.
Besides this, neither SLAM nor Singularity
provide a means for structuring complex protocols
and do not aim to make protocol specifications
easy to understand and work with.
The reason for these limitations
is that SLAM and Singularity
use formal protocol specifications solely for
static-analysis purposes.
Properties that cannot be statically verified,
e.g., those involving infinite-state variables,
are excluded from the formalism.
Clarity and manageability are not among addressed issues.
In contrast, we separate specification from verification.
The primary purpose of DPSM specifications is to
serve as guidelines for driver developers.
As mentioned in the previous section,
they can also be used as properties against which
driver implementations can be statically verified.

The Devil [7] project has developed a language
for describing the functional interface of a device
in terms of its memory and register layouts,
for use by device drivers.
This is complementary to DPSM,
which is intended for specifying
the software interface of a driver.