Device Drivers in Loko Scheme

Written by Gwen Weinholt on 2020-03-02

Loko Scheme is an R6RS Scheme compiler that runs on Linux and bare metal. To be useful on bare metal it needs device drivers. One of the more interesting device drivers is the virtio net driver, which paves the way for more virtio drivers.

This is going to be a long one

Before explaining what virtio is, I will briefly go through how device drivers work in Loko. This is followed by descriptions of increasingly modern hardware. The descriptions are intended to be didactic rather than historically accurate, so I have taken some liberties. It still gets pretty long, so scroll to the bottom if you’re just interested in virtio. If you do read the whole thing then feel free to fill the comments section with corrections, elaborations, details, anecdotes and actually…’s.

Driver infrastructure

Device drivers in Loko are written in Scheme and use fibers for lightweight concurrency. You can think of fibers in Loko as similar to the goroutines in Golang. Communication with the rest of the system happens mostly via channels. The channels are not buffered, so two fibers must rendezvous to exchange a message on a channel, i.e. one fiber needs to be in a put operation while another one is in a get operation. The API and design are heavily based on fibers for GNU Guile.

The very simplest of devices might not need fibers, but it is generally a good idea to use them anyway to manage concurrent access to the device, which is a shared resource. Loko Scheme does not use any locks, so fibers are basically the only choice.

Device drivers fundamentally act as bridges between two different parts of the system: a device on the one hand and the kernel or application on the other hand. A device can be anything like a network card, a disk controller, a hard drive, a keyboard, and so on. Often something you can pick up and (responsibly) toss in the garbage.

The kernel’s interface to drivers in Loko is based on channels. When a device driver is started it is given some channels to use as its interface towards the kernel. A simple way to represent a network card might look like this:

;; A network interface
(define-record-type iface
  (fields rx-ch tx-ch ctrl-ch notify-ch))

This is obviously too simple to have all the information needed for a full network stack, but the required channels are there. The driver needs to know where to put packets it receives from the network card, so it puts them on the rx-ch channel. The driver gets packets from the tx-ch channel and tells the device to transmit them. The control and notification channels will be used for everything else, like MAC filters and link notifications.
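
To make this concrete, here is a minimal sketch of how a driver might service these channels, assuming Loko’s fibers API provides spawn-fiber, get-message and put-message; read-frame-from-device and write-frame-to-device are hypothetical stand-ins for device-specific code:

;; Hypothetical sketch: one fiber per direction, each blocking on a
;; channel or on the device as appropriate.
(define (start-driver iface)
  ;; Move received frames from the device to the kernel.
  (spawn-fiber
   (lambda ()
     (let loop ()
       (put-message (iface-rx-ch iface) (read-frame-from-device))
       (loop))))
  ;; Move frames from the kernel to the device.
  (spawn-fiber
   (lambda ()
     (let loop ()
       (write-frame-to-device (get-message (iface-tx-ch iface)))
       (loop)))))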

The device drivers in Loko are loosely coupled to the kernel via the channels. While Loko will be providing some higher-level interfaces on top of them, such as a network stack and file systems, it will still be possible to have your application access the drivers directly.

ISA devices

A modern PC comes with a bunch of legacy hardware from the days of the early IBM PC. The bus at that time was the ISA bus, so we call them ISA devices. Like an appendix or a tail bone, we might no longer believe them relevant, but they still have a role to play. Some of the more commonly used devices in this category are the PS/2 controller, the real-time clock (RTC) and the UARTs.

The PS/2 controller handled the keyboard and the mouse and often still has this role today on laptops. The RTC is wired to the CR2032 battery on your motherboard and keeps the time even when your computer is turned off. The UART handles the serial ports, if your computer has them.

These devices were usually some off-the-shelf components. The PS/2 controller was an Intel i8042, an 8-bit microcontroller with custom firmware and a serial interface. The RTC was something like a Motorola MC146818 and the UART was an NS8250 or PC16550D. Those who work with embedded systems today would find them large and power hungry, but otherwise approachable.

The devices are accessed over the I/O bus. The CPU is hooked up to this bus and has special instructions for using it. In Loko Scheme the API is in (loko system unsafe) and has these signatures:

;; Read from an I/O bus address
(get-i/o-u8 busaddr)
(get-i/o-u16 busaddr)
(get-i/o-u32 busaddr)
;; Write to an I/O bus address
(put-i/o-u8 busaddr n)
(put-i/o-u16 busaddr n)
(put-i/o-u32 busaddr n)

(There are no 64-bit accesses.) ISA devices listen on some well-known I/O bus addresses that are usually adjacent to each other. Here is an example of how to set the baud rate on a UART:

(define (uart-set-baudrate i/o-base rate)
  (define LCR (+ i/o-base 3))
  ;; Accessible when LCR[7]=1
  (define DL (+ i/o-base 0))
  (define DH (+ i/o-base 1))
  (let ((latch (div 115200 rate)))
    (unless (<= 1 latch #xffff)
      (error 'uart-set-baudrate "Invalid baud rate" rate))
    (let ((lcr (get-i/o-u8 LCR)))
      ;; Set the Divisor Latch Access Bit
      (put-i/o-u8 LCR (bitwise-ior lcr #x80))
      ;; Latch the new baudrate
      (put-i/o-u8 DL (bitwise-bit-field latch 0 8))
      (put-i/o-u8 DH (bitwise-bit-field latch 8 16))
      ;; Clear the DLAB
      (put-i/o-u8 LCR (bitwise-and lcr #x7f)))))

The fact that the DL and DH registers only become accessible when setting another unrelated bit is typical of these devices. They tried to save on expensive address decoding logic, so having more registers than I/O addresses is common.

Now we know how to get data in and out of the device; we just need to read the datasheet and figure out which registers to poke. The missing piece of the puzzle is how to write power-efficient drivers for these devices. It is possible to poll the device to see if it has data, but that is really wasteful. What we want is for the device to tell the kernel that it should run the driver.

This is handled with interrupts, of which Dijkstra said:

It was a great invention, but also a Box of Pandora. Because the exact moments of the interrupts were unpredictable and outside our control, the interrupt mechanism turned the computer into a nondeterministic machine with a nonreproducible behavior, and could we control such a beast?
– prof.dr. Edsger W. Dijkstra, EWD1303

Loko Scheme tries to tame that beast by integrating interrupts into the fibers system. The wait-irq-operation procedure takes an interrupt request (IRQ) number and blocks until that interrupt has fired. The fibers system makes it possible to compose arbitrary combinations of operations and perform them as one, returning the result of whichever operation finished first. A driver would normally wait for an interrupt at the same time as it waits for a message on one or more of its channels.
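
As a sketch, and assuming the operations API mirrors Guile fibers (perform-operation, choice-operation, wrap-operation, get-operation), waiting for either an interrupt or an outgoing frame might look like this:

;; Hypothetical sketch: block until the device fires irq-num or the
;; kernel hands us a frame on tx-ch, whichever happens first.
(define (wait-for-work iface irq-num)
  (perform-operation
   (choice-operation
    (wrap-operation (wait-irq-operation irq-num)
                    (lambda _ '(irq)))
    (wrap-operation (get-operation (iface-tx-ch iface))
                    (lambda (frame) (list 'tx frame))))))

The driver’s main loop can then dispatch on the first element of the returned list.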

In particular, Loko does not use top and bottom half interrupt handlers. There is no special interrupt context where your driver code runs but is forbidden from doing certain things, as in other kernels. The manual has more on Loko’s approach to interrupts.

ISA devices on PCI

At some point in the 90s everyone was ready for faster computers, and that required faster buses. PCI came along with several improvements over the ISA bus. Besides being faster and having a better connector, the PCI bus is enumerable by software. There is actually a set of registers on the I/O bus that provides a window into the PCI configuration space.

The configuration space lets software select and talk with a device by its bus number, device number and function number. In Loko Scheme the (loko drivers pci) library provides an API for accessing it. The most common tasks are:

  • Enumerate the PCI bus with pci-scan-bus.
  • Find the device’s addresses on the I/O bus or the memory bus with pcidev-BARs.
  • Read PCI configuration space to see the device type and configuration with pcidev-vendor-id, etc, or directly with pci-get-u8, etc.
  • Write to PCI configuration space to enable I/O, memory, etc.
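
As an example, a network driver might locate its card by vendor and device ID. This sketch assumes that the bus scan yields a list of devices and that a pcidev-device-id accessor exists alongside pcidev-vendor-id; the exact signatures are assumptions:

;; Hypothetical sketch: find the first RTL8139 (vendor #x10EC,
;; device #x8139) among the scanned devices. find is from (rnrs lists).
(define (find-rtl8139 devices)
  (find (lambda (dev)
          (and (eqv? (pcidev-vendor-id dev) #x10EC)
               (eqv? (pcidev-device-id dev) #x8139)))
        devices))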

ISA device addresses were usually hard coded and had to be probed for, and some devices had physical jumpers to set the I/O address and IRQ number. PCI changed that: the BIOS assigns addresses and IRQs to devices, and the OS later enumerates the bus to find out what was assigned. (Linux assigns addresses again because the BIOS does a bad job. Sorry, BIOS, we know you tried.)

The earliest PCI devices were more or less ISA devices with PCI bolted on. You could find a PCI UART card that uses PCI as a physical interface and for configuration, but otherwise just exposes the usual 16550 UART registers. PCI IDE devices are compatible with their ISA counterparts, and so on.

PCI bus mastering DMA

As people were ready for even faster computers, it became apparent that devices would need to do more work without involvement from the CPU. It is inefficient for the CPU to handle each byte of every transfer. ISA actually has a DMA controller to solve this problem. It is a special device that is programmed to perform transfers between the system memory and a device. But the ISA DMA controller is limited to 4.77 MHz and 8-bit or 16-bit transfers, so by this point it was slower than doing the same transfers with the CPU.

Instead of improving on the ISA DMA controller (which also happened), the winning solution was PCI bus mastering. This is a way to let a PCI device become “master” of the PCI bus, which allows it to perform transfers between itself and the system’s memory.

One common PCI bus mastering device is the Realtek RTL8139 network card. The driver allocates a buffer in system memory and gives its address to the network card by writing to one of the registers. In Loko Scheme this is done with dma-allocate and dma-free. The RTL8139 can only use 32-bit addresses, so the driver passes a suitable mask to dma-allocate.
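
As a sketch, and assuming dma-allocate takes a size and a mask of the address bits the device can drive (the exact signature is an assumption), the buffer allocation might look like this:

;; Hypothetical sketch: a 64 KiB receive buffer that is page aligned
;; and stays within the low 32 bits of the address space.
(define rx-buffer (dma-allocate (* 64 1024) #xFFFFF000))

When the driver shuts down it would return the memory with dma-free.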

The driver then starts the receive part of the RTL8139. When there is a frame on the network the device writes it to the buffer using bus mastering DMA and fires off an interrupt. Frames are usually Internet packets encapsulated in an Ethernet header. The driver needs to handle the frame (put it on the rx-ch channel) and then advance some pointers in the device to let it know that there is more space available in the buffer.

Transmission of a frame is similarly simple. There are two sets of registers that together form four “transmit descriptors”: basically 32-bit pointers and length fields. The driver copies the frame to a 32-bit address and sets up the next transmit descriptor. An interrupt can be fired when the transmit is done, if so desired. The RTL8139 reads the frame from system memory using bus mastering DMA.
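
As a sketch, and assuming the standard RTL8139 register layout (TSAD0 at offset #x20 and TSD0 at offset #x10 from the I/O base), starting a transmit on descriptor n might look like this:

;; Hypothetical sketch: the frame has already been copied to the
;; 32-bit DMA address tx-addr. Writing TSDn with the size (and the
;; OWN bit clear) is what starts the transfer.
(define (rtl8139-transmit i/o-base n tx-addr len)
  (put-i/o-u32 (+ i/o-base #x20 (* n 4)) tx-addr)  ;TSADn: buffer address
  (put-i/o-u32 (+ i/o-base #x10 (* n 4)) len))     ;TSDn: size; starts TX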

The whole RTL8139 driver is currently less than 300 lines of actual code and is worth studying.

DMA queues

As people once again became ready for even faster computers, new hardware designs were needed. The RTL8139 is no good. It has a tiny buffer to handle incoming frames and can only have four outgoing frames pending.

The Intel 8255x (eepro100) network card demonstrates the next step in the evolution of letting the device handle work independently of the CPU. This card has neither a single fixed receive buffer nor a small fixed number of transmit descriptors.

One unfortunate aspect of the RTL8139 is that frames can become unaligned, so they generally need to be copied out of the buffer. And even if that were not a problem, the buffer easily fills up anyway. The eepro100 eliminates these problems by using a level of indirection.

The driver allocates an area of system memory for receive descriptors. Each receive descriptor points to a buffer where the device will write a frame. When the device has written a frame, it updates the status fields of the descriptor and follows the link to the next receive descriptor. If it reaches a descriptor with an end marker then it stops receiving and lets the driver know that it’s out of descriptors. (There is room for improvement here, but the later models can briefly use a FIFO when this happens.)
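
To illustrate the descriptor-walking idea (the helpers here are hypothetical and the real 8255x field layout differs), the receive side might drain completed descriptors like this:

;; Hypothetical sketch: walk the list from the current head, handing
;; completed frames to the kernel and recycling their descriptors.
(define (drain-rx-descriptors! rx-ch head)
  (let loop ((desc head))
    (cond ((descriptor-complete? desc)   ;did the device fill this one?
           (put-message rx-ch (descriptor-frame desc))
           (descriptor-reset! desc)      ;hand it back to the device
           (loop (descriptor-next desc)))
          (else desc))))                 ;new head: first pending descriptor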

Transmit descriptors are also in memory. The driver sets up a circular linked list of transmit descriptors. (In the RTL8139, the four transmit descriptors were all in registers.) The eepro100 transmit descriptors also lift another limitation that I haven’t mentioned yet: in the RTL8139, all the bytes being transmitted had to be consecutive in memory. This is not always the case with network packets, which might be assembled from separate buffers as they pass through several network layers that each prepend their own headers.

The transmit part of the driver gets packets from the tx-ch channel, fills in the next available transmit descriptor and then tells the device to resume transmissions. Some simple bookkeeping is needed to reclaim used descriptors, but this is pretty straightforward.

Virtio and virtqueues

After the 90s, people were once again ready for faster computers, which brought things like PCI Express and Message Signaled Interrupts (MSI). But other people who had noticed that PCs had gotten very fast now wanted to put them up in caves and rent them out, virtually. Thus the cloud was born.

Virtualization was primitive for a long time and basically meant you had something pretending to be a 1990s computer, but working even more slowly and poorly. As hardware improved, CPUs and even the devices grew direct support for virtualization.

But virtual machine hosts usually emulate real hardware, like the RTL8139 network card, along with a selection of worse and better cards. The same goes for everything else in a computer: there’s an emulated IDE controller, an emulated UART, etc. The really terrible part of this is that each device has its own register interface, and the better devices all have their own idea about how DMA queues work.

Usually they do not minimize the number of register accesses either. When the hardware is virtualized, reading and writing to device registers means the CPU switches from the guest to the host and back to the guest.

The DMA queues are always custom and can get quite intricate. They can form the larger part of getting a driver working. Enter virtio and virtqueues.

Virtqueues are virtio’s standard for doing DMA queues. Each kind of virtio device has its own set of virtqueues. A network card has a receive queue, a transmit queue and a control queue. When it comes to the data structures and the queue registers, each queue works identically and transports blocks of bytes between the device and the system memory. This means that you can have a common library for setting up the queues that is then reused by all virtio device drivers.

A virtqueue is an area in system memory with three parts: descriptors, an available list and a used list. Initially all descriptors are free. When the driver wants to do DMA transfers, it takes free descriptors and fills them in with buffer pointers, buffer sizes and transfer directions. It then takes the next slot in the available list, updates it to point to these descriptors, and tells the device to look in the list for descriptors that are available to it. When the device is done with the descriptors, it puts them in the used list and optionally fires off an interrupt.
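
Here is a sketch of the driver’s side of that dance for a single buffer. It assumes the split virtqueue layout from the virtio specification (16-byte descriptors; an available ring whose 16-bit slots follow a 4-byte header) and the get-mem-*/put-mem-* accessors from (loko system unsafe); desc-base and avail-base would come from dma-allocate:

(define VIRTQ_DESC_F_WRITE 2)   ;the device writes to this buffer

;; Hypothetical sketch: fill in descriptor desc-idx and publish it in
;; the available ring. A real driver also needs a memory barrier
;; between filling in the ring slot and bumping the ring index.
(define (virtq-add-buffer! desc-base avail-base queue-size
                           desc-idx buf-addr buf-len device-writes?)
  (let ((desc (+ desc-base (* desc-idx 16))))
    (put-mem-u32 desc (bitwise-and buf-addr #xFFFFFFFF))   ;addr, low half
    (put-mem-u32 (+ desc 4) (bitwise-arithmetic-shift-right buf-addr 32))
    (put-mem-u32 (+ desc 8) buf-len)                       ;len
    (put-mem-u16 (+ desc 12) (if device-writes? VIRTQ_DESC_F_WRITE 0))
    (put-mem-u16 (+ desc 14) 0))                           ;next: unused here
  ;; Write the descriptor index into the next free ring slot, then
  ;; publish it by incrementing the free-running ring index.
  (let ((idx (get-mem-u16 (+ avail-base 2))))
    (put-mem-u16 (+ avail-base 4 (* 2 (mod idx queue-size))) desc-idx)
    (put-mem-u16 (+ avail-base 2) (mod (+ idx 1) #x10000))))

After this the driver notifies the device with a register write so that it knows to check the ring.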

This simple system is then used to build all types of devices. Each device type has specific details as to what goes inside the transfers and which extra registers are available, but these differences are pretty mild compared to having to debug a DMA queue design that you’ve never seen before.

Nowadays there is even real virtio hardware.

Future work

More drivers! In particular, virtio drivers for disks, consoles and video.

The DMA allocator should have another allocator built on top of it, suitable for smaller allocations. Something like a SLAB, SLUB or even a SLOB allocator. Drivers often need to allocate small amounts of memory and the smallest allocation from the DMA allocator is 4096 bytes.

The network drivers are not hooked up to a proper network stack yet, so there is no TCP/IP. The S³ network stack needs a bit of work and it would be interesting to get it running.

Post script

I jokingly wrote “as people were ready for even faster computers”. But I find it funny how hardware seemingly got better just to the level that people were ready to accept. The concepts were already known. If you want to know what is next, then perhaps a glimpse into the future can be had through Guy Steele’s 1992 presentation of the Connection Machine CM-5, What Is the Sound of One Network Clapping?