Non-POSIX file systems

Written by Göran Weinholt on 2020-08-30

Operating systems and file systems have traditionally been developed hand in hand. They impose mutual constraints on each other. Today we have two major leaders in file system semantics: Windows and POSIX. They are very close to each other when compared to the full set of possibilities. Interesting things happened before POSIX monopolized file system semantics.

When you use a file system through a library instead of going through the operating system there are some extra possibilities. You are no longer required to obey the host operating system’s semantics for filenames. You get to decide if you use / or \ to separate directory components (or something else altogether). Maybe you don’t even use strings for filenames. The fs-fatfs library uses a list of strings, so it’s up to the caller to define a directory separator for themselves. While working on that library, I was driven to write down some ideas that I’ve previously run across and found inspirational.

Multics

The very first hierarchical file system was developed for Multics. It is described in the paper A General-Purpose File System For Secondary Storage (1965) by Robert C. Daley and Peter G. Neumann. There are several things that I find astounding about this paper:

  • There were apparently no hierarchical file systems before Multics. The references do no cite any previous work on this and I haven’t found any.
  • Privacy in computing was a big consideration even back then.
  • Modern hierarchical file system are essentially unchanged from 1965, but some features disappeared along the way.
  • There’s a very interesting backup system described in this paper.

The paper describes these concepts that we still use today:

  • Files.
  • Directories.
  • Root directory.
  • Working directory.
  • Tree name, i.e. the name of a file or directory in the tree structure. Akin to realpath(3).
  • Path name, which is like a tree name, but it is conceptually different because components can be links. The paper used ":" instead of "/" to separate directories.
  • Relative paths. They used an initial ":" in place of "./" (or empty) and "*" in place of "..". Instead of "cd .." you used "CHANGEDIRECTORY :*".
  • Symbolic links.
  • Access control using access control lists (showed up relatively recently in POSIX).
  • Read, execute and write modes on files and directories. Execute on a directory means permission to search it, just as in POSIX.
  • Segments and segment numbers (file descriptions and file descriptors, respectively, if I’m not mistaken).

It is said that Unix took inspiration from Multics and simplified things. What was lost? Whoever implemented the file system for Unix stopped reading the paper somewhere around subsection 2.3. There are at least these features we’re missing:

  • Access control lists can define a TRAP for a user. Users can refuse to use TRAPs and instead get an error. Otherwise TRAPs call a named function at the time of the access. They can be used to implement smart files that can e.g. monitor file usage or apply password-based file locks.

    (Someone will undoubtedly insert eBPF into the Linux VFS soon and we will have invented TRAPs again).

  • Multics fully separated the append mode and the write mode and made it available to users. You could actually have files and directories that were writable but not appendable, or appendable but not writable (or both).

    (Append-only files exist on Linux through file attributes settable by the super user or processes with CAP_LINUX_IMMUTABLE. This is very weak sauce. Immutable directories exist, but not append-only directories. And of course there is no way to have a writable file or directory that can’t grow).

  • Directory entries that point to secondary storage. This is a game changer for file system management. More on this below.

Multiple-level storage and backups

Files in the Multics file system could have directory entries that point to a secondary storage medium. At the time the secondary media were tapes, but today they could be mechanical hard drives, USB drives or network servers. But what is the point of having a file that isn’t there? It’s really quite clever and it only gets more clever the more you look at it.

When a Multics system was running out of disk space it could remove unused files from the primary medium, but keep the directory entries. If you then tried to open such a file it would ask you to wait while the file is being made available. It would issue instructions to the operator (perhaps later a tape robot) to have them retrieve the required tapes for restoring the file. Very useful when disks were small and tapes were plentiful.

But it gets even better when you consider backups. Today if you need to restore a Linux system from backups you do it this way: find the oldest full backup and restore it, then restore each incremental backup, going from oldest to newest. This means that the system can’t be used until you’ve completely restored it from the backups. With today’s disk sizes that could take a very, very long time.

Multics backups are done differently. To restore a completely hosed system you first restore the system files, which will allow you to boot. Then you restore the latest incremental backup. After this the file system will contain the most current directory entries. Just restore one incremental backup tape and all the files are already in their right places. This is much faster than waiting for everything to be restored!

Why is only the latest incremental backup required to get everything back in its place and working? The clever part is that the files might not yet be on the disk. In fact, most files will probably be on another backup medium. But the most recently used files have been restored, so you can most likely do useful work already, and all other files have their directory entries. If you try to open a file that hasn’t been restored then you’re asked to wait and the operator is told which tape to put in to get the file back.

Is something like this offered on any POSIX-compatible file system?

Xerox Alto

The Xerox Alto was also a system of many firsts. It is most famous for having the first windowing GUI, which inspired Apple’s design, which then inspired Microsoft’s design. Of course it had a hierarchical file system with version numbers, and even a network file system.

I would like to highlight the interesting disk structure of the Alto file system, which is described in the paper An Open Operating System for a Single-User Machine (1979) by Butler Lampson and Robert F. Sproull.

The designers of the system were concerned about data loss caused by hardware issues and buggy software. They came up with a feature to protect the user’s data against loss of any block on the disk. (Unless the user data was in that lost block, of course).

All blocks on the disk have a label that contains these fields:

  • A file identifier
  • A version number
  • A page number (block index)
  • A length
  • A next link (pointer to the next block in the file)
  • A previous link

The label means that each block is self-identifying. By my count this type of disk had an additional 14 bytes of label per each 512 byte block.

If the disk structure is damaged then a special program called the scavenger will iterate over all blocks and recover the structure. Contrast this to a file system like ext4 where a damaged structure can mean that huge swathes of data are rendered inaccessible.

The scavenger was also used to upgrade the disk structure. Imagine converting in-place between ext4, btrfs or ZFS by just running a scavenger for the new disk structure. It would e.g. regard the existing ext4 metadata as nonsense and build a new disk structure based on btrfs by using the labels. (It will not work with these specific file systems, but it demonstrates the principle).

Why don’t we have this today? Nobody wants to build a POSIX file system that uses labels because the disks we’re using have 512-byte or 4096-byte blocks and no room for a label. A label would have a very small overhead for each block, but even a very small overhead like this is not acceptable to file system designers. More seriously, it would mess up mmap(2). There are disks with slightly larger blocks meant to store a label, but they are more expensive and not very common. So we’re left with fragile file system structures and the sage advice that you’re supposed to have backups anyway (for which, see previous section).

The best modern alternative is something like Btrfs or ZFS. Blocks are checksummed and those checksums are stored next to the pointers in the disk structure. A modern version of the Alto file system should certainly have checksums, but not using labels gives weaker protections than what the Alto file system provided. If you want to see how relevant the Alto file system design is today, just read what using ZFS protects against, but then remember that it doesn’t use labels.

Hydra

Hydra is another historical operating system with innovations in the file system. It is described in the paper HYDRA: The Kernel of a Multiprocessor Operating System (1974) by W. Wulf, E. Cohen, W. Corwin, A. Jones, R. Levin, C. Pierson, and F. Pollack at Carnegie-Mellon University.

Hydra does away with the idea of ownership and a hierarchy of ever more privileged components. It uses a fundamentally different concept of protection and security than what POSIX is based on.

“Technologists like hierarchical structures – they are elegant; but experience from the real world shows they are not viable security structures. The problem, then, which HYDRA attempts to face squarely, is to maintain order in a nonhierarchical environment.”

HYDRA: The Kernel of a Multiprocessor Operating System (1974)

Resources that need protection, such as files, are called objects. You can apply an operation to an object, such as reading or writing. But to do so you need a reference to that object, which is called a capability. You can even know a file’s identity, but without the reference you can’t do anything.

This is similar, but not identical, to a safe programming language where you can’t modify an object (such as a vector) if you don’t have a reference to that object and the procedure that performs the operation (e.g. vector-set!). The analogue stretches pretty far actually: in the Hydra design it is even possible for objects (such as files) to become inaccessible and be subject to garbage collection.

Remember that in Multics you could run code dynamically with TRAPs. This was very weak sauce when compared to what you could do in Hydra. In Hydra it was possible to design the normal read, write (etc) operations yourself, and even provide additional operations. A file could be backed by an efficient binary data structure, but present itself as a text file.

The paper gives an example of a researcher who has very specific requirements for data protection. You will need to read the paper to see all the requirements, but basically the researcher wants control over a file, but also wants others to have partial control over the same file. The file structure is delicate and needs to be protected against corruption.

“This abstraction is critical in protecting users from file representations, and I consider any file system which does not support at least this form abstraction to now be hopelessly behind the state of the art.”

– joe in fa.works, July 1981 (message id anews.Aucbvax.2321)

Most people would not even be able to conceive that a file could fulfill the researcher’s requirements. Today the researcher in question would have to use a network service that implements an API to enforce the requirements. As usually happens with these things, that service would probably have some security vulnerability that lets users do more than they were supposed to (e.g. read any file, write to other files, or start a shell). But on Hydra it’s just an object with operations that implement the desired semantics, and it can’t be used to break the protection system.

Hydra’s design is still relevant today. We regularly see problems caused by the so called “confused deputy”, which is when a system was tricked into performing an action that the initiator of the action was not supposed to have the permission to do, but a middle-layer (deputy) did.

The lack of a capability system in POSIX is causing real world damage that affects billions of people. It might be an app on your phone that spies on all your messages, a JavaScript package that sends all your SSH and Bitcoin private keys to a thief, or your browser might be exploited, make itself root by running sudo with your privileges and then install a rootkit. Why can your browser run sudo? Sudo is by design a confused deputy.

Suppose only your SSH client had the operation required to use your SSH keys. Suppose the operation to export those SSH keys was only accessible via your login session when you specifically used your keyboard or mouse to make that export. This would be more secure.

Suppose the file system protected your files against data corruption (e.g. making it impossible to write a bad /etc/fstab)? Sounds pretty good to me.

Post script

Multics enabled a superior paradigm for storage management and backups. Xerox Alto allowed a disk structure to be restored even if all metadata blocks had been damaged. Hydra used a capability system to provide a powerful protection mechanism for data. All of these enabled interesting types of functionality that we are missing in POSIX.

There are many good ideas in historical file systems that have been completely dropped today. I’m building some sort of operating system in Scheme in my Loko Scheme project and would like to explore these ideas there.