Non-POSIX file systems

Written by Gwen Weinholt on 2020-08-30

Operating systems and file systems have traditionally been developed hand in hand. They impose mutual constraints on each other. Today we have two major leaders in file system semantics: Windows and POSIX. They are very close to each other when compared to the full set of possibilities. Interesting things happened before POSIX monopolized file system semantics.

When you use a file system through a library instead of going through the operating system there are some extra possibilities. You are no longer required to obey the host operating system’s semantics for filenames. You get to decide if you use / or \ to separate directory components (or something else altogether). Maybe you don’t even use strings for filenames. The fs-fatfs library uses a list of strings, so it’s up to the caller to define a directory separator for themselves. While working on that library, I was driven to write down some ideas that I’ve previously run across and found inspirational.

Multics

This part was based on a paper that was written before Multics was implemented. See the bottom of this page for corrections.

The very first hierarchical file system was developed for Multics. It is described in the paper A General-Purpose File System For Secondary Storage (1965) by Robert C. Daley and Peter G. Neumann. There are several things that I find astounding about this paper:

  • There were apparently no hierarchical file systems before Multics. The references do no cite any previous work on this and I haven’t found any.
  • Privacy in computing was a big consideration even back then.
  • Modern hierarchical file system are essentially unchanged from 1965, but some features disappeared along the way.
  • There’s a very interesting backup system described in this paper.

The paper describes these concepts that we still use today:

  • Files.
  • Directories.
  • Root directory.
  • Working directory.
  • Tree name, i.e. the name of a file or directory in the tree structure. Akin to realpath(3).
  • Path name, which is like a tree name, but it is conceptually different because components can be links. The paper used ":" instead of "/" to separate directories.
  • Relative paths. They used an initial ":" in place of "./" (or empty) and "*" in place of "..". Instead of "cd .." you used "CHANGEDIRECTORY :*".
  • Symbolic links.
  • Access control using access control lists (showed up relatively recently in POSIX).
  • Read, execute and write modes on files and directories. Execute on a directory means permission to search it, just as in POSIX.
  • Segments and segment numbers (file descriptions and file descriptors, respectively, if I’m not mistaken).

It is said that Unix took inspiration from Multics and simplified things. What was lost? Whoever implemented the file system for Unix stopped reading the paper somewhere around subsection 2.3. There are at least these features we’re missing:

  • Access control lists can define a TRAP for a user. Users can refuse to use TRAPs and instead get an error. Otherwise TRAPs call a named function at the time of the access. They can be used to implement smart files that can e.g. monitor file usage or apply password-based file locks. (Erratum: TRAPs were actually never implemented.)

    (Someone will undoubtedly insert eBPF into the Linux VFS soon and we will have invented TRAPs again).

  • Multics fully separated the append mode and the write mode and made it available to users. You could actually have files and directories that were writable but not appendable, or appendable but not writable (or both).

    (Append-only files exist on Linux through file attributes settable by the super user or processes with CAP_LINUX_IMMUTABLE. This is very weak sauce. Immutable directories exist, but not append-only directories. And of course there is no way to have a writable file or directory that can’t grow).

  • Directory entries that point to secondary storage. This is a game changer for file system management. More on this below.

Multiple-level storage and backups

Files in the Multics file system could have directory entries that point to a secondary storage medium. At the time the secondary media were tapes, but today they could be mechanical hard drives, USB drives or network servers. But what is the point of having a file that isn’t there? It’s really quite clever and it only gets more clever the more you look at it.

When a Multics system was running out of disk space it could remove unused files from the primary medium, but keep the directory entries. If you then tried to open such a file it would ask you to wait while the file is being made available. It would issue instructions to the operator (perhaps later a tape robot) to have them retrieve the required tapes for restoring the file. Very useful when disks were small and tapes were plentiful.

But it gets even better when you consider backups. Today if you need to restore a Linux system from backups you do it this way: find the oldest full backup and restore it, then restore each incremental backup, going from oldest to newest. This means that the system can’t be used until you’ve completely restored it from the backups. With today’s disk sizes that could take a very, very long time.

Multics backups are done differently. To restore a completely hosed system you first restore the system files, which will allow you to boot. Then you restore the latest incremental backup. After this the file system will contain the most current directory entries. Just restore one incremental backup tape and all the files are already in their right places. This is much faster than waiting for everything to be restored!

Why is only the latest incremental backup required to get everything back in its place and working? The clever part is that the files might not yet be on the disk. In fact, most files will probably be on another backup medium. But the most recently used files have been restored, so you can most likely do useful work already, and all other files have their directory entries. If you try to open a file that hasn’t been restored then you’re asked to wait and the operator is told which tape to put in to get the file back.

Is something like this offered on any POSIX-compatible file system?

Xerox Alto

The Xerox Alto was also a system of many firsts. It is most famous for having the first windowing GUI, which inspired Apple’s design, which then inspired Microsoft’s design. Of course it had a hierarchical file system with version numbers, and even a network file system.

I would like to highlight the interesting disk structure of the Alto file system, which is described in the paper An Open Operating System for a Single-User Machine (1979) by Butler Lampson and Robert F. Sproull.

The designers of the system were concerned about data loss caused by hardware issues and buggy software. They came up with a feature to protect the user’s data against loss of any block on the disk. (Unless the user data was in that lost block, of course).

All blocks on the disk have a label that contains these fields:

  • A file identifier
  • A version number
  • A page number (block index)
  • A length
  • A next link (pointer to the next block in the file)
  • A previous link

The label means that each block is self-identifying. By my count this type of disk had an additional 14 bytes of label per each 512 byte block.

If the disk structure is damaged then a special program called the scavenger will iterate over all blocks and recover the structure. Contrast this to a file system like ext4 where a damaged structure can mean that huge swathes of data are rendered inaccessible.

The scavenger was also used to upgrade the disk structure. Imagine converting in-place between ext4, btrfs or ZFS by just running a scavenger for the new disk structure. It would e.g. regard the existing ext4 metadata as nonsense and build a new disk structure based on btrfs by using the labels. (It will not work with these specific file systems, but it demonstrates the principle).

Why don’t we have this today? Nobody wants to build a POSIX file system that uses labels because the disks we’re using have 512-byte or 4096-byte blocks and no room for a label. A label would have a very small overhead for each block, but even a very small overhead like this is not acceptable to file system designers. More seriously, it would mess up mmap(2). There are disks with slightly larger blocks meant to store a label, but they are more expensive and not very common. So we’re left with fragile file system structures and the sage advice that you’re supposed to have backups anyway (for which, see previous section).

The best modern alternative is something like Btrfs or ZFS. Blocks are checksummed and those checksums are stored next to the pointers in the disk structure. A modern version of the Alto file system should certainly have checksums, but not using labels gives weaker protections than what the Alto file system provided. If you want to see how relevant the Alto file system design is today, just read what using ZFS protects against, but then remember that it doesn’t use labels.

Hydra

Hydra is another historical operating system with innovations in the file system. It is described in the paper HYDRA: The Kernel of a Multiprocessor Operating System (1974) by W. Wulf, E. Cohen, W. Corwin, A. Jones, R. Levin, C. Pierson, and F. Pollack at Carnegie-Mellon University.

Hydra does away with the idea of ownership and a hierarchy of ever more privileged components. It uses a fundamentally different concept of protection and security than what POSIX is based on.

“Technologists like hierarchical structures – they are elegant; but experience from the real world shows they are not viable security structures. The problem, then, which HYDRA attempts to face squarely, is to maintain order in a nonhierarchical environment.”

HYDRA: The Kernel of a Multiprocessor Operating System (1974)

Resources that need protection, such as files, are called objects. You can apply an operation to an object, such as reading or writing. But to do so you need a reference to that object, which is called a capability. You can even know a file’s identity, but without the reference you can’t do anything.

This is similar, but not identical, to a safe programming language where you can’t modify an object (such as a vector) if you don’t have a reference to that object and the procedure that performs the operation (e.g. vector-set!). The analogue stretches pretty far actually: in the Hydra design it is even possible for objects (such as files) to become inaccessible and be subject to garbage collection.

Remember that in Multics you could run code dynamically with TRAPs. This was very weak sauce when compared to what you could do in Hydra. In Hydra it was possible to design the normal read, write (etc) operations yourself, and even provide additional operations. A file could be backed by an efficient binary data structure, but present itself as a text file.

The paper gives an example of a researcher who has very specific requirements for data protection. You will need to read the paper to see all the requirements, but basically the researcher wants control over a file, but also wants others to have partial control over the same file. The file structure is delicate and needs to be protected against corruption.

“This abstraction is critical in protecting users from file representations, and I consider any file system which does not support at least this form abstraction to now be hopelessly behind the state of the art.”

– joe in fa.works, July 1981 (message id anews.Aucbvax.2321)

Most people would not even be able to conceive that a file could fulfill the researcher’s requirements. Today the researcher in question would have to use a network service that implements an API to enforce the requirements. As usually happens with these things, that service would probably have some security vulnerability that lets users do more than they were supposed to (e.g. read any file, write to other files, or start a shell). But on Hydra it’s just an object with operations that implement the desired semantics, and it can’t be used to break the protection system.

Hydra’s design is still relevant today. We regularly see problems caused by the so called “confused deputy”, which is when a system was tricked into performing an action that the initiator of the action was not supposed to have the permission to do, but a middle-layer (deputy) did.

The lack of a capability system in POSIX is causing real world damage that affects billions of people. It might be an app on your phone that spies on all your messages, a JavaScript package that sends all your SSH and Bitcoin private keys to a thief, or your browser might be exploited, make itself root by running sudo with your privileges and then install a rootkit. Why can your browser run sudo? Sudo is by design a confused deputy.

Suppose only your SSH client had the operation required to use your SSH keys. Suppose the operation to export those SSH keys was only accessible via your login session when you specifically used your keyboard or mouse to make that export. This would be more secure.

Suppose the file system protected your files against data corruption (e.g. making it impossible to write a bad /etc/fstab)? Sounds pretty good to me.

Post script

Multics enabled a superior paradigm for storage management and backups. Xerox Alto allowed a disk structure to be restored even if all metadata blocks had been damaged. Hydra used a capability system to provide a powerful protection mechanism for data. All of these enabled interesting types of functionality that we are missing in POSIX.

There are many good ideas in historical file systems that have been completely dropped today. I’m building some sort of operating system in Scheme in my Loko Scheme project and would like to explore these ideas there.

Corrections (July 2021)

I wrote this article in August 2020 without fully researching the history of Multics and what actually became of the ideas in practice. As it turns out, many things changed when it was implemented.

My intention in writing this article was that it would open up some minds to how file systems can be more than they currently are. To the extent that this may have worked, I think it was good. But it’s not good that some myths about Multics have spread this way.

The world is generally a disappointment, but don’t let that stop you from looking for something better.

I’m very grateful to Tom Van Vleck for writing me to correct this article. All mistakes in the article are completely on me and can’t be blamed on him in any way. Below is his mail to me (republished with permission).

From: Tom Van Vleck <…@multicians.org>
Subject: Multics file system
To: Göran Weinholt <…@weinholt.se>

Hello Göran, I ran across your "Non-POSIX file systems" web page.

It has a few misstatements -- I am sure this is my fault for not 
presenting information clearly enough on multicians.org.

The Daley/Neumann paper on the site was written before the file
system was programmed.  Some design decisions were made as Multics
was developed that differed from the paper.  
- You already mention that the Trap attribute was not implemented.
- We changed the path component separator from ":" to ">".  
  Ken and Dennis changed that to "/" (possibly because that character
  did not require a shift key on the TTY37 terminals they used.)
  In Multics, the "<" worked to go up a level in the path hierarchy.
- in Multis as implemented, path names with no leading > or < were
  interpreted by the command processor (shell) as relative to the current 
  working directory.  File system API calls only accepted absolute
  pathnames.
- The paper's CHANGEDIRECTORY command was implemented as "change_wdir" (cwd).
- Segments and segment numbers had nothing to do with file descriptors.
  Segmentation was a memory addressing mechanism, completely integrated
  with the Multics virtual memory.  See https://multicians.org/daley-dennis.html
  and https://multicians.org/multics-vm.html for more info.  Multics 
  also provided a more traditional API to files and devices, similar
  to Unix streams, that mapped into the virtual memory.  Search the
  multicians.org site for "vfile_" if interested.
- Interpretation of the read,execute,write,append bits in file (segment)
  access control varied somewhat from the paper.  Append-only files were
  not supported by the hardware (remember that the Daley/Neumann paper was
  written before the GE645 was complete). We did not use execute-only files 
  because the compilers generated code that assumed constants were in the
  procedure segment.  The "append" bit was really only useful for 
  directories, where it was interpreted by the file system.
- Directory entries that pointed to secondary storage: Multics didn't do what
  you describe.  There was a mechanism, very like Unix mount points, for
  "master directories" whose subtree was mounted on a removable disk pack.
- The "removing unusued files contents but keeping the directory entry"
  feature you describe was never implemented.  Therefore, the "reload the
  latest incremental dump" recovery strategy you describe never happened.
  See https://multicians.org/failure-recovery.html for what Multics actually
  implemented.  The backup and recovery strategy changed over the years:
  by the 1980s, recovery from a catastrophic crash would first reload 
  volume dumps for the Hardcore Partition, which would restore directories,
  and then reload incremental dumps and other volume dumps.