I’ve been working on optimizations in Loko Scheme recently and have implemented large parts of A Sufficiently Smart Compiler for Procedural Records (Keep & Dybvig, 2012). At the same time I have improved the representation of record type descriptors and wanted to share a simple trick I used to improve record type checks for non-sealed records. But first I should explain what a record is in Scheme.

Background: Records in Scheme

Scheme supports record types, which are user-defined data types. R⁷RS Scheme has a syntax-based variant of this feature, based on SRFI-9. Here’s an example:

(import (scheme base))

(define-record-type point
  (make-point x y)
  point?
  (x point-x point-x-set!)
  (y point-y point-y-set))

This will let you write (make-point 0.0 0.0) to get a point at (0.0, 0.0), and (point-x p) to access the x field of p. That’s it for records in R⁷RS and SRFI-9.

The record type system in R⁶RS Scheme improves on SRFI-9 in several ways. In R⁶RS you would instead write this:

(import (rnrs))

(define-record-type point
  (fields (mutable x)
          (mutable y)))

There is no longer any need to explicitly write out the names of the constructor, predicate, accessors and mutators (unless you want to). Additionally, you can extend a record type, you can customize the constructor, and you can control what happens if define-record-type runs multiple times, i.e. if it makes a new type each time or not.

A syntactical record layer can be abstraction on top of a procedural layer. What you can do with syntax, you can also do with procedure calls at runtime. R⁶RS standardizes this layer as well. It also standardizes a record inspection layer that lets you get the record type descriptor (RTD) from a record at runtime (unless it’s marked as opaque) and also to inspect all aspects of RTDs. In fact, RTDs are objects in their own right, just like pairs and symbols.

The above record type definition might expand to this code that uses the procedural layer (a real expansion would use fresh identifiers for the RTD and RCD):

(define point-rtd
  (make-record-type-descriptor
    'point #f #f #f #f
    '#((mutable x) (mutable y))))
(define point-rcd
  (make-record-constructor-descriptor
    point-rtd #f #f))

(define point? (record-predicate point-rtd))
(define make-point (record-constructor point-rcd))
(define point-x (record-accessor point-rtd 0))
(define point-y (record-accessor point-rtd 1))
(define point-x-set! (record-mutator point-rtd 0))
(define point-y-set! (record-mutator point-rtd 1))

This code uses the helpers record-predicate, record-constructor, record-accessor and record-mutator to create procedures. An intermediate record constructor descriptor contains the information needed to make the constructor.

Next we will have a look at records in memory, how the code above is optimized, and finally a trick to speed up record type checks.

Record Type Representation

Loko Scheme has a straightforward representation of records. Using the above point type as an example, here are the records returned by (make-point 1.0 2.0) and (make-point 1.5 -2.0) as they are structured in memory. The rows are 64-bit words and the arrows are tagged pointers. Memory allocations are aligned to 16 bytes, and the empty space created by alignment is also shown.

The information stored in the record is a pointer to the record type descriptor, which is reused for each record of the same type, followed by a slot for each field in the record.

The information in the record type descriptor is used by the record inspection procedures and the garbage collector. The slots contain: a type tag for the rtd itself, the size of the records, an optional parent type, an optional record unique identifier, the field names, field mutability (a bit-field), and an optional record writer procedure.

Loko uses the first slot in the RTD to store the length of the RTD and these flags: opaque?, sealed?, generative?.

Other R⁶RS implementations will have similar representations of RTDs because all this information is needed at runtime.

Single Inheritance

R⁶RS supports single inheritance for record types. Instead of demonstrating this with some contrived geometric shapes or balloon animals, let’s use an example from working code. The following record types are simplified variants of records used in Loko’s PCI library.

(define-record-type pcibar
  (fields reg base size))

(define-record-type pcibar-i/o
  (parent pcibar))

(define-record-type pcibar-mem
  (parent pcibar)
  (fields type prefetchable?))

PCI base address registers (BARs) all have these fields: reg, base, and size. If they are in I/O space then that’s all, but BARs in memory space have two additional fields: type and prefetchable?.

Notice that both pcibar-i/o and pcibar-mem point to pcibar as their parent. The size field is larger in pcibar-mem to account for the extra fields. The extra fields in the pcibar-mem record are placed immediately after the fields that belong to pcibar, so accessors for pcibar don’t need to recompute the slot numbers when passed a pcibar-mem.

Predicates for Non-Sealed Types

R⁶RS lets you say that a record type is sealed. This prevents a record type from being extended. As a consequence, type checks are more efficient. Why is that?

The record predicate pcibar? is given an object and returns true if the object has that record type, and false otherwise. If the implementation uses tagged pointers then the predicate first checks the tag. Next, it reads the type field of the object and compares it to the pcibar type.

But if types are not sealed then they can be extended, and it’s possible that the type that the predicate is checking for was used as a parent. The pcibar? predicate should return true even for a pcibar-mem record.

The Trick

Previously Loko Scheme’s record-predicate procedure worked as follows. It checked the RTD to see if it’s sealed. For sealed RTDs it returned a procedure that implemented the fast check described above.

For non-sealed RTDs another procedure was returned that did that check, and additionally looped over all parent RTDs to see if any of them was the desired RTD:

(define (record-predicate rtd)
  (if (record-type-sealed? rtd)
      (lambda (obj)     ;fast path
        (and
          ($box? obj)
          (eq? ($box-type obj) rtd)))
      (lambda (obj)     ;slow path
        (and
          ($box? obj)
          (let ((t ($box-type obj)))
            (or
              (eq? t rtd)
              (and
                (record-type-descriptor? t)
                (let lp ((t (record-type-parent t)))
                  (cond
                    ((eq? t rtd) #t)
                    ((not t) #f)
                    (else
                     (lp (record-type-parent t))))))))))))

I haven’t checked around, but I suspect that most R⁶RS implementations do something similar. Even when I checked Chez Scheme’s assembly output I saw a loop that’s morally equivalent to this one. This loop also shows up in accessors and mutators, because they need to know that the object they’ve been passed has the right type.

The trick: this loop can be avoided by extending the RTD representation so that each RTD directly contains pointers to all parent RTDs. The pointers are laid out so that the base type is placed first, followed by the sub-types in order. An RTD will then appear at a fixed location in any RTD that extends it.

I’m sure that there are other language implementations where this problem of sub-typing shows up and someone else has come up with just this optimization, because it’s kind of obvious.

Suppose that we have a base record type and some record types that extend each other. For simplicity, I will not give them any fields.

(define-record-type A)

(define-record-type B (parent A))
(define-record-type C (parent B))

(define-record-type S (parent A))
(define-record-type T (parent S))

The expression (A? (make-A)) evaluates to true, which also A? does for all types shown. But (B? (make-T)) evaluates to false because T does not have B anywhere in its chain of parents. That’s what the loop would be checking.

This picture shows the memory layout when the trick is used on these RTDs.

The pointer to the A RTD is always in the 0: slot for any RTD that extends it. Similarly, the predicate for B knows to always check in the 1: slot. A bounds check on the RTD is also needed in this layout.

Taking it further

Further improvements on this layout are possible. Loko Scheme always allocates memory for four parent RTDs. If an RTD will appear at slots 0 to 3, then the predicate does not need to do bounds checking on the RTD. The parent: slot is not strictly needed and can be removed.

Specially just for Loko Scheme, the predicates hidden inside accessors and mutators use slightly less code than the predicate procedures. These hidden predicates do not explicitly verify the tags on the pointers, instead leaving it up to the processor’s built-in alignment checking to trap invalid references.

The type checks are even faster if specialized predicates, accessors, and mutators can be generated at compile time. If the RTD is known at compile time then the slot that contains the RTD is also known and can be inlined.

Sufficiently Smart

A sufficiently smart compiler is a legendary compiler that does your favorite optimizations so that your favorite language feature becomes very efficient.

In A Sufficiently Smart Compiler for Procedural Records, Andy Keep and Kent Dybvig present their work on optimizing the R⁶RS procedural record system in Chez Scheme. It builds on top of the source-level optimizer cp0. Loko Scheme has its own implementation of cp0, so adapting their work has been pretty simple.

The basic idea is to have cp0 generate static or partially static RTDs, which are then propagated throughout the program using cp0’s existing mechanisms. If cp0 succeeds in propagating the RTDs to where record-accessor (etc) are called, then it can also generate code specialized to each record type.

I’ve implemented large parts of the ideas in the paper in Loko Scheme, together with the improved record type representation.

Post-script

Compiling Loko Scheme with Loko Scheme is now almost as fast as compiling it with Chez Scheme, if garbage collection time is not counted (run the compilation with something like LOKO_HEAP=28000 if you have enough RAM). I’m not sure it’s an apples-to-apples comparison though, because when Chez is used, it also has to load and compile Loko’s compiler, whereas Loko cheats by already having it loaded. But still, Loko’s performance is improving. I’m interested in seeing how well the next release will fare in the Scheme Benchmarks.