-*- coding: utf-8-with-signature; mode: org -*-
Notes on the implementation of fixnums
Copyright (c) 2011 Göran Weinholt <goran@weinholt.se>

WORK IN PROGRESS!

Keywords: scheme, lisp, fixnums, immediate integers, assembler

* Introduction

In Lisp implementations small integers are often implemented as
fixnums. A fixnum fits in a machine register, which means that no
memory has to be allocated for arithmetic performed on fixnums.
Fixnums can represent integers in the interval

  [-2^(w-1), 2^(w-1)-1]

where w is the fixnum-width. The fixnum-width is generally less than
the width of the machine registers. This is done in order to implement
"low-tagging", where the low bits of a pointer indicate the type of
the object it represents. One tag might be reserved for pairs and
another for vectors. The fixnum tag is generally zero because this
makes some operations more efficient.

This document contains some notes on efficient implementation of
fixnum operations. The context is that of R6RS Scheme and some
reasonable assumptions are made with regard to implementation details.
In order to make things more concrete examples are in amd64 assembly,
fixnum-width is assumed to be 61 and the tag is three zero bits. The
assembly syntax is that used by the disassembler in the Industria
Libraries.

It is also assumed that the constants for truth and falsehood fit
inside a 32-bit machine register. This affects the CMOVcc
instructions, which would otherwise have to use 64-bit operands.

The examples given here can be used to implement the general form of
the procedures. Better code can be generated if some procedure
arguments are known.

* Acknowledgements

During the development of the ideas found here I've received quite
some inspiration from a book, Hacker's Delight. Christian Häggström
suggested improvements to the machine code and also pushed me to find
better code sequences.

* Representation of fixnums

An integer is converted to a fixnum by shifting it left by three bits.
The exact integer 1, when stored in a register or memory location, is
represented as 8. This instruction loads (least-fixnum):

  48 B8 0000000000000080         (mov rax #x8000000000000000)

This instruction loads (greatest-fixnum):

  48 B8 F8FFFFFFFFFFFF7F         (mov rax #x7FFFFFFFFFFFFFF8)

And for completeness, this instruction loads (fixnum-width):

  B8 E8010000                    (mov eax #x1E8)

* Type dispatch

To check if the value in RAX is a fixnum is a simple matter:

  A8 07                          (test al #x7)

After this RFLAGS.Z is set if RAX contains a fixnum. Many fixnum
operations, e.g. fx+ and fx*, accept two arguments. To check if two
object are both fixnums is not much more complicated. Assuming that
RDI and RSI contain the objects to be tested, it can be done like
this, using RDX for temporary storage:

  8B C6                          (mov eax esi)
  0B C7                          (or eax edi)
  A8 07                          (test al #x7)

Note that the MOV and OR only use the lower 32 bits of the registers,
and the TEST instruction only uses the lower 8 bits. This is because
only the low bits matter in this case. The instructions have shorter
encodings than their 64-bit equivalents.

The design of the fixnum operators in R6RS is such that fixnum
operators always return fixnums, if they return at all. This means
that a clever compiler can remove a lot of runtime type checking. For
example, the fx+ in this expression does not need to check if its
operands are fixnums: (fx+ (fx- a b) (fx* c d)).

From now on this document will assume that the operands have been
checked and confirmed to be fixnums.

* Comparison predicates

Comparing two fixnums is merely a matter of knowing which condition
code (cc) to use. The comparison is first done with CMP. After this an
instruction like CMOVcc (for a predicate) or Jcc (for a conditional
expression) is used. The instructions SETcc might also be useful.

The following example makes two assumptions. It assumes that RDI is
the first operand and that RSI is the second operand. Furthermore it
assumes that the constants #f and #t have been loaded into RAX and RDX
respectively. This is how fx<? for two arguments can be written:

  48 3B FE                       (cmp rdi rsi)
  0F4C C2                        (cmovl eax edx)

The following table shows which condition code corresponds to which
comparison predicate:

| Comparison predicate | cc |
|----------------------+----|
| fx=?                 | e  |
| fx>?                 | g  |
| fx<?                 | l  |
| fx>=?                | ge |
| fx<=?                | le |

* Numerical predicates

The numerical predicates are straightforward. The following examples
assume that the operand is in RDI and that RAX/RDX have been prepared
just like for the comparison predicates. This is fxzero?:

  48 3B FF                       (cmp rdi rdi)
  0F44 C2                        (cmovz eax edx)

Here's fxpositive?:

  48 85 FF                       (test rdi rdi)
  0F4F C2                        (cmovns eax edx)

This is fxnegative?:

  48 85 FF                       (test rdi rdi)
  0F4C C2                        (cmovs eax edx)

This is fxodd?:

  40 F6 C7 08                    (test dil #x8)
  0F45 C2                        (cmovnz eax edx)

And finally fxeven?:

  40 F6 C7 08                    (test dil #x8)
  0F44 C2                        (cmovz eax edx)

The last two examples use DIL, which accesses the lower 8 bits of the
RDI register.

* Bounds checking vector references

It is beneficial if the length field of vectors is restricted to
fixnums. To perform bounds checking in vector-ref, code with this
semantics is needed:

  (if (not (and (fx>=? k 0) (fx<? k (vector-length vector))))
      (assertion-violation #f "out of bounds" vector k))

This bounds checking can be done with a single branch. Assuming k is
stored in RDI and the vector length is in RSI:

  48 39 F7                       (cmp rdi rsi)
  73 nn                          (jnb out-of-bounds-handler)

The reason this works is because a vector's length can never be
negative. The nb condition ("not below") performs an unsigned
comparison.

* Basic arithmetic

One fundamental difference between fixnum arithmetic and "normal"
signed arithmetic in language like C and Java, is in what happens on
overflow. According to the ISO C99 standard, what happens on overflow
for signed integers is undefined. R6RS Scheme on the other hand
requires that an exception is raised if the result is not
representable as a fixnum.

The operations that are affected by this are: fx+, fx*, fx- (both
unary and binary), fxdiv-and-mod, fxdiv, fxmod, fxdiv0-and-mod0,
fxdiv0, fxmod0, fxarithmetic-shift and fxarithmetic-shift-left.

Fortunately the processor makes the overflow check easy. After
instructions like ADD and SUB the processor updates RFLAGS to indicate
if the result overflowed. This is how fx+ can be implemented, assuming
that the first operand is in RSI and the second in RDI:

  48 01 FE                       (add rsi rdi)
  70 nn                          (jo overflow-handler)

Here's fx- (for two arguments):

  48 2B F7                       (sub rsi rdi)
  70 nn                          (jo overflow-handler)

This is what fx* looks like, assuming again that the first operand is
in RSI and the second in RDI:

  48 C1 FE 03                    (sar rsi #x3)
  48 0FAF F7                     (imul rsi rdi)
  70 nn                          (jo overflow-handler)

The SAR instruction is necessary because (8x)(8y) = 64(xy), but fx*
should compute 8(xy).

If the generic arithmetic procedures (+, *, -, etc) are written in
Scheme, then it might be beneficial to define "fx+/false",
"fx-/false", etc. An implementation of + might begin by quickly
checking if both arguments are fixnums. The code for doing the
addition with an overflow into bignums might then look like this:

  (or (fx+/false a b)
      (+ (fixnum->bignum a) (fixnum->bignum b))))

To implement fx+/false is a simple matter of using CMOVO instead of
JO. Here is what it looks like, assuming that RDX has been loaded with
the constant #f:

  48 03 F7                       (add rsi rdi)
  48 0F40 F2                     (cmovo rsi rdx)

* Arithmetic shifting

The procedures fxarithmetic-shift, fxarithmetic-shift-left and
fxarithmetic-shift-right implement bitwise arithmetic shifting. The
R6RS explains it this way:

  (fxarithmetic-shift fx1 fx2)                            procedure

  The absolute value of fx2 must be less than (fixnum-width). If
      (floor (* fx1 (expt 2 fx2)))
  is a fixnum, then that fixnum is returned. Otherwise an exception
  with condition type &implementation-restriction is raised.

Obviously the example given is only for clarification. There is no
need to use floor, * or expt because the processor has instructions
that implement arithmetic shift.

You're probably already familiar with arithmetic shift if you've
programmed in C. These operators are a bit different from the '<<' and
'>>' operators. The most obvious difference is that they check for
overflow and that they check that the shift amount is valid. Less
obvious is that '>>' for signed negative integers is implementation-
defined.

Shifting to the right always produces a number that is closer to zero,
so there is no way it can overflow. The following example shows how
fxarithmetic-shift-right can be done, assuming the first operand is in
RAX and the second in RCX:

  48 C1 F9 03                    (sar rcx #x3)
  48 D3 F8                       (sar rax cl)
  48 83 E0 F8                    (and rax #xFFFFFFFFFFFFFFF8)
  48 83 F9 3D                    (cmp rcx 61)
  73 nn                          (jnb out-of-bounds-handler)

The first SAR converts RCX from a fixnum and the second one does the
actual arithmetic shift. The AND turns RAX back into a fixnum. The
last two instructions verify that the shift amount is valid. The use
of CL (the lower byte of RCX) is necessary because that is the only
way to get SAR to use a variable shift amount.

Shifting to the left is slightly more tricky, because of the need to
check for overflow in addition to verifying the shift amount. Here's
one convoluted way of doing it:

  48 C1 F9 03                    (sar rcx #x3)
  48 8B F8                       (mov rdi rax)
  48 D3 E0                       (shl rax cl)
  48 8B D0                       (mov rdx rax)
  48 D3 FA                       (sar rdx cl)
  48 3B FA                       (cmp rdi rdx)
  BA nnnnnnnn                    (mov edx the-false-constant)
  48 0F45 C2                     (cmovnz rax rdx)
  48 83 F9 3D                    (cmp rcx 61)
  48 0F43 C2                     (cmovnb rax rdx)

The basic idea comes from Christian Häggström. First the number is
shifted to the left and then it is shifted back to the right. If that
number is not equal to the original number, then some bits were lost
and the result is not representable as a fixnum. The version above is
branchless and will either return the correct result (given that RAX
is a fixnum), or it will return #f. It can easily be made to jump to
an error handler by replacing the CMOVcc instructions.

* The division operations

Division is slightly more complicated than the other arithmetic
operations. The operations fxdiv and fxdiv0 are not directly
implemented by the processor. The IDIV instruction truncates towards
zero, whereas fxdiv performs number-theoretic (Euclidean) division.
The fxdiv and fxmod procedures can be implemented in terms of quotient
and remainder, which can be implemented with IDIV. The fixnum
equivalents of these procedures can be called fxquotient and
fxremainder. Here's how fxremainder can be implemented, given that the
first operand is in RAX and the second in RDI:

  48 99                           (cqo)
  48 F7 FF                        (idiv rdi)

The remainder is placed in RDX. The CQO instruction propagates the
sign bit from RAX into RDX. To implement fxquotient there is one
special case that must be handled: (fxquotient (least-fixnum) -1). The
result would be one greater than (greatest-fixnum), and that number is
not representable as a fixnum. This is probably best handled by an
explicit check in fxdiv. Here's what fxquotient looks like, without
the check and given the same assumptions as above:

  48 99                           (cqo)
  48 F7 FF                        (idiv rdi)
  48 C1 E0 03                     (shl rax #x3)

One must also keep in mind that division by zero causes the processor
to signal a Divide Error exception. Either this can be handled by
catching the signal, or an explicit check must be inserted somewhere
before the IDIV instruction.

One way of implementing fxdiv requires greater precision than fixnums
provide. This can easily be accomplished in assembler by shifting the
operands to the right, or by using bignums. Here's one relationship
between div and quotient:

  (define (div n d)
    (cond ((negative? d)
           (if (negative? n)
               (quotient (+ n d 1) d)
               (quotient n d)))
          (else
           (if (negative? n)
               (quotient (+ (- n d) 1) d)
               (quotient n d)))))

The mod can be calculated with (- n (* d (div n d))). Alternatively
this relationship can be used:

  (define (mod n d)
    (let ((r (remainder n d)))
      (cond ((negative? d)
             (if (negative? r)
                 (- r d)
                 r))
            (else
             (if (negative? r)
                 (+ r d)
                 r)))))

There's an alternative way of calculating fxdiv-and-mod, requiring no
more precision than that provided by fixnums. Let (q,r) be the
quotient and remainder of n/d:

 fxdiv-and-mod n d = error        d = 0 or (n,d)=(-2^(w-1),-1)
 fxdiv-and-mod n d = (q,r)        r >= 0
 fxdiv-and-mod n d = (q+1,r-d)    d < 0
 fxdiv-and-mod n d = (q-1,r+d)    otherwise

This also has the advantage that it only requires a single IDIV.

* The fxlength procedure

The fxlength procedure finds the length of a fixnum. To quote R6RS:

   Returns the number of bits needed to represent fx if it is
   positive, and the number of bits needed to represent (fxnot fx) if
   it is negative, [...].

The document then gives an implementation of fxlength using a loop
written in Scheme. A more efficient implementation is possible.
The following example assumes that RAX contains the operand:

  33 F6                          (xor esi esi)
  48 C1 F8 02                    (sar rax #x2)
  48 0FBD C0                     (bsr rax rax)
  0F44 C6                        (cmovz eax esi)
  C1 E0 03                       (shl eax #x3)

Note that this only handles non-negative operands. The main ingredient
is the BSR (Bit Scan Reverse) instruction. The CMOVZ instruction is
needed because BSR is undefined for zero. There are several ways of
handling negative numbers, here's how to do it with CMOVL:

  33 F6                          (xor esi esi)
  48 C1 F8 02                    (sar rax #x2)
  48 8B F8                       (mov rdi rax)
  48 F7 D0                       (not rax)
  48 0F4C F8                     (cmovl rdi rax)
  48 0FBD C7                     (bsr rax rdi)
  0F44 C6                        (cmovz eax esi)
  C1 E0 03                       (shl eax #x3)

If you're using GCC then the __builtin_clz function provides the same
functionality as BSR.

** Decimal version of fxlength

In Hacker's Delight, on page 219, there's a function that computes the
integer logarithm base 10. Here's a definition for R6RS Scheme. It
assumes the number is positive:

  (define (fxilog10 n)
    (define table1 '#(0  0  0  0  1  1  1  2  2  2
                      3  3  3  3  4  4  4  5  5  5
                      6  6  6  6  7  7  7  8  8  8
                      9  9  9  9  10 10 10 11 11 11
                      12 12 12 12 13 13 13 14 14 14
                      15 15 15 15 16 16 16 17 17 17
                      18 18 18 18 19 19 19))
    (define table2 '#(1 10 100 1000 10000 100000 1000000 10000000
                      100000000 1000000000 10000000000 100000000000
                      1000000000000 10000000000000 100000000000000
                      1000000000000000 10000000000000000
                      100000000000000000 1000000000000000000))
    (let ((y (vector-ref table1 (fxlength n))))
      (if (fx<? n (vector-ref table2 y))
          (fx- y 1)
          y)))

This procedure can be used to find how many decimal digits are needed
to represent a fixnum. Replace the first entry with 1 and add 1 to the
result.

* The fxfirst-bit-set procedure

The fxfirst-bit-set procedure returns the index of the least
significant one-bit. Another way of phrasing it is that the procedure
counts the number of zeros at the rightmost part of an integer. This
is similar to fxlength and can be implemented with the BSF
instruction. This examples assumes the operand is in RAX:

  BE 02000000                    (mov esi #x2)
  48 0FBC C0                     (bsf rax rax)
  0F44 C6                        (cmovz eax esi)
  48 8D 04C5 E8FFFFFF            (lea rax (mem+ #x-18 (* rax #x8)))

The CMOVZ instruction is there because (fxfirst-bit-set 0) = -1. The
LEA instruction adjusts the index to account for the fixnum tag and
also turns RAX into a fixnum.

* Reversing bit fields

The fxreverse-bit-field procedure is probably best implemented in C or
assembler using one of the methods from Hacker's Delight. The method
can also be seen here: http://aggregate.org/MAGIC/#Bit%20Reversal

It's possible to use this method in Scheme code. I've written an
fxreverse-bit-field procedure that tries to use the above method if
the fixnum width is either 61 or 30. If it's neither of those it falls
back on a loop. The code is here:

http://weinholt.se/hacks/scheme/fxreverse-bit-field.scm

* Additional reading

Great book full of bit twiddling hacks: http://www.hackersdelight.org/

The specification of fixnums in R6RS:
http://www.r6rs.org/final/html/r6rs-lib/r6rs-lib-Z-H-12.html#node_sec_11.2

* To be written...
fxmax, fxmin, fxbit-count, fxdiv0-and-mod0