Internals of Zabavno the x86 emulator

Zabavno (Забавно) is an x86 emulator I’ve been working on in my spare time. It translates x86 instructions into Scheme and eval’s them, which works surprisingly well.

The initial commit was made two years ago, but at the time I only worked on it for a few weeks. When Chez Scheme was open sourced I got interested in it again, since the techniques used depend on having a good compiler. (And Chez Scheme is a really good compiler).

Here’s a look at the internals of the emulator. The rest of this article gets very technical and assumes that the reader knows something about CPUs. The core of the CPU emulation happens in this procedure:

(define (%run-until-abort M debug instruction-limit
                          fl ip AX CX DX BX SP BP SI DI
                          cs ds ss es fs gs)
  (call/cc
    (lambda (abort)
      (let loop ((ip ip) (fl fl) (AX AX) (CX CX) (DX DX) (BX BX)
                 (SP SP) (BP BP) (SI SI) (DI DI)
                 (cs cs) (ds ds) (ss ss) (es es) (fs fs) (gs gs))
        ;; Translate instruction(s) or get an existing translation.
        (let ((trans (translate cs ip debug instruction-limit)))
          ;; Call the translation and get new values for the
          ;; registers. The translation may choose to abort in the
          ;; middle of a translation.
          (let-values (((ip^ fl^ AX^ CX^ DX^ BX^ SP^ BP^ SI^ DI^
                             cs^ ds^ ss^ es^ fs^ gs^)
                        (trans abort fl AX CX DX BX SP BP SI DI
                               cs ds ss es fs gs)))
            (loop ip^ fl^ AX^ CX^ DX^ BX^ SP^ BP^ SI^ DI^
                  cs^ ds^ ss^ es^ fs^ gs^)))))))

The translate procedure translates a block of machine instructions from memory at the address pointed to by cs:ip. The translated code is a procedure that takes the registers as arguments and returns new values for the registers. This is then done in a loop over and over, with the result the program in the emulated machine runs. Translations are cached, so if the emulator sees the same code address again it can just use the previous translation. This makes things a lot faster.

One might think that translation caching would violate the processor semantics in some way. Actually, it’s the opposite. Real processors have a buffer that tries to stay ahead of the actual code execution. One way to detect that code is running under a debugger is to modify an instruction that should have been pre-fetched by the processor. If a debugger is single-stepping then the modified code will be used, otherwise the original pre-fetched code is used. We can mimic the real processor’s semantics by putting multiple instructions into the same translated block. One just needs to be careful to invalidate the cache when writing to memory, but there is no need to interrupt due to writes in the middle of a translated block.

There are other benefits to putting multiple instructions in the same translated block. A lot of translated code will compute some intermediate result that is never actually used. This seems counter-intuitive since an optimizing compiler should have removed all such computations. However, the x86 architecture has a flags register that is updated automatically in a very inconsistent manner. This is how it looked in Intel’s 80386 programmer’s reference manual (1986):

Picture of the flags register, because sadly Firefox is unable to correctly render box characters

The lower bits of the flags register holds information about arithmetic results: overflow, sign, zero and carry. Then there is the more interesting parity flag that contains the even parity of the lowest byte of the result, and the even more interesting auxiliary carry that signals carry/borrow from the lower four bits of the result (used for bcd). The processor can compute these flags as a side-effect of its normal operation, but the emulator has no such luxury. This is a lot of overhead for each and every arithmetic instruction.

Zabavno emits code to update the flags, but it’s done in a clever way so that in the majority of cases the code is never used. As was mentioned earlier, a translation block can contain multiple instructions. When one arithmetic instruction is followed by another one, the second tends to overwrite the flags. In that case it’s completely unnecessary to do the first flag update.

The bookkeeping for tracking which flags should be used and which should be discarded is generally a bit tricky, since a lot of instructions use the flags as input, and some instructions update only a few of them, and a lot of the time the flags are left undefined. Zabavno outsources almost all bookkeeping to the host Scheme’s optimizer. The code generator was written with cp0 (from Oscar Waddell’s PhD thesis) in mind. This optimizer is available in Chez Scheme and a few other ones.

Let’s look at the translation of a small instruction sequence. This code updates the eax register twice and computes flags twice:

example:
    sub eax,edx
    add eax,ecx

As previously mentioned, the code generator in Zabavno emits code that returns new values for the registers. This makes it easy to wrap each translated instruction in a lambda and call it with the registers returned by the previous instruction. It also happens that cp0 is very good at optimizing this kind of code. The arithmetic flags are also wrapped in lambdas and are called when they are needed. They are never used as first-class procedures, and they are quite small, so cp0 inlines them every time and no memory allocations are needed. Here is the initial (huge, nightmarish) translation of the example program:

(lambda (abort fl AX CX DX BX SP BP SI DI cs ds ss es fs gs)
  (define RAM
    (case-lambda
      [(addr size)
       (case size
         [(8) (memory-u8-ref addr)]
         [(16) (memory-u16-ref addr)]
         [(32) (memory-u32-ref addr)])]
      [(addr size value)
       (case size
         [(8) (memory-u8-set! addr value)]
         [(16) (memory-u16-set! addr value)]
         [(32) (memory-u32-set! addr value)])]))
  (define I/O
    (case-lambda
      [(addr size) (port-read addr size)]
      [(addr size value) (port-write addr size value)]))
  (let ([fl (lambda () fl)]
        [fl-OF (lambda () (fxand fl 2048))]
        [fl-SF (lambda () (fxand fl 128))]
        [fl-ZF (lambda () (fxand fl 64))]
        [fl-AF (lambda () (fxand fl 16))]
        [fl-PF (lambda () (fxand fl 4))]
        [fl-CF (lambda () (fxand fl 1))])
    ((lambda (fl AX CX DX BX SP BP SI DI cs ds ss es fs gs)
       (let* ([t0 AX]
              [t1 DX]
              [tmp (fx- t0 t1)]
              [result (fxand tmp 4294967295)]
              [fl-OF (lambda ()
                       (if (fxbit-set?
                            (fxand (fxxor t0 t1) (fxxor t0 result))
                            31)
                           2048
                           0))]
              [fl-SF (lambda ()
                       (if (fxbit-set? result 31) 128 0))]
              [fl-ZF (lambda ()
                       (if (eqv? result 0) 64 0))]
              [fl-AF (lambda ()
                       (if (fxbit-set? (fx- (fxand t0 15)
                                            (fxand t1 15))
                                       4)
                           16
                           0))]
              [fl-PF (lambda ()
                       (if (vector-ref
                            byte-parity-table
                            (fxand result 255))
                           4
                           0))]
              [fl-CF (lambda () (if (fxbit-set? tmp 32) 1 0))]
              [AX result])
         ((lambda (fl AX CX DX BX SP BP SI DI cs ds ss es fs gs)
            (let* ([t0 AX]
                   [t1 CX]
                   [tmp (fx+ t0 t1)]
                   [result (fxand tmp 4294967295)]
                   [fl-OF (lambda ()
                            (if (fxbit-set?
                                 (fxand (fxxor t0 result)
                                        (fxxor t1 result))
                                 31)
                                2048
                                0))]
                   [fl-SF (lambda ()
                            (if (fxbit-set? result 31) 128 0))]
                   [fl-ZF (lambda ()
                            (if (eqv? result 0) 64 0))]
                   [fl-AF (lambda ()
                            (if (fxbit-set?
                                 (fx+ (fxand t0 15) (fxand t1 15))
                                 4)
                                16 0))]
                   [fl-PF (lambda ()
                            (if (vector-ref byte-parity-table
                                            (fxand result 255))
                                4
                                0))]
                   [fl-CF (lambda () (if (fxbit-set? tmp 32) 1 0))]
                   [AX result])
              (let* ([fl (lambda ()
                           (fxior (fxand (fl) -2262) (fl-OF) (fl-SF)
                                  (fl-ZF) (fl-PF)
                                  (fxior (fl-AF) (fl-CF))))])
                (values 262 (fl) AX CX DX BX SP BP SI DI
                        cs ds ss es fs gs))))
          fl AX CX DX BX SP BP SI DI
          cs ds ss es fs gs)))
     fl AX CX DX BX SP BP SI DI
     cs ds ss es fs gs)))

There is some initial setup code for accessing memory and the i/o bus. Then the initial values of the flags are wrapped in lambdas. The instructions themselves correspond to the let* expressions. The innermost let* computes the flags and then returns all registers. Pretty terrible with a lot of lambdas, so the GC will be invoked very frequently if this is what the emulator uses. But this is the code after cp0 has optimized it:

(lambda (abort fl AX CX DX BX SP BP SI DI cs ds ss es fs gs)
  (let ([result (fxand #xffffffff (fx- AX DX))])
    (let ([tmp (fx+ result CX)])
      (let ([result (fxand #xffffffff tmp)])
        (values 262
          (fxior (fxand -2262 fl)
                 (if (fxbit-set? (fxand (fxxor result result)
                                        (fxxor CX result)) 31)
                2048 0)
            (if (fxbit-set? result 31) 128 0)
            (if (eqv? result 0) 64 0)
            (if (vector-ref byte-parity-table (fxand #xff result))
                4 0)
            (fxior (if (fxbit-set? (fx+ (fxand 15 result)
                                        (fxand 15 CX))
                                   4)
                       16 0)
                   (if (fxbit-set? tmp 32)
                       1 0)))
          result CX DX BX SP BP SI DI cs ds ss es fs gs)))))

None of the remaining code needs to allocate memory (in fact only the vector-ref call needs to read from memory). There is still some room for improvement, but the flags are only computed once. If there was an instruction that did need a flag from sub then cp0 would see that, and that flag would be computed. In general the flags tend to be computed only once per block.

This is how the core of the CPU emulation works. Now that you’ve seen it, why don’t you give it a try? Zabavno is available from github:

git clone https://github.com/weinholt/zabavno/