SD-8516 ISA Profile

Writing assembly language programs on the SD-8516 is similar to writing them on the 8510, with some differences. For one, programs are entered in much the same way.

ISA Profile Chart

The following program illustrates a baseline:

.address $010100

    LDA $1010
    LDB $0101

    LDCD $0FFFFFFF          ; Load $0FFFFFFF (268,435,455) into CD; $989680 would be 10 million

loop:
    DEC CD                ; Decrement CD
    JNZ @loop              ; Jump to loop if CD != 0

    HALT                  ; Halt when done

This program executes at a certain speed we can call X. The exact figure doesn't matter for now; suffice to say it is X in terms of MIPS or some other benchmark. To profile an instruction, we add it to the loop and see whether it pulls the loop's speed up or down from X. This lets us judge the relative speed of the instruction: if A is the time spent in DEC and B is the time spent in JNZ, then the rest of the loop's time belongs to the instruction being profiled. However, when adding just one instruction, the loop overhead dominates and it is difficult to judge the true speed of the instruction in question. The solution is to increase the number of copies of the instruction per loop, which is a form of loop unrolling.
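To see why unrolling helps, here is a minimal Python sketch of the idea. It assumes a simple additive model where every instruction has a fixed cost and the loop time is the sum of those costs. The costs used (20 ns for the profiled instruction, 10 ns each for DEC and JNZ) are made-up illustration values, not measurements, and the function names are my own.

```python
def loop_mips(n_copies, t_instr_ns, t_dec_ns, t_jnz_ns):
    """MIPS observed for a loop holding n copies of the profiled
    instruction plus one DEC and one JNZ (additive cost model)."""
    instructions = n_copies + 2
    loop_time_ns = n_copies * t_instr_ns + t_dec_ns + t_jnz_ns
    # 1 instruction per nanosecond equals 1000 MIPS
    return instructions / loop_time_ns * 1000.0


def standalone_mips(t_instr_ns):
    """MIPS the profiled instruction would reach with no loop overhead."""
    return 1000.0 / t_instr_ns


if __name__ == "__main__":
    # Hypothetical costs: profiled instruction 20 ns, DEC 10 ns, JNZ 10 ns
    for n in (1, 10, 20, 100):
        print(n, round(loop_mips(n, 20.0, 10.0, 10.0), 1))
    print("standalone", standalone_mips(20.0))
```

Under these assumed costs, one copy in the loop reads 75 MIPS even though the instruction alone would run at 50 MIPS. As the number of copies grows, the loop figure closes in on the instruction's own speed, which is the whole point of unrolling.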

One idea is to increase the number of DEC instructions relative to JNZ and see what happens. In the regular run I got a score of 77 MIPS on my 12600K. Increasing the DEC:JNZ ratio to 10:1 brought the score down to 56 MIPS. At 100:1 it was 54 MIPS.
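Taken together with the reverse test described next (JNZ:DEC at 10:1, scoring 91 MIPS), the 10:1 figure gives two equations in two unknowns, so we can estimate a per-instruction cost for each. The sketch below is only an estimate under the assumption that loop time is the plain sum of instruction costs, which a JIT-compiled emulator will not follow exactly. The function name is hypothetical.

```python
def solve_costs(mips_dec_heavy, mips_jnz_heavy, ratio=10):
    """Estimate per-instruction time (microseconds) for DEC and JNZ from
    two runs: one with ratio DECs per JNZ, one with ratio JNZs per DEC.
    Assumes loop time is the sum of fixed per-instruction costs."""
    n = ratio + 1                     # instructions per loop iteration
    t_dec_heavy = n / mips_dec_heavy  # equals ratio*d + j
    t_jnz_heavy = n / mips_jnz_heavy  # equals d + ratio*j
    det = ratio * ratio - 1
    d = (ratio * t_dec_heavy - t_jnz_heavy) / det
    j = (ratio * t_jnz_heavy - t_dec_heavy) / det
    return d, j


if __name__ == "__main__":
    d, j = solve_costs(56.0, 91.0)  # figures from the 10:1 runs
    print("DEC CD alone:", round(1 / d, 1), "MIPS")
    print("JNZ alone:   ", round(1 / j, 1), "MIPS")
    print("predicted 1:1 loop:  ", round(2 / (d + j), 1), "MIPS")
    print("predicted 100:1 loop:", round(101 / (100 * d + j), 1), "MIPS")
```

With the numbers above this works out to roughly 54 MIPS for DEC CD on its own and roughly 98 MIPS for JNZ. The model predicts about 54 MIPS for the 100:1 run, which matches the measurement, and about 69 MIPS for the 1:1 loop against the 77 measured, so the additive model is only approximate.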

On the other side, a program with a JNZ:DEC ratio of 10:1 brings the score up to 91 MIPS. Either way the score moves by roughly 15 to 20 MIPS from the 77 MIPS baseline. Clearly JNZ is a much faster operation than DEC here, although you would expect DEC to be a lot faster than JNZ! The reason is that DEC CD is very slow, as it is a dual register DEC. Moving to a single register DEC increases the speed by 50-100%:

.address $010100

    LDC #10000
    LDD #25000

loop:
    DEC C
    DEC C
    DEC C
    DEC C
    DEC C
    DEC C
    DEC C
    DEC C
    DEC C
    DEC C
    JNZ loop

    ; C reached zero, decrement D
    LDC #10000
    DEC D
    JNZ loop

    ; done
    HALT

This version runs at 90 MIPS. Considering all of the results so far, we'll use the double counter version (C inside D) with 20 copies of the instruction being profiled unrolled inside the loop. We'll also take the C counter down to 10,000 from 30,000, since the unrolled instructions are almost certainly going to be slower.
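To turn a wall-clock time into MIPS you need the total instruction count of the run. Here is a small Python sketch that counts the instructions executed by the double counter listing above. It assumes JNZ tests the zero flag left by the most recent DEC, and that the C counter is a multiple of the number of unrolled DECs (otherwise C would step past zero inside the loop body). The function names are my own.

```python
def nested_loop_instruction_count(c_start=10_000, d_start=25_000, decs_per_pass=10):
    """Count instructions executed by the C-inside-D counter loop.
    Assumes JNZ looks at the flag set by the last DEC only."""
    assert c_start % decs_per_pass == 0, "C must hit zero exactly at the JNZ"
    inner_iterations = c_start // decs_per_pass
    inner = inner_iterations * (decs_per_pass + 1)  # unrolled DECs + inner JNZ
    outer_overhead = 3   # LDC reload, DEC D, outer JNZ
    setup = 2            # initial LDC and LDD
    halt = 1
    return setup + d_start * (inner + outer_overhead) + halt


def mips(instruction_count, seconds):
    """Millions of instructions per second for a timed run."""
    return instruction_count / seconds / 1e6


if __name__ == "__main__":
    total = nested_loop_instruction_count()
    print(total)                  # 275075003
    print(round(total / 90e6, 2), "seconds at 90 MIPS")
```

For the listing as written (C = 10,000, D = 25,000, ten DECs per pass) that is about 275 million instructions, or a bit over three seconds at 90 MIPS.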

The following chart indicates the best results out of several runs:

LDA

Instruction            Speed      Notes
Empty Loop             97 MIPS
LDA [$1000] x10        90 MIPS
LDA [$1000] x100       95 MIPS
LDAL [$1000] x20       85 MIPS    Not native word size
LDAB [$1000] x20       76 MIPS    unexpected! will check code
LDBLX [$1000] x20      25 MIPS    array method
LDBLX [$1000] x20      45 MIPS    switch method
LDBLX [$1000] x20      64 MIPS    unified memory reads
LDBLX [$1000] x20      73 MIPS    inlined access

Notes on LDA/LDAL

The uneven LDA results above are likely a branch prediction and instruction cache artifact in the WebAssembly/JavaScript JIT compiler. With the empty loop, the CPU's branch predictor may be working against speculative execution overhead. Adding a single LDA gives the pipeline something productive to do between branches, potentially hiding some of the branch misprediction penalty or better aligning the instruction stream. At 10-20 instructions, you're hitting different bottlenecks:

  • Increased loop body size may cause instruction cache pressure.
  • There is more register pressure in the generated machine code.
  • Loop overhead becomes proportionally smaller, but absolute instruction decode cost increases.

The LDAL slowdown is consistent with this: non-native 32-bit operations require more complex codegen, putting additional pressure on the optimizer. This is classic JIT behavior. A tiny amount of work can sometimes improve performance by giving the CPU's execution units better scheduling opportunities, but too much work overwhelms those benefits. You might also be seeing alignment effects, where the single instruction places the loop branch at an optimal address boundary.

Finally, using LDBLX as a proxy for the process we went through earlier, we achieved a roughly 3x speedup (25 to 73 MIPS) by using a switch instead of a map, unifying <u8> memory reads into <u32>, and inlining the load() calls into the opcode handler.
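As a rough illustration of just one of those changes, the unified memory read, here is a Python sketch comparing four separate byte reads with a single 32-bit read. This is not the emulator's code (which runs as WebAssembly/JavaScript). The little-endian byte order, the function names and the sample memory contents are all assumptions made for the example.

```python
import struct


def read_u32_bytewise(mem, addr):
    """Four separate u8 reads combined by shifting (assumed little-endian)."""
    return (mem[addr]
            | (mem[addr + 1] << 8)
            | (mem[addr + 2] << 16)
            | (mem[addr + 3] << 24))


def read_u32_unified(mem, addr):
    """One 32-bit read covering the same four bytes (assumed little-endian)."""
    return struct.unpack_from("<I", mem, addr)[0]


if __name__ == "__main__":
    # Hypothetical memory image with four bytes placed at $1000
    mem = bytearray(0x2000)
    mem[0x1000:0x1004] = bytes([0x78, 0x56, 0x34, 0x12])
    print(hex(read_u32_bytewise(mem, 0x1000)))
    print(hex(read_u32_unified(mem, 0x1000)))
```

Both functions return the same value; the second simply does it with one memory access and no shifting, which is the kind of saving the "unified memory reads" row in the chart refers to.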

I wouldn't want to do this for every instruction because it produces ugly, hard to maintain code, but it works like a charm!

DEC

A loop with 20xDEC had a high mark of 104.7 MIPS.

PUSH/POP

  • PUSH and POP are slower operations, in the 80-85 MIPS range.
  • PUSHA/POPA are much slower still, at around 27 MIPS.
  • Using PUSHA/POPA everywhere will kill performance. We saw a 25% increase in speed after moving from PUSHA to PUSH (reg).
sd-8516_isa_profile.txt · Last modified: by appledog
