sd-8516_isa_profile · created 2026/01/20 02:10 by appledog · removed 2026/02/22 14:35 by appledog
= SD-8516 ISA Profile
Writing assembly language programs on the SD-8516 is similar to writing them on the 8510, but with some differences. Programs are entered in much the same way:
| - | |||
| - | == ISA Profile Chart | ||
| - | The following program illustrates a baseline: | ||
| - | |||
| - | <codify armasm> | ||
| - | .address $010100 | ||
| - | |||
| - | LDA $1010 | ||
| - | LDB $0101 | ||
| - | |||
| - | LDCD $0FFFFFFF | ||
| - | |||
| - | loop: | ||
| - | DEC CD ; Decrement CD | ||
| - | JNZ @loop ; Jump to loop if CD != 0 | ||
| - | |||
| - | HALT ; Halt when done | ||
| - | </ | ||
| - | |||
This program executes at a certain speed we can call X. It doesn't matter what X is in absolute terms; what we care about is how X changes as we vary the instruction mix.

One idea is to increase the number of DEC instructions relative to JNZ and see what happens. In the regular run I got a score of 77 MIPS on my 12600K. Increasing the DEC:JNZ ratio to 10:1 brought it down to 56 MIPS; at 100:1 we got 54 MIPS.

On the other side, a program with a JNZ:DEC ratio of 10:1 brings MIPS up to 91. In either case, a nearly 20 MIPS difference. Clearly, then, JNZ is a much faster operation than DEC, although you would expect DEC to be a lot faster than JNZ! The reason is that DEC CD is very slow, because it is a dual-register DEC. Moving to a single-register DEC increases the speed by 50-100%:

<codify armasm>
.address $010100

LDC #10000
LDD #25000

loop:
DEC C
DEC C
DEC C
DEC C
DEC C
DEC C
DEC C
DEC C
DEC C
DEC C
JNZ @loop

; C reached zero, decrement D
LDC #10000
DEC D
JNZ @loop

; done
HALT
</codify>

This version runs at 90 MIPS. Considering all of the results so far, we'll use the double C counter version with 20 copies of the profiling instruction unrolled inside the loop. We'll also take the C loop down to 10,000 from 30,000, since the instructions we unroll into the loop are almost surely going to be slower than DEC.

The following chart indicates the best results out of several runs:

== LDA

^ Instruction ^ Throughput ^ Notes ^
| Empty loop | 97 MIPS | |
| LDA [$1000]x10 | 90 MIPS | |
| LDA [$1000]x100 | 95 MIPS | |
| LDAL [$1000]x20 | 85 MIPS | Not native word size |
| LDAB [$1000]x20 | 76 MIPS | Unexpected! Will check code |
| LDBLX [$1000]x20 | 25 MIPS | Array method |
| LDBLX [$1000]x20 | 45 MIPS | Switch method |
| LDBLX [$1000]x20 | 64 MIPS | Unified memory reads |
| LDBLX [$1000]x20 | 73 MIPS | Inlined access |

=== Notes on LDA/LDAL
This is likely a branch prediction and instruction cache artifact in the WebAssembly/JavaScript host rather than in the SD-8516 model itself.
With the empty loop, the CPU's branch predictor may be working against speculative execution overhead. Adding a single LDA gives the pipeline something productive to do between branches, potentially hiding some of the branch misprediction penalty or better aligning the instruction stream.
At 10-20 instructions, other costs begin to dominate:

* Increased loop body size may cause instruction cache pressure
* More register pressure in the generated machine code
* Loop overhead becomes proportionally smaller, but absolute instruction decode cost increases

The LDAL slowdown supports this: non-native 32-bit operations require more complex codegen, putting additional pressure on the optimizer.
This is classic JIT behavior: a tiny amount of work can sometimes improve performance by giving the CPU's execution units better scheduling opportunities, while a larger body simply costs more to execute.
You might also be seeing alignment effects: the single instruction could be placing the loop branch at an optimal address boundary.

Finally, using LDBLX as a proxy for the process we went through earlier, we achieved a 3x speedup by using a switch instead of a map, unifying the <u8> memory reads, and inlining the access, as the LDBLX rows in the chart above show.

I wouldn't read too much into the exact figures, though.

== DEC
A loop with 20x DEC had a high mark of 104.7 MIPS.

== PUSH/POP
* PUSH and POP are slower operations, in the 80-85 MIPS range.
* But PUSHA/POPA are noticeably slower, in the 27 MIPS range.
* Using PUSHA/POPA everywhere will kill performance. We saw a 25% increase in speed after moving from PUSHA to PUSH (reg).
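
A back-of-the-envelope model for the PUSHA penalty, as a JavaScript sketch. The fixed dispatch cost of 1 unit and the assumption that PUSHA performs 8 register writes are both illustrative guesses, not measured properties of the SD-8516:

```javascript
// Model per-instruction cost as a fixed dispatch cost plus one unit
// per memory write, and relative throughput as its reciprocal.
function relativeThroughput(writes, dispatchCost = 1) {
  return 1 / (dispatchCost + writes);
}

const push = relativeThroughput(1);  // PUSH (reg): one write
const pusha = relativeThroughput(8); // PUSHA: assumed 8 writes
const ratio = pusha / push;          // ~0.22 under these assumptions
```

Even this crude model predicts PUSHA retiring only a fraction of the instructions per second that PUSH does, in the same direction as the 80-85 vs ~27 MIPS measurements: a single PUSHA carries many writes but still counts as one retired instruction.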
