flag_operations_are_free
Revision 2026/01/15 05:32 (current) by appledog; previous revision 2026/01/11 16:20.
Here's LDAL-1:

<codify armasm>
; Test program: 1 million LDAL [$1000] operations
; Uses CD (32-bit "count down" counter register)

.address $010100

        LDCD #1000000   ; Load 1,000,000 into CD

loop:
        LDAL [$1000]
        DEC CD          ; Decrement CD
        CMP CD, 0
        JNZ loop        ; Jump to loop if CD != 0

        HALT            ; Halt when done
</codify>
Next I moved to LDAL-3, where I removed the CMP since it was no longer needed:

<codify armasm>
; Test program: 1 million LDAL [$1000] operations
; Uses CD (32-bit "count down" counter register)

.address $010100

        LDCD #1000000   ; Load 1,000,000 into CD

loop:
        LDAL [$1000]
        DEC CD          ; Decrement CD (sets the zero flag)
        JNZ loop        ; Jump to loop if CD != 0

        HALT            ; Halt when done
</codify>
Now this was a real eye opener. Removing the explicit check while keeping the flag ops ON resulted in a MIPS score of 2.1! I'd like to say this was surprising but not entirely unexpected -- no, it was genuinely unexpected. The flag operations for LD and DEC are a significant amount of work, as each instruction executes 1 million times. Here's the code that we're talking about:
  * ZERO_FLAG = value === 0;
  * NEGATIVE_FLAG = (value & 0x8000) !== 0;
  * OVERFLOW_FLAG = value === 0x8000;
That is a significant amount of code to remove, but ONE compare op was killing it. Having the flag code make no impact whatsoever was surprising, so I removed the IF statements blocking these flags on DEC. This produced LDAL-2b, which surprised me by again scoring the exact same 2.1 MIPS. So, over 2 million if statements weren't moving the needle?
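To make the trade-off concrete, here is a minimal sketch in plain TypeScript standing in for the AssemblyScript core (the names ''decFenced'', ''decFree'', and the ''flagsEnabled'' fence are illustrative, not the emulator's actual API). It contrasts a DEC whose flag writes sit behind a "fast flags" fence with one that writes the flags unconditionally:

```typescript
// Plain-TypeScript sketch; names are illustrative, not the emulator's API.
let ZERO_FLAG = false;
let NEGATIVE_FLAG = false;
let OVERFLOW_FLAG = false;

// Fenced version: one extra conditional per executed DEC.
function decFenced(value: number, flagsEnabled: boolean): number {
  const result = (value - 1) & 0xffff;
  if (flagsEnabled) { // the fence the article ends up removing
    ZERO_FLAG = result === 0;
    NEGATIVE_FLAG = (result & 0x8000) !== 0;
    OVERFLOW_FLAG = value === 0x8000; // 0x8000 - 1 overflows to 0x7fff
  }
  return result;
}

// Unfenced version: three straight-line stores, no branch.
function decFree(value: number): number {
  const result = (value - 1) & 0xffff;
  ZERO_FLAG = result === 0;
  NEGATIVE_FLAG = (result & 0x8000) !== 0;
  OVERFLOW_FLAG = value === 0x8000;
  return result;
}
```

Run a million times, the only structural difference between the two is the branch itself -- which is exactly the cost the benchmarks ended up exposing.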
I replaced the flag fences and I created LDAL-3; this time, I had 100, […]
It was only when I began profiling the LDA and LDAB versions:
<codify armasm>
; Test program: 1 million LDAL [$1000] operations
; Uses CD (32-bit "count down" counter register)

.address $010100

        LDCD #1000      ; Load 1,000 into CD (0x989680 is 10 mil)

loop:
        LDAB [$1000]
        DEC CD          ; Decrement CD
        CMP CD, 0
        JNZ loop        ; Jump to loop if CD != 0

        HALT            ; Halt when done
</codify>

Finally, I wanted to find out why they were operating so slowly compared to LDAL. This led me to discover that a single load<u32> could fetch several operand bytes at once.

In closing, here's the 87.5 MIPS version of LDA16 [$addr], the workhorse of this 16-bit system:

<codify typescript>
case OP.LD_MEM: {
  // Load reg (1 byte) + addr (3 bytes) = 4 bytes total
  let instruction = load<u32>(IP);
  let reg: u8 = instruction as u8;            // Extract low byte
  let addr = (instruction >> 8) & 0x00FFFFFF; // Extract 24-bit address
  // Pre-load 32 bits from target address
  let value = load<u32>(addr);
  let reg_index = reg & 0x0F;                 // Extract physical register 0-15
  IP += 4;

  if (reg < 16) {
    set_register_16bit(reg, value as u16);
    ZERO_FLAG = (value & 0xFFFF) === 0;
    NEGATIVE_FLAG = (value & 0x8000) !== 0;
  } else if (reg < 48) {
    set_register_8bit(reg, value as u8);
    ZERO_FLAG = (value & 0xFF) === 0;
    NEGATIVE_FLAG = (value & 0x80) !== 0;
  } else if (reg < 80) {
    set_register_24bit(reg, value & 0x00FFFFFF);
    ZERO_FLAG = (value & 0x00FFFFFF) === 0;
    NEGATIVE_FLAG = (value & 0x800000) !== 0;
  } else {
    set_register_32bit(reg, value);
    ZERO_FLAG = value === 0;
    NEGATIVE_FLAG = (value & 0x80000000) !== 0;
  }
  break;
}
</codify>

== The Prefetch System ==

Combining the loads and bit-shifting things out worked so well I had the brilliant idea to create a prefetch system. Why not batch loads across instructions?

Here's the code, preserved for posterity; may it serve as a lesson to all that too much of a good thing actually sucks:

<codify ts>
// ============================================================================
// INSTRUCTION PREFETCH BUFFER SYSTEM
// ============================================================================
//
// Purpose: Reduce memory operations by batching instruction fetches.
// Instead of loading 1 byte at a time from RAM, we load 4 bytes at once
// and extract individual bytes from the buffer as needed.
//
// Trade-off: Adds overhead from buffer management, but amortizes memory
// loads across multiple instruction bytes.
//
// Performance note: This system was intended to improve performance by
// reducing memory operations, but in practice added overhead that reduced
// MIPS from ~90 to ~60. WebAssembly's linear-memory loads
// are already well-optimized for sequential loads.
// ============================================================================

// Prefetch buffer holds up to 4 bytes of instruction stream
let prefetch_buffer: u32 = 0;
let prefetch_valid: i32 = 0;
let prefetch_offset: i32 = 0;

/**
 * Fetch a single byte from the instruction prefetch buffer.
 *
 * Automatically refills the buffer when exhausted by loading 4 new bytes
 * from RAM at the current instruction pointer (_IP).
 *
 * @returns The next instruction byte
 */
function prefetch_byte(): u8 {
  // Check if buffer needs refilling
  if (prefetch_offset >= prefetch_valid) {
    // Load 4 new bytes from instruction stream
    prefetch_buffer = load<u32>(_IP);
    prefetch_valid = 4;
    prefetch_offset = 0;
    _IP += 4; // Advance physical IP by 4 bytes
  }

  // Extract byte at current offset (0, 8, 16, or 24 bits from right)
  let byte = (prefetch_buffer >> (prefetch_offset * 8)) as u8;
  prefetch_offset++;

  return byte;
}
</codify>

You can imagine the rest of the code -- this is enough to understand the problem:

''if (prefetch_offset >= prefetch_valid)'' is very bad.

Let me put it this way: if I run an IF on every opcode, it slows the program by 50%, turning a 90 MIPS LDA benchmark into a 45 MIPS benchmark. I tried a lot of different methods to get a better score than just using load<u32> followed by bit shifting.

In fact, the way I got to a 95 MIPS benchmark for unrolled LDA operations was to simply use fetch_byte(). If I did anything else, //including inlining the load<> calls by hand//, it came out slower.

At first glance you may wonder what on earth is happening. How did I beat the 87.5 MIPS implementation above? Simple: by not trying to cheat the system. As it turns out, load<u8> is just as fast as load<u32>.

It's strange but true. If you factor out the load operations, trying to load everything at once adds complexity:

  let addr = (instruction >> 8) & 0x00FFFFFF;

You're adding a bit shift, a bitwise AND, plus an intermediate variable access. In the end this is almost 10% slower than just calling fetch_byte().

== Conclusion ==

The grass is green, the sky is blue.

<codify ts>
export function cpu_step(): void {
  let IP_now: u32 = _IP;

  // Fetch the opcode byte
  let opcode = fetch_byte();

  switch (opcode) {
    ///////////////////////////////////////////////////////////////////////
    // LDR/STR load and store architecture
    ///////////////////////////////////////////////////////////////////////
    case OP.LD_IMM: {
      let reg = fetch_byte();

      // Determine width based on register number
      if (reg < 16) {
        // 16-bit register
        let value: u16 = fetch_word();
        set_register(reg, value);

...
</codify>
Nothing beats this. This is it. You can't even inline it; that messes with the compiler.
If you want to get faster than this, you need to rewrite the entire switch in C or maybe Rust.

As a result of all this, I now know that flag operations are free inside an opcode, and I don't need a "fast flags" bit. Simply checking that bit every instruction was slowing the system down by far more than it saved.

//Moral: "Flag operations are free."//
