flag_operations_are_free
Here's LDAL-1:

<codify armasm>
; Test program: 1 million LDAL [$1000] operations
; Uses CD (32-bit "count down" counter register)
</codify>

  * OVERFLOW_FLAG = value === 0x8000;

That is a significant amount of code to remove, but ONE compare op was killing it. Having this make no impact whatsoever was surprising, so I removed the IF statements blocking these flags on DEC. This produced LDAL-2b, which surprised me by again scoring the exact same 2.1 MIPS. So over 2 million if statements weren't moving the needle?

I replaced the flag fences and created LDAL-3; this time, I had 100,…
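Since the flag rules quoted above are where all of this started, here is a minimal standalone sketch of what computing them unconditionally on DEC looks like. This is plain TypeScript with ordinary numbers, not the emulator's AssemblyScript; ''decFlags16'' and its field names are my own illustration, not the emulator's API.

```typescript
// Hypothetical illustration (not the emulator's actual code): flag rules for
// a 16-bit DEC, computed unconditionally -- no "fast flags" gate in front.
function decFlags16(value: number): { result: number; zero: boolean; negative: boolean; overflow: boolean } {
  const result = (value - 1) & 0xFFFF; // 16-bit wrap-around decrement
  return {
    result,
    zero: result === 0,
    negative: (result & 0x8000) !== 0, // high bit set means negative
    overflow: value === 0x8000,        // signed minimum decremented: 0x8000 -> 0x7FFF
  };
}
```

For example, decFlags16(1) sets the zero flag, decFlags16(0) wraps to 0xFFFF and sets negative, and decFlags16(0x8000) sets overflow.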
| - | For the record, I created versions which used LDA and LDAB | + | It was only when I began profiling |
| + | <codify armasm> | ||
| + | ; Test program: 1 million LDAL [$1000] operations | ||
| + | ; Uses CD (32-bit "count down" counter register) | ||
| - | 78 MIPS With SEF & CMP CD, 0 | + | .address $010100 |
| - | 73 MIPS With CLF & no CMP | + | |
| + | LDCD #1000 ; Load 1,000 into CD (0x989680 is 10 mil) | ||
| + | |||
| + | loop: | ||
| + | LDAB [$1000] | ||
| + | DEC CD ; Decrement CD | ||
| + | CMP CD, 0 | ||
| + | JNZ loop ; Jump to loop if CD != 0 | ||
| + | |||
| + | HALT ; Halt when done | ||
| + | </ | ||
| + | |||
| + | Finally, I wanted to find out why they were operating so slowly compared to LDAL. This led me to discover that load< | ||
| + | |||
| + | In closing, here's the 87.5 MIPS version of LDA16 [$addr], the workhorse of this 16 bit system: | ||
| + | |||
| + | <codify typescript> | ||
| + | case OP.LD_MEM: { | ||
| + | // Load reg (1 byte) + addr (3 bytes) = 4 bytes total | ||
| + | let instruction = load< | ||
| + | let reg:u8 = instruction as u8; // Extract low byte | ||
| + | let addr = (instruction >> 8) & 0x00FFFFFF; | ||
| + | // Pre-load 32 bits from target address | ||
| + | let value = load< | ||
| + | let reg_index = reg & 0x0F; // Extract physical register 0-15 | ||
| + | IP += 4; | ||
| + | |||
| + | if (reg < 16) { | ||
| + | set_register_16bit(reg, | ||
| + | ZERO_FLAG = value === 0; | ||
| + | NEGATIVE_FLAG = (value & 0x8000) !== 0; | ||
| + | //if (DEBUG) log(`$${hex24(IP_now)} | ||
| + | } else if (reg < 48) { | ||
| + | set_register_8bit(reg, | ||
| + | ZERO_FLAG = value === 0; | ||
| + | NEGATIVE_FLAG = (value & 0x80) !== 0; | ||
| + | //if (DEBUG) log(`$${hex24(IP_now)} | ||
| + | } else if (reg < 80) { | ||
| + | set_register_24bit(reg, | ||
| + | ZERO_FLAG = value === 0; | ||
| + | NEGATIVE_FLAG = (value & 0x800000) !== 0; | ||
| + | //if (DEBUG) log(`$${hex24(IP_now)} | ||
| + | } else { | ||
| + | set_register_32bit(reg, | ||
| + | ZERO_FLAG = value === 0; | ||
| + | NEGATIVE_FLAG = (value & 0x80000000) !== 0; | ||
| + | //if (DEBUG) log(`$${hex24(IP_now)} | ||
| + | } | ||
| + | break; | ||
| + | } | ||
| + | </ | ||
| + | |||
| + | == The Prefetch System | ||
| + | Combining the loads and bit-shifting things out worked so well I had the brilliant idea to create a prefetch system. Why not batch loads across instructions? | ||
| + | |||
| + | Here's the code, preserved for posterity, may it serve a lesson to all that too much of a good thing actually sucks: | ||
| + | |||
| + | <codify ts> | ||
| + | // ============================================================================ | ||
| + | // INSTRUCTION PREFETCH BUFFER SYSTEM | ||
| + | // ============================================================================ | ||
| + | // | ||
| + | // Purpose: Reduce memory operations by batching instruction fetches. | ||
| + | // Instead of loading 1 byte at a time from RAM, we load 4 bytes at once | ||
| + | // and extract individual bytes from the buffer as needed. | ||
| + | // | ||
| + | // Trade-off: Adds overhead from buffer management, but amortizes memory | ||
| + | // loads across multiple instruction bytes. | ||
| + | // | ||
| + | // Performance note: This system was intended to improve performance by | ||
| + | // reducing memory operations, but in practice added overhead that reduced | ||
| + | // MIPS from ~90 to ~60. WebAssembly' | ||
| + | // are already well-optimized for sequential loads. | ||
| + | // ============================================================================ | ||
| + | |||
| + | // Prefetch buffer holds up to 4 bytes of instruction stream | ||
| + | let prefetch_buffer: | ||
| + | let prefetch_valid: | ||
| + | let prefetch_offset: | ||
| + | |||
| + | /** | ||
| + | * Fetch a single byte from the instruction prefetch buffer. | ||
| + | | ||
| + | * Automatically refills the buffer when exhausted by loading 4 new bytes | ||
| + | * from RAM at the current instruction pointer (_IP). | ||
| + | | ||
| + | * @returns The next instruction byte | ||
| + | */ | ||
| + | function prefetch_byte(): | ||
| + | // Check if buffer needs refilling | ||
| + | if (prefetch_offset >= prefetch_valid) { | ||
| + | // Load 4 new bytes from instruction stream | ||
| + | prefetch_buffer = load< | ||
| + | prefetch_valid = 4; | ||
| + | prefetch_offset = 0; | ||
| + | _IP += 4; // Advance physical IP by 4 bytes | ||
| + | } | ||
| + | | ||
| + | // Extract byte at current offset (0, 8, 16, or 24 bits from right) | ||
| + | let byte = (prefetch_buffer >> (prefetch_offset * 8)) as u8; | ||
| + | prefetch_offset++; | ||
| + | | ||
| + | return byte; | ||
| + | } | ||
| + | </ | ||
| + | |||
| + | You can imagine the rest of the code -- this is enough to understand the problem: | ||
| + | |||
| + | if (prefetch_offset >= prefetch_valid) is very bad. | ||
| + | |||
| + | Let me put it this way. If I run an IF on every opcode, it slows the program by 50%, turning a 90 mips LDA benchmark into a 45 MIPS benchmark. Now, I tried a lot of different methods to try and get a better score than just using load< | ||
| + | |||
| + | In fact, the way I got to a 95 MIPS benchmark for unrolled LDA operations was to simply use fetch_byte(). If I did anything else, //including inliling the load<> | ||
| + | |||
| + | At first glance you may wonder what on earth is happening. How did I beat the 87.5 MIPS implementation above? Simple, by not trying to cheat the system. As it turns out, load<> | ||
| + | |||
| + | It's strange but true. If you factor out the load operations, trying to load everything at once adds complexity: | ||
| + | |||
| + | let addr = (instruction >> 8) & 0x00FFFFFF; | ||
| + | |||
| + | You're adding a bit shift, a bitwise AND, plus you're creating an intermediary variable access. In the end this is almost 10% slower than just calling fetch_byte(). | ||
| + | | ||
| + | == Conclusion | ||
| + | The grass is green, the sky is blue. | ||
| + | |||
| + | <codify ts> | ||
| + | export function cpu_step(): void { | ||
| + | let IP_now:u32 = _IP; | ||
| + | |||
| + | // Pre-fetch 4 bytes but only commit to using opcode initially | ||
| + | let opcode = fetch_byte(); | ||
| + | |||
| + | switch(opcode) { | ||
| + | /////////////////////////////////////////////////////////////////////// | ||
| + | // LDR/STR load and store architexture | ||
| + | /////////////////////////////////////////////////////////////////////// | ||
| + | case OP.LD_IMM: { | ||
| + | let reg = fetch_byte(); | ||
| + | |||
| + | // Determine width based on register number | ||
| + | if (reg < 16) { | ||
| + | // 16-bit register | ||
| + | let value:u16 = fetch_word(); | ||
| + | set_register(reg, | ||
| + | |||
| + | ... | ||
| + | </ | ||
| + | Nothing beats this. This is it. You can't even inline it, it messes with the compiler. | ||
| + | If you want to get faster than this, you need to rewrite the entire switch in C or maybe Rust. | ||
| + | As a result of all this, I now know that flag operations are free inside an opcode and I don't need to have a "fast flags" bit. Simply checking that bit every instruction was slowing down the system by far more than it saved. | ||
| + | //Moral: " | ||
flag_operations_are_free.1768148509.txt.gz · Last modified: 2026/01/11 16:21 by appledog
