flag_operations_are_free

Differences

This shows you the differences between two versions of the page.

flag_operations_are_free [2026/01/15 04:57] appledogflag_operations_are_free [2026/01/15 05:32] (current) appledog
Line 188: Line 188:
     if (prefetch_offset >= prefetch_valid) is very bad.
  
-Let me put it this way. If I run an IF on every opcode, it slows the program by 50%, turning a 90 MIPS LDA benchmark into a 45 MIPS benchmark. Now, I tried a lot of different methods to try and get a better score than just using load<u8> when needed. The closest I got was a branchless <u64> prefetch version of the above, which got around 84 MIPS. Even just trying to load everything at the top of the loop (a u32 load) and bit-rotating out the operands was no more than a 1% improvement (i.e. not worth the trouble, frankly). I got more juice out of inlining the loads and managing _IP inside the LD opcode than anything else.
+Let me put it this way. If I run an IF on every opcode, it slows the program by 50%, turning a 90 MIPS LDA benchmark into a 45 MIPS benchmark. Now, I tried a lot of different methods to try and get a better score than just using load<u8> when needed. The closest I got was a branchless <u64> prefetch version of the above, which got around 84 MIPS. Even just trying to load everything at the top of the loop (a u32 load) and bit-rotating out the operands was no more than a 1% improvement (i.e. not worth the trouble, frankly).
 + 
 +In fact, the way I got to a 95 MIPS benchmark for unrolled LDA operations was to simply use fetch_byte(). If I did anything else, //including inlining the load<> operations//, the program would run slower.
 + 
 +At first glance you may wonder what on earth is happening. How did I beat the 87.5 MIPS implementation above? Simple: by not trying to cheat the system. As it turns out, load<> is already as optimized as it is going to get in WebAssembly. The epiphany is that we're running a simulation, and the host computer is already doing a good job of helping us access memory. Any abstraction we put on top ends up getting in the way.
 + 
 +It's strange but true. If you factor out the load operations, trying to load everything at once adds complexity: 
 + 
 +    let addr = (instruction >> 8) & 0x00FFFFFF;      // Extract upper 3 bytes 
 + 
 +You're adding a bit shift and a bitwise AND, plus an extra access to an intermediate variable. In the end this is almost 10% slower than just calling fetch_byte().
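
To make the comparison concrete, here is a minimal sketch of the two decode strategies in plain TypeScript. It is not the emulator's actual code: the names `decodeByteWise`, `decodeWordWise` and `Decoded`, the little-endian layout, and the 1-byte opcode plus 3-byte address format are all assumptions made for illustration. It only shows that both routes decode the same values and where the extra shift and mask come from; it says nothing about timing.

```typescript
// Hypothetical instruction layout (an assumption): 1 opcode byte + 3 address bytes, little-endian.
type Decoded = { opcode: number; addr: number; next: number };

// Strategy A: fetch each byte only when it is needed, advancing the pointer as we go.
function decodeByteWise(mem: Uint8Array, start: number): Decoded {
    let ip = start;
    const fetchByte = (): number => mem[ip++];
    const opcode = fetchByte();
    // Operands are evaluated left to right, so the bytes are read in order.
    const addr = fetchByte() | (fetchByte() << 8) | (fetchByte() << 16);
    return { opcode, addr, next: ip };
}

// Strategy B: load 4 bytes at once, then shift and mask the operands out.
function decodeWordWise(mem: Uint8Array, start: number): Decoded {
    const view = new DataView(mem.buffer, mem.byteOffset, mem.byteLength);
    const instruction = view.getUint32(start, true); // true = little-endian
    const opcode = instruction & 0xff;
    const addr = (instruction >>> 8) & 0x00ffffff;   // the extra shift + AND
    return { opcode, addr, next: start + 4 };
}
```

Both functions return the same opcode, address and next instruction pointer for the same bytes; the difference is that Strategy B pays for the shift, the mask and the intermediate `instruction` value on every instruction, whether or not the opcode needs the operands.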
 +     
 +== Conclusion 
 +The grass is green, the sky is blue. 
 + 
 +<codify ts> 
 +export function cpu_step(): void { 
 +    let IP_now:u32 = _IP
 + 
 +    // Fetch only the opcode byte here; each case fetches its own operands
 +    let opcode = fetch_byte(); 
 + 
 +    switch(opcode) { 
 +        /////////////////////////////////////////////////////////////////////// 
 +        // LDR/STR load and store architecture                               //
 +        /////////////////////////////////////////////////////////////////////// 
 +        case OP.LD_IMM: { 
 +            let reg = fetch_byte(); 
 + 
 +            // Determine width based on register number 
 +            if (reg < 16) { 
 +                // 16-bit register 
 +                let value:u16 = fetch_word(); 
 +                set_register(reg, value); 
 + 
 +... 
 +</codify> 
 + 
 +Nothing beats this. This is it. You can't even inline it; doing so messes with the compiler.
 + 
 +If you want to get faster than this, you need to rewrite the entire switch in C or maybe Rust. 
 + 
 +As a result of all this, I now know that flag operations are free inside an opcode and I don't need to have a "fast flags" bit. Simply checking that bit every instruction was slowing down the system by far more than it saved. 
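
As a rough illustration of that trade-off, the sketch below contrasts an opcode that always updates its flags with one that checks a "fast flags" bit first. Everything here is assumed for the example: the `Cpu` shape, an 8-bit accumulator, the zero and carry flags, and the meaning of `fastFlags` (skip flag updates when set). It is not the emulator's real code.

```typescript
// Hypothetical CPU state; field names are illustrative only.
interface Cpu {
    a: number;          // 8-bit accumulator (assumed width)
    zero: boolean;
    carry: boolean;
    fastFlags: boolean; // assumed meaning: when true, skip flag updates
}

function makeCpu(a: number, fastFlags: boolean): Cpu {
    return { a, zero: false, carry: false, fastFlags };
}

// Variant 1: always update the flags inside the opcode. No branch.
function addAlways(cpu: Cpu, operand: number): void {
    const sum = cpu.a + operand;
    cpu.a = sum & 0xff;
    cpu.carry = sum > 0xff;
    cpu.zero = cpu.a === 0;
}

// Variant 2: guard the flag updates behind the fast-flags bit.
// The check itself runs on every instruction, even when it saves nothing.
function addGuarded(cpu: Cpu, operand: number): void {
    const sum = cpu.a + operand;
    cpu.a = sum & 0xff;
    if (!cpu.fastFlags) {
        cpu.carry = sum > 0xff;
        cpu.zero = cpu.a === 0;
    }
}
```

The flag updates in Variant 1 are a comparison and two stores on values the opcode has already computed, while Variant 2 adds a conditional to every instruction, which is the kind of per-opcode check described above as costing more than it saves.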
 + 
 +//Moral: "Premature optimization is the root of all evil."//
flag_operations_are_free.1768453073.txt.gz · Last modified: by appledog
