flag_operations_are_free

<codify ts>
function prefetch_byte(): u8 {
  // Check if buffer needs refilling
  if (prefetch_offset >= prefetch_valid) {
    // Load 4 new bytes from instruction stream
    prefetch_buffer = load<u32>(...);
    ...
</codify>

You can imagine the rest of the code -- this is enough to understand the problem:

That if (prefetch_offset >= prefetch_valid) check is very bad.

Let me put it this way. If I run an IF on every opcode, it slows the program down by 50%, turning a 90 MIPS LDA benchmark into a 45 MIPS benchmark. Now, I tried a lot of different methods to get a better score than just using load<>.
| + | |||
| + | In fact, the way I got to a 95 MIPS benchmark for unrolled LDA operations was to simply use fetch_byte(). If I did anything else, //including inliling the load<> | ||
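
The article never shows fetch_byte() itself, so the following is only a sketch of what a branch-free version could look like in AssemblyScript. It assumes _IP is a plain byte index into linear memory and that there is no refill or bounds check in the hot path; both the global declaration and the direct load<u8> addressing are assumptions, not the article's code.

<codify ts>
// Hypothetical sketch -- not the article's actual fetch_byte().
// Assumes the instruction stream lives directly in linear memory
// and _IP is a plain byte index into it.
var _IP: u32 = 0;

function fetch_byte(): u8 {
  let b: u8 = load<u8>(_IP); // one raw load, no buffer-refill branch
  _IP++;                     // advance the instruction pointer
  return b;
}
</codify>

The per-opcode cost is just a load and an increment, so there is nothing left for an IF to hide behind; any extra check shows up directly in the MIPS numbers.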
| + | |||
| + | At first glance you may wonder what on earth is happening. How did I beat the 87.5 MIPS implementation above? Simple, by not trying to cheat the system. As it turns out, load<> | ||
| + | |||
| + | It's strange but true. If you factor out the load operations, trying to load everything at once adds complexity: | ||
| + | |||
| + | let addr = (instruction >> 8) & 0x00FFFFFF; | ||
| + | |||
| + | You're adding a bit shift, a bitwise AND, plus you're creating an intermediary variable access. In the end this is almost 10% slower than just calling fetch_byte(). | ||
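
To make that concrete, here are two hand-written decoders for the LD_IMM case, reusing the article's helper names (fetch_byte, fetch_word, set_register, _IP). The packed field layout in the first one -- a register byte followed by a 16-bit immediate -- is assumed for illustration, not the documented encoding.

<codify ts>
// Both sketches assume the dispatch switch has already consumed the
// opcode byte, so _IP points at the register byte. Layout is assumed:
// [reg:8][imm:16][...].

// "Clever" version: one wide load, then shift-and-mask decoding.
function ld_imm_batched(): void {
  let packed: u32 = load<u32>(_IP);   // grab reg + imm16 (+ a spare byte)
  let reg: u8  = <u8>(packed & 0xFF); // extra mask
  let imm: u16 = <u16>(packed >> 8);  // extra shift
  _IP += 3;
  set_register(reg, imm);
}

// "Naive" version: just keep calling the fetchers.
function ld_imm_simple(): void {
  let reg: u8  = fetch_byte();
  let imm: u16 = fetch_word();
  set_register(reg, imm);
}
</codify>

The batched version trades two cheap calls for a wide load plus a shift, a mask, and an extra temporary, which is exactly the overhead the 10% figure above is describing.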
| + | |||
| + | == Conclusion | ||
| + | The grass is green, the sky is blue. | ||
| + | |||
| + | <codify ts> | ||
| + | export function cpu_step(): void { | ||
| + | let IP_now:u32 = _IP; | ||
| + | |||
| + | // Pre-fetch 4 bytes but only commit to using opcode initially | ||
| + | let opcode = fetch_byte(); | ||
| + | |||
| + | switch(opcode) { | ||
| + | /////////////////////////////////////////////////////////////////////// | ||
| + | // LDR/STR load and store architexture | ||
| + | /////////////////////////////////////////////////////////////////////// | ||
| + | case OP.LD_IMM: { | ||
| + | let reg = fetch_byte(); | ||
| + | |||
| + | // Determine width based on register number | ||
| + | if (reg < 16) { | ||
| + | // 16-bit register | ||
| + | let value:u16 = fetch_word(); | ||
| + | set_register(reg, | ||
| + | |||
| + | ... | ||
| + | </ | ||
| + | |||
| + | Nothing beats this. This is it. You can't even inline it, it messes with the compiler. | ||
| + | |||
| + | If you want to get faster than this, you need to rewrite the entire switch in C or maybe Rust. | ||
| + | |||
| + | As a result of all this, I now know that flag operations are free inside an opcode and I don't need to have a "fast flags" bit. Simply checking that bit every instruction was slowing down the system by far more than it saved. | ||
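
For what that looks like in practice, here is a hedged sketch of the two flag-update strategies. The flag variables and the fast_flags bit are assumed names (the article doesn't show its flag code), and the exact policy the old bit implemented is a guess; the point is only that the guard is one more branch per instruction, while the unconditional update is a compare and a mask.

<codify ts>
// Hypothetical sketch -- flag storage and the old fast_flags bit are
// assumed names, not the article's actual code.
var flag_z: bool = false;
var flag_n: bool = false;
var fast_flags: bool = false;

// Old approach: guard the flag math behind a mode bit.
function update_flags_guarded(result: u16): void {
  if (!fast_flags) {                 // one more branch on every ALU opcode
    flag_z = result == 0;
    flag_n = (result & 0x8000) != 0;
  }
}

// "Flag operations are free": just always compute them.
function update_flags_always(result: u16): void {
  flag_z = result == 0;              // a compare
  flag_n = (result & 0x8000) != 0;   // a mask and a compare
}
</codify>

Checking fast_flags is structurally the same kind of per-instruction branch as the prefetch_offset check at the top of the page, which is why dropping it was a net win.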
| + | |||
| + | //Moral: " | ||