I just pushed updated code along with the dependencies that had been missing for the last three years and an updated 64Doom ROM Builder Toolkit to GitHub: https://github.com/jnmartin84/64doom Enjoy.
As for changes to 64Doom in this release:
* very fast assembly implementations of memcpy and memset, lifted from MIPS Technologies code in GNU Lib C and Android
* optimal hand-rolled assembly routines for endianness conversion, SwapSHORT and SwapLONG, that I cannot find any way to make better or shorter; they are basically one MIPS instruction per C operation at this point (see "m_swap.S")
* optimal-ish hand-rolled assembly routine for FixedMul; I didn't get around to FixedDiv yet but will push a version soon (see "m_fixedmul.S")
* hand-rolled assembly for R_DrawColumn and R_DrawSpan, the two functions responsible for drawing the entire in-game display except for the status bar. I took a lot of time to write these so that they have no prologue/epilogue and never spill a register. They only use caller-saved registers along with the argument and return value registers, so nothing has to be saved or restored. They have early-return conditions that can be met after executing 4 to 5 instructions for columns and spans smaller than a pixel. Also, their texture-mapping loops (the "do { ... } while(count--);" loops at the end of the C versions) are shorter than any GCC-generated code at any optimization level, and the functions as a whole are 20 - 30 instructions shorter each than any GCC-generated code (see "R_DrawColumn.S" and "R_DrawSpan.S")
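For anyone who has not looked at these routines before, here is a rough C sketch of what SwapSHORT, SwapLONG and FixedMul compute. It is not the code from the repository: the fixed-width types and the `FRACBITS` constant follow the usual Doom conventions and are assumptions here, and the assembly files named above remain the reference.

```c
#include <stdint.h>

typedef int32_t fixed_t;   /* 16.16 fixed point (assumed, as in stock Doom) */
#define FRACBITS 16

/* Swap the two bytes of a 16-bit value. */
uint16_t SwapSHORT(uint16_t x)
{
    return (uint16_t)((x >> 8) | (x << 8));
}

/* Reverse the four bytes of a 32-bit value. */
uint32_t SwapLONG(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}

/* 16.16 multiply: widen to 64 bits, multiply, drop the low 16 bits.
 * Relies on >> of a negative value being an arithmetic shift, which is
 * implementation-defined in C but true of GCC. */
fixed_t FixedMul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * (int64_t)b) >> FRACBITS);
}
```

Each shift, mask and OR in SwapLONG is the kind of "C operation" that the one-instruction-per-operation remark refers to.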
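Got FixedDiv implemented in assembly (and working...) but have not attempted any optimization yet:
Code:
FixedDiv:
.global FixedDiv
.set noreorder
.set nomacro
    sra   t0, a0, 31                 # t0 = sign mask of a
    xor   t1, a0, t0
    sub   t1, t1, t0                 # t1 = abs(a)
    sra   t2, a1, 31                 # t2 = sign mask of b
    xor   t3, a1, t2
    sub   t3, t3, t2                 # t3 = abs(b)
    srl   t1, t1, 14                 # t1 = abs(a) >> 14
    slt   t4, t1, t3                 # t4 = (abs(a) >> 14) < abs(b)
    beq   t4, zero, _FixedDiv_test   # saturate when (abs(a) >> 14) >= abs(b), which also covers b == 0
    xor   t0, a0, a1                 # delay slot: sign of the result
    dadd  a0, a0, zero
    dadd  a1, a1, zero
    dsll  a0, a0, 16                 # a0 = (64-bit) a << 16
    ddiv  zero, a0, a1               # LO = a0 / a1
    mflo  v0
_FixedDiv_end:
    jr    ra
    nop
_FixedDiv_test:
    bltz  t0, _FixedDiv_return_INT_MIN
    lui   v0, 0x7FFF                 # delay slot: upper half of INT_MAX
_FixedDiv_return_INT_MAX:
    addiu v0, v0, 0xFFFF             # v0 = 0x7FFFFFFF
    jr    ra
    nop
_FixedDiv_return_INT_MIN:
    addi  v0, zero, 0x8000
    sll   v0, v0, 16                 # v0 = 0x80000000
    jr    ra
    nop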
There's little left to optimize in that code that I can find with what I know. Apart from 3 or 4 instructions, it matches the gcc -O2 output for the C version that I translated by hand.
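For comparison, a C sketch of the routine being translated might look like the following. It is an approximation written to mirror the 64-bit integer divide in the assembly; stock Doom's FixedDiv has the same saturation test, but the exact C source used for 64Doom is an assumption here.

```c
#include <stdint.h>
#include <stdlib.h>

typedef int32_t fixed_t;   /* 16.16 fixed point (assumed) */

fixed_t FixedDiv(fixed_t a, fixed_t b)
{
    /* Saturate when the quotient would not fit in 16.16.
     * Because of the >=, this also catches b == 0.
     * Note: abs(INT32_MIN) is undefined in C; ignored in this sketch. */
    if ((abs(a) >> 14) >= abs(b))
        return ((a ^ b) < 0) ? INT32_MIN : INT32_MAX;

    /* Widen, scale the dividend by 2^16, divide, truncate to 32 bits.
     * Multiplying instead of shifting avoids left-shifting a negative value. */
    return (fixed_t)(((int64_t)a * 65536) / b);
}
```

The saturation branch corresponds to the `_FixedDiv_test` path, and the final line to the `dsll` / `ddiv` / `mflo` sequence.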
Have you considered using a superoptimizer? Sometimes a global search may uncover a sequence of instructions that, while appearing less optimal, yields better overall performance. Mind you, this happens more frequently on CISC processors than RISC ones...
What do you mean by superoptimizer? Do you mean something along the lines of building with Link Time Optimization where the entire set of modules is stuck together as a single IR module and the optimizer runs over that?
As an aside, has anyone attempted to use the updated toolkit to build and run this release of 64Doom? If anyone wants to capture footage from a real console, I'll feed you beta copies of the true-color version which also corresponds to my current new-features branch and is usually more exciting than the release. Borman? ;-)
Superoptimization refers to compilers that, given an instruction set, run global search algorithms to find the sequence of instructions that obtains a given result fastest. A superoptimizer can, for example, substitute instructions that provide equivalent functionality, reorder them, etc. See here: https://en.wikipedia.org/wiki/Superoptimization
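To make the idea concrete, here is a toy brute-force search in C. Everything in it is made up for illustration: the five-operation "instruction set", the names `run`, `matches` and `superoptimize`, and the test vectors. It tries every program of length 1, then 2, and so on, and returns the first one that agrees with the target function on all test inputs. A real superoptimizer would work on an actual ISA with a cost model and would prove equivalence rather than trust a handful of test values.

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_LEN 8

/* Hypothetical one-register machine: v starts as the input x. */
enum op { OP_NOT, OP_INC, OP_DEC, OP_SHL1, OP_ADDX, OP_COUNT };

/* Execute a program of len operations on input x. */
uint32_t run(const int *prog, int len, uint32_t x)
{
    uint32_t v = x;
    for (int i = 0; i < len; i++) {
        switch (prog[i]) {
        case OP_NOT:  v = ~v;     break;
        case OP_INC:  v = v + 1;  break;
        case OP_DEC:  v = v - 1;  break;
        case OP_SHL1: v = v << 1; break;
        case OP_ADDX: v = v + x;  break;   /* add the original input */
        }
    }
    return v;
}

/* Heuristic equivalence check: compare against target on test inputs only. */
int matches(const int *prog, int len, uint32_t (*target)(uint32_t))
{
    static const uint32_t tests[] = {
        0u, 1u, 2u, 7u, 0x1234u, 0x80000000u, 0xFFFFFFFFu
    };
    for (size_t i = 0; i < sizeof tests / sizeof tests[0]; i++)
        if (run(prog, len, tests[i]) != target(tests[i]))
            return 0;
    return 1;
}

/* Shortest-first exhaustive search. Writes the program to out and
 * returns its length, or -1 if nothing up to max_len matches. */
int superoptimize(uint32_t (*target)(uint32_t), int max_len, int *out)
{
    if (max_len > MAX_LEN)
        max_len = MAX_LEN;

    for (int len = 1; len <= max_len; len++) {
        int prog[MAX_LEN] = { 0 };
        for (;;) {
            if (matches(prog, len, target)) {
                for (int i = 0; i < len; i++)
                    out[i] = prog[i];
                return len;
            }
            /* Advance to the next program, odometer style. */
            int i = 0;
            while (i < len && ++prog[i] == OP_COUNT)
                prog[i++] = 0;
            if (i == len)
                break;          /* every program of this length tried */
        }
    }
    return -1;
}

/* Two example targets. */
uint32_t negate(uint32_t x) { return 0u - x; }
uint32_t times3(uint32_t x) { return x * 3u; }
```

Calling `superoptimize(negate, 3, prog)` finds a two-operation program even though the machine has no negate instruction, which is the sort of non-obvious substitution the technique is meant to turn up. The search space grows as 5^len here, so this only scales to very short sequences.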
I read the link. That's basically what I'm doing by hand: getting the optimal instruction sequence for each straight-line run of code (basic blocks, not including control transfer). The MIPS ISA is basically loads, stores, logic gate operations and a few higher level abstractions wrapped in an ALU. See SwapLONG in m_swap.S for a decent example of how I've gotten something with multiple variations into the most optimal form I could find. Hell, even SwapSHORT is as minimal as it can get: 1 MIPS instruction for 1 C operation.
My assembly versions of R_DrawColumn and R_DrawSpan are more "superoptimized" than what GCC produces at any optimisation level. The functions do not need or use a frame pointer and spill no registers whatsoever. They use only temporary registers, a lot of them, to avoid the need to spill. They still follow the MIPS C ABI to the letter of the spec. They make no function calls, use delay slots for useful work, return after 5 instructions when there is nothing large enough to draw, and have almost 10 fewer instructions in their inner loop body than any GCC-generated code. Altogether each function is 20 - 30 instructions shorter than any GCC-generated code. Also, my code schedules every last instruction to avoid all pipeline stalls from memory loads. I verified from documentation that there is memory-memory forwarding for potential data hazards like the use of "t0" in the following idiom seen at the end of each texture-mapping inner loop:
Code:
LB t0, 0(t8)
SB t0, 0(t9)
Memory-memory forwarding from 0(t8) to 0(t9) allows this to work without any extra stall, while also allowing the value in "t0" to be used in the instruction following the "SB" without incurring another hazard/penalty/stall, for example:
Code:
# HAZARD
# second use of t0 causes stall
LB t0, 8(t8)
ADDIU t1, t0, 4

# no hazard
# m-m fwd'ing
# no use of t0 causes stall
LB t0, 8(t8)
SB t0, 8(t9)
ADDIU t1, t0, 4
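For readers without the Doom source at hand, the shape of the C routine that the assembly replaces is roughly as follows. This is a sketch, not the 64Doom code: stock Doom keeps this state in `dc_*` globals, and the `struct column`, `draw_column` and `pitch` names are hypothetical stand-ins so the example is self-contained. It assumes 8-bit pixels and a 128-texel source column.

```c
#include <stdint.h>

#define FRACBITS 16
typedef int32_t fixed_t;   /* 16.16 fixed point (assumed) */

/* Hypothetical bundle of the per-column state. */
struct column {
    int yl, yh;               /* first and last row, inclusive */
    fixed_t frac, fracstep;   /* starting texture coordinate and per-row step */
    const uint8_t *source;    /* 128-texel texture column */
    const uint8_t *colormap;  /* 256-entry lighting table */
};

/* Draw one vertical column into dest, stepping pitch bytes per row. */
void draw_column(const struct column *c, uint8_t *dest, int pitch)
{
    int count = c->yh - c->yl;
    fixed_t frac = c->frac;

    if (count < 0)
        return;               /* early return: nothing to draw */

    /* The texture-mapping loop: load texel, remap, store, advance. */
    do {
        *dest = c->colormap[c->source[(frac >> FRACBITS) & 127]];
        dest += pitch;
        frac += c->fracstep;
    } while (count--);
}
```

The `count < 0` test is the early-return condition described above, and the body of the `do { ... } while (count--);` loop ends in the load-then-store pair that the LB/SB idiom implements.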