Doom for Nintendo 64

jnmartin84 · Dec 12, 2017

I just pushed updated code along with the dependencies that had been missing for the last three years and an updated 64Doom ROM Builder Toolkit to GitHub:

https://github.com/jnmartin84/64doom

Enjoy.

jnmartin84 · Dec 14, 2017

As for changes to 64Doom in this release:

* very fast assembly implementations of memcpy and memset, lifted from MIPS Technologies code in the GNU C Library and Android

* optimal hand-rolled assembly routines for endianness conversion -- SwapSHORT and SwapLONG -- that I cannot find any way to make better / shorter; they are basically one MIPS instruction per C operation at this point (see "m_swap.S")

* optimal-ish hand-rolled assembly routine for FixedMul; I didn't get around to FixedDiv yet but will push a version soon (see "m_fixedmul.S")

* hand-rolled assembly for R_DrawColumn and R_DrawSpan, the two functions responsible for drawing the entire in-game display except for the status bar. I took a lot of time to write these such that they have no prologue/epilogue and spill no registers at any time. They only use caller-saved registers along with the argument and return value registers so nothing has to be saved or restored. They have early-return conditions that can be met after executing 4 to 5 instructions for columns and spans smaller than a pixel. Also, their texture-mapping loops (the "do { ... } while(count--);" loops at the end of the C versions) are shorter than any GCC-generated code at any optimization level and the entire functions themselves are shorter than any GCC-generated code by 20 - 30 instructions each (see "R_DrawColumn.S" and "R_DrawSpan.S")
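For reference, the operation that FixedMul implements is a 16.16 fixed-point multiply. Below is a minimal C sketch, assuming the usual Doom convention of 16 fractional bits. The names fixed_t and fixed_mul are illustrative and are not taken from m_fixedmul.S.

```c
#include <stdint.h>

#define FRACBITS 16

typedef int32_t fixed_t;   /* 16.16 fixed point: 65536 represents 1.0 */

/*
 * Multiply two 16.16 values: form the full 64-bit product, then drop the
 * extra 16 fractional bits. Right-shifting a negative value is
 * implementation-defined in C; this sketch assumes an arithmetic shift.
 */
fixed_t fixed_mul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * (int64_t)b) >> FRACBITS);
}
```

With these definitions, fixed_mul(2 * 65536, 3 * 65536) gives 6 * 65536, i.e. 2.0 times 3.0 is 6.0.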

jnmartin84 · Dec 15, 2017

Got FixedDiv implemented in assembly (and working... ) but have not attempted any optimization yet:

Code:

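FixedDiv:
        .global FixedDiv
        .set    noreorder
        .set    nomacro

        # a0 = a, a1 = b (16.16 fixed point), result in v0
        sra     t0,     a0,     31
        xor     t1,     a0,     t0
        subu    t1,     t1,     t0              # t1 = abs(a); subu never traps on overflow
        sra     t2,     a1,     31
        xor     t3,     a1,     t2
        subu    t3,     t3,     t2              # t3 = abs(b)
        srl     t1,     t1,     14              # t1 = abs(a) >> 14
        slt     t4,     t1,     t3              # t4 = (abs(a) >> 14) < abs(b)
        beq     t4,     zero,   _FixedDiv_test  # overflow when (abs(a) >> 14) >= abs(b); also catches b == 0
        xor     t0,     a0,     a1              # delay slot: sign bit of a ^ b
        dadd    a0,     a0,     zero
        dadd    a1,     a1,     zero
        dsll    a0,     a0,     16              # a0 = (64-bit) a << 16
        ddiv    zero,   a0,     a1
        mflo    v0                              # v0 = quotient

_FixedDiv_end:
        jr      ra
        nop

_FixedDiv_test:
        bltz    t0,     _FixedDiv_return_INT_MIN
        lui     v0,     0x7FFF                  # delay slot: v0 = 0x7FFF0000

_FixedDiv_return_INT_MAX:
        ori     v0,     v0,     0xFFFF          # v0 = 0x7FFFFFFF (addiu would sign-extend 0xFFFF)
        jr      ra
        nop

_FixedDiv_return_INT_MIN:
        lui     v0,     0x8000                  # v0 = 0x80000000
        jr      ra
        nop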
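For readers who do not read MIPS assembly, here is a C sketch of what a routine like this computes. It is modeled on the FixedDiv in the original Doom source: an overflow test, then a 64-bit divide. The names fixed_t, fixed_div and abs_u32 are illustrative. Taking the absolute value through unsigned arithmetic is a choice made here to avoid undefined behavior on INT32_MIN.

```c
#include <stdint.h>

#define FRACBITS 16

typedef int32_t fixed_t;   /* 16.16 fixed point */

/* Absolute value computed in unsigned arithmetic, so INT32_MIN is safe. */
static uint32_t abs_u32(int32_t v)
{
    return v < 0 ? 0u - (uint32_t)v : (uint32_t)v;
}

/*
 * Divide two 16.16 values. If the quotient could not fit in 32 bits
 * (which includes b == 0), saturate to INT32_MIN or INT32_MAX according
 * to the sign the result would have had.
 */
fixed_t fixed_div(fixed_t a, fixed_t b)
{
    if ((abs_u32(a) >> 14) >= abs_u32(b))
        return ((a ^ b) < 0) ? INT32_MIN : INT32_MAX;

    /* Multiply by 65536 rather than shifting, since a may be negative. */
    return (fixed_t)(((int64_t)a * 65536) / b);
}
```

The overflow test corresponds to the srl/slt/branch sequence in the assembly, and the 64-bit divide corresponds to dsll/ddiv/mflo.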

jnmartin84 · Dec 19, 2017

As far as I can tell with what I know, there is little left to optimize in that code. Apart from 3 or 4 instructions, it is nearly identical to the gcc -O2 output for the C version, which I translated by hand.

jollyroger · Dec 16, 2017

Have you considered using a superoptimizer? Sometimes a global search may uncover a sequence of instructions that, while appearing less optimal, may yield better overall performance. Mind you, this happens more frequently on CISC processors than RISC ones...

jnmartin84 · Dec 19, 2017

jollyroger said:

Have you considered using a superoptimizer? Sometimes a global search may uncover a sequence of instructions that, while appearing less optimal, may yield better overall performance. Mind you, this happens more frequently on CISC processors than RISC ones...

What do you mean by superoptimizer? Do you mean something along the lines of building with Link Time Optimization where the entire set of modules is stuck together as a single IR module and the optimizer runs over that?

jnmartin84 · Dec 20, 2017

As an aside, has anyone attempted to use the updated toolkit to build and run this release of 64Doom?

If anyone wants to capture footage from a real console, I'll feed you beta copies of the true-color version which also corresponds to my current new-features branch and is usually more exciting than the release.

Borman? ;-)

Borman · Dec 20, 2017

Holidays are making it rough, but it's on my list of things I want to do

jnmartin84 · Dec 20, 2017

Borman said:

Holidays are making it rough, but it's on my list of things I want to do

Oh yeah I definitely understand

jollyroger · Dec 21, 2017

jnmartin84 said:

What do you mean by superoptimizer? Do you mean something along the lines of building with Link Time Optimization where the entire set of modules is stuck together as a single IR module and the optimizer runs over that?

Superoptimization refers to compilers that, given an instruction set, run a global search to find the sequence of instructions that produces a given result fastest. A superoptimizer can, for example, substitute instructions that provide similar functionality, reorder them, etc.

See here: https://en.wikipedia.org/wiki/Superoptimization
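To make the idea concrete, here is a toy brute-force search in C. Everything in it is an assumption made for illustration: the five-operation single-register instruction set is invented (it is not MIPS), the target function is f(x) = 2*x + 2, and candidates are accepted on a handful of test inputs only. A real superoptimizer would also need to verify that the winning sequence is equivalent for every input.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy single-register instruction set (hypothetical, not MIPS). */
enum { OP_NOT, OP_NEG, OP_INC, OP_DEC, OP_SHL1, OP_COUNT };

static uint32_t run_op(int op, uint32_t x)
{
    switch (op) {
    case OP_NOT: return ~x;
    case OP_NEG: return 0u - x;
    case OP_INC: return x + 1u;
    case OP_DEC: return x - 1u;
    default:     return x << 1;     /* OP_SHL1 */
    }
}

/* Run a whole candidate sequence on one input. */
static uint32_t run_seq(const int *seq, int len, uint32_t x)
{
    for (int i = 0; i < len; i++)
        x = run_op(seq[i], x);
    return x;
}

/* The function we want the shortest program for. */
static uint32_t target(uint32_t x)
{
    return 2u * x + 2u;
}

/*
 * Try every sequence of length 1..max_len, shortest first. Returns the
 * length of the first sequence that matches target() on all test inputs
 * and leaves that sequence in out; returns 0 if none is found.
 */
int superoptimize(int *out, int max_len)
{
    static const uint32_t tests[] = { 0u, 1u, 2u, 7u, 0x7FFFFFFFu, 0xFFFFFFFFu };
    const size_t ntests = sizeof tests / sizeof tests[0];

    for (int len = 1; len <= max_len; len++) {
        long total = 1;
        for (int i = 0; i < len; i++)
            total *= OP_COUNT;              /* OP_COUNT^len candidates */

        for (long code = 0; code < total; code++) {
            long c = code;
            for (int i = 0; i < len; i++) { /* decode base-OP_COUNT digits */
                out[i] = (int)(c % OP_COUNT);
                c /= OP_COUNT;
            }
            size_t t = 0;
            while (t < ntests && run_seq(out, len, tests[t]) == target(tests[t]))
                t++;
            if (t == ntests)
                return len;
        }
    }
    return 0;
}
```

The naive translation of 2*x + 2 is shift, increment, increment (three operations). The search instead finds the two-operation sequence increment, shift, because 2*(x + 1) equals 2*x + 2.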

jnmartin84 · Jan 1, 2018

I read the link. That's basically what I'm doing by hand: getting the optimal instruction sequence for each straight-line run of code (basic blocks, not including control transfer).

The MIPS ISA is basically loads, stores, logic gate operations and a few higher level abstractions wrapped in an ALU.

See SwapLONG in m_swap.S for a decent example of how I took something with multiple possible variations and got it down to the most optimal form I could find.

Hell, even SwapSHORT is as minimal as it can get - 1 MIPS instruction for 1 C operation.
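As a point of comparison, these are the C operations such byte-swap routines have to implement. This is a minimal sketch with illustrative names (swap_short, swap_long) and fixed-width types; it is not copied from m_swap.S or its C counterpart.

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit value: 0x1234 becomes 0x3412. */
uint16_t swap_short(uint16_t x)
{
    return (uint16_t)((x >> 8) | (x << 8));
}

/* Reverse the four bytes of a 32-bit value: 0x12345678 becomes 0x78563412. */
uint32_t swap_long(uint32_t x)
{
    return (x >> 24)                    /* byte 3 -> byte 0 */
         | ((x >> 8) & 0x0000FF00u)     /* byte 2 -> byte 1 */
         | ((x << 8) & 0x00FF0000u)     /* byte 1 -> byte 2 */
         | (x << 24);                   /* byte 0 -> byte 3 */
}
```

Counting the shifts, masks and ORs in swap_long gives a sense of what "one MIPS instruction per C operation" means as a lower bound for a direct translation.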

jnmartin84 · Jan 1, 2018

My assembly versions of R_DrawColumn and R_DrawSpan are more "superoptimized" than what GCC produces at any optimisation level.

Both functions dispense with a frame pointer entirely and never spill a register. They use only temporary registers, a lot of them, to avoid the need to spill.

They still follow the MIPS C ABI to the letter of the spec.

They make no function calls, use delay slots for useful work, return after 5 instructions when too small to draw and have almost 10 fewer instructions in their inner loop body compared to any GCC generated code. Altogether each function is 20 - 30 instructions shorter than any GCC generated code.
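For context, the early return and the inner loop being counted have roughly the following shape in C. This sketch is modeled on the column drawer in the original Doom source, but the function name and all parameters are hypothetical: the real function reads its state from globals rather than taking arguments.

```c
#include <stdint.h>

#define FRACBITS 16

/*
 * Draw one vertical column of texels into a frame buffer.
 * dest     - first destination pixel
 * pitch    - bytes between rows of the frame buffer
 * source   - 128-texel texture column
 * colormap - 256-entry lighting table
 * frac     - 16.16 starting texture coordinate
 * fracstep - 16.16 texture step per screen pixel
 * count    - number of pixels minus one; negative means nothing to draw
 */
void draw_column(uint8_t *dest, int pitch,
                 const uint8_t *source, const uint8_t *colormap,
                 int32_t frac, int32_t fracstep, int count)
{
    if (count < 0)
        return;                     /* early out: column smaller than a pixel */

    do {
        /* Integer part of frac selects the texel, wrapped to 128 texels. */
        *dest = colormap[source[(frac >> FRACBITS) & 127]];
        dest += pitch;              /* next row of the frame buffer */
        frac += fracstep;           /* advance through the texture */
    } while (count--);
}
```

Each iteration does a shift, a mask, two dependent byte loads, a store and two additions, which is the work the hand-written loop has to schedule around load latency.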

Also, my code schedules every last instruction present to avoid all pipeline stalls from memory loads.

I verified from documentation that there is memory-memory forwarding for potential data hazards like the use of "t0" in the following idiom seen at the end of each texture-mapping inner loop:
Code:
LB t0, 0(t8)
SB t0, 0(t9)
Memory-memory forwarding from 0(t8) to 0(t9) lets this run without an extra stall. It also means the value in "t0" can be used in the instruction following the "SB" without incurring another hazard, penalty, or stall. For example:
Code:
# HAZARD
# using t0 right after the load causes a stall
LB t0, 8(t8)
ADDIU t1, t0, 4

# no hazard
# m-m fwd'ing feeds the SB
# neither use of t0 causes a stall
LB t0, 8(t8)
SB t0, 8(t9)
ADDIU t1, t0, 4
