How do you like... 40,000 matrix multiplications in 1/60 sec, or put another way, 160,000 vector transformations? That's 9,600,000 vectors per second. That's the figure I got a few minutes ago when I finished my code. If I cap the frame rate at a steady 30 FPS there will be twice the processing time per frame, and the vector transformation code I am working on will be even faster. I hope the GPU transfer mechanism will be equally fast. SH4 asm ownz...
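The arithmetic behind those figures can be sketched in C. This is only a back-of-the-envelope check: the function names are made up for illustration, and the 200 MHz clock in the usage note is an assumption added here, not a number from the post.

```c
#include <assert.h>

/* Vectors transformed per second, given matrix multiplies per frame.
   Each 4x4 by 4x4 multiply is counted as four 4-component vector
   transforms, which is how 40,000 multiplies become 160,000 vectors. */
long vectors_per_second(long matmuls_per_frame, long fps)
{
    assert(matmuls_per_frame >= 0 && fps > 0);
    return matmuls_per_frame * 4 * fps;
}

/* Average CPU cycles available per vector at a given clock rate.
   The clock rate is supplied by the caller; nothing here is measured. */
double cycles_per_vector(double clock_hz, long vectors_per_sec)
{
    assert(clock_hz > 0.0 && vectors_per_sec > 0);
    return clock_hz / (double)vectors_per_sec;
}
```

With 40,000 multiplies per frame at 60 FPS, vectors_per_second gives the 9,600,000 quoted above. Assuming a 200 MHz SH4, that leaves roughly 21 cycles per transformed vector, including loads and stores.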
Damn, I'm envious. The last time I had time to optimize the last cycle out of asm code was on PSX. In the end my demo ran at ~30% CPU/GPU utilization, because there weren't better 3D models, nor time to use the spare capacity for something else. BUT my engine was optimized to what were, at that time, impressive numbers (at least for me). It was a speedup of an order of magnitude over the library functions, mainly because of the super slow memory on PSX (I didn't have to write transformed data back into memory, but could fill the GPU FIFO directly). On GameCube, I never came close to the point where I had to optimize anything; everything is just so blazing fast and simple there (the GPU takes matrices and 3D vertex data and does everything else once it's set up, which is the only complex part). (Not that I had any 3D models or effects which actually used the full available processing power...)
The official tools + DevBox feature a CPU simulator, so you can track slowdowns at the pipeline level. Almost everything I have written is in asm, and with the aid of the simulator everything is 100% pipelined: no stalls and no wasted cycles. As for the PSX transfer thing you mentioned, on DC you can likewise use the CPU's store queues to send data directly to the GPU (32-byte writes, actually to the TA FIFO), so you save time by avoiding writing back to RAM and then DMA-ing to the GPU. Damn, the DC is a treat to work on...
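The store-queue addressing mentioned here can be sketched in plain C as address arithmetic only. This is a minimal sketch, assuming the SH4's store-queue window at 0xE0000000 and the QACR0/QACR1 area field in bits 4..2; the function names are hypothetical, and masking the QACR value down to those bits is a choice made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Base of the SH4 store-queue window (0xE0000000..0xE3FFFFFF). */
#define SQ_BASE 0xE0000000u

/* Address to write through so that a later "pref" bursts 32 bytes to
   `dest`. The low 26 bits come from the destination; the upper bits
   select the store-queue window. `dest` must be 32-byte aligned. */
uint32_t sq_address(uint32_t dest)
{
    assert((dest & 31u) == 0);
    return SQ_BASE | (dest & 0x03FFFFFFu);
}

/* Value for QACR0/QACR1: destination address bits 28..26 placed in
   register bits 4..2. Shifting right by 24 and masking with 0x1C does
   both steps at once. */
uint32_t qacr_value(uint32_t dest)
{
    return (dest >> 24) & 0x1Cu;
}
```

For a destination of 0x8C010020, this gives a store-queue address of 0xE0010020 and a QACR value of 0x0C. Bit 5 of the address decides which of the two 32-byte queues a write lands in, so a 64-byte matrix fills both.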
Wow, respect. To me, SuperH assembly appeared to be the work of Cthulhu himself, so I quickly fled back to my trusted MIPS. I'd love to see the code, though.
This was the first time I wrote asm, with no previous experience, so I did not have any problems...

void Mul4x4(Mat4x4 *dest,Mat4x4 *A,Mat4x4 *B)
{
;---Load A in XMTRX.---------------------------
fschg                   ;toggle SZ: 64-bit paired fmov.
mov     r6,r7           ;r7=r6.
frchg                   ;change register bank.
mov     r6,r8           ;r8=r6.
fmov    @r5+,dr0        ;load matrix A to xmtrx.
add     #H'10,r7        ;r7=r6+16.
fmov    @r5+,dr2        ;load matrix A to xmtrx.
add     #H'20,r8        ;r8=r6+32.
fmov    @r5+,dr4        ;load matrix A to xmtrx.
mov     #H'E0,r0        ;r0=store queue code.
fmov    @r5+,dr6        ;load matrix A to xmtrx.
mov     #H'FC,r1        ;for sq code.
fmov    @r5+,dr8        ;load matrix A to xmtrx.
mov     #H'18,r10       ;24 shifts.
fmov    @r5+,dr10       ;load matrix A to xmtrx.
shld    r10,r0          ;24 shifts completed,r0=0xE0000000.
fmov    @r5+,dr12       ;load matrix A to xmtrx.
shad    r10,r1          ;24 shifts completed,r1=0xFC000000.
fmov    @r5+,dr14       ;load matrix A to xmtrx.
mov     #H'FF,r2        ;r2=0xFFFFFFFF.
frchg                   ;change register bank.
mov     r6,r9           ;r9=r6.
fschg                   ;toggle SZ: back to 32-bit fmov.s.
add     #H'30,r9        ;r9=r6+48.
;---Load B Transposed+Multiply.----------------
fmov.s  @r6+,fr0        ;Load B.
add     #H'40,r4        ;@dest+=64 bytes for the transfers later.
fmov.s  @r7+,fr1        ;Load B.
mov     #H'FF,r11       ;r11=0xFFFFFFFF.
fmov.s  @r8+,fr2        ;Load B.
shad    r10,r11         ;24 shifts completed,r11=0xFF000000.
fmov.s  @r9+,fr3        ;Load B.
nop                     ;pipeline pad.
mov     r4,r3           ;r3=r4=@dest+64.
nop                     ;pipeline pad.
fmov.s  @r6+,fr4        ;Load B.
add     #H'40,r11       ;r11+=64.
fmov.s  @r7+,fr5        ;Load B.
ftrv    xmtrx,fv0       ;fv0 now contains dest's first column.
fmov.s  @r8+,fr6        ;Load B.
sub     r1,r2           ;r2=0x3FFFFFF.
fmov.s  @r9+,fr7        ;Load B.
and     r4,r2           ;1st step for sq address generation.
fmov.s  @r6+,fr8        ;Load B.
or      r2,r0           ;2nd step for sq address generation.
fmov.s  @r7+,fr9        ;Load B.
ftrv    xmtrx,fv4       ;fv4 now contains dest's second column.
fmov.s  @r8+,fr10       ;Load B.
shlr16  r3              ;sq unmodified address shift.
fmov.s  @r9+,fr11       ;Load B.
shlr8   r3              ;shifts completed,info for QACRs ready.
fmov.s  @r6,fr12        ;Load B.
nop                     ;pipeline pad.
fmov.s  @r7,fr13        ;Load B.
ftrv    xmtrx,fv8       ;fv8 now contains dest's third column.
fmov.s  @r8,fr14        ;Load B.
nop                     ;pipeline pad.
fmov.s  @r9,fr15        ;Load B.
nop                     ;pipeline pad.
mov     r3,@-r11        ;Fill QACR1.
nop                     ;pipeline pad.
ftrv    xmtrx,fv12      ;fv12 now contains dest's fourth column.
mov     r3,@-r11        ;Fill QACR0.
;---Send Data to store queues.----------------
fschg                   ;64 bit transfers to sq.
nop                     ;pipeline pad.
fmov    dr14,@-r0
nop
fmov    dr12,@-r0
nop
fmov    dr10,@-r0
nop
fmov    dr8,@-r0
nop
pref    @r0             ;sq write.
nop                     ;pipeline pad.
fmov    dr6,@-r0
nop
fmov    dr4,@-r0
nop
fmov    dr2,@-r0
nop
fmov    dr0,@-r0
nop
pref    @r0             ;sq write.
fschg                   ;toggle SZ: back to 32-bit transfers.
}
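For readers who don't speak SuperH, here is a rough C model of what the ftrv steps in the listing appear to compute. It is a sketch based on one reading of the code, not the author's reference: ftrv_model and mul4x4_model are hypothetical names, matrices are treated as flat arrays of 16 floats, and the model ignores the store queues, the pipelining, and the reduced accuracy of the real instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Model of "ftrv xmtrx,fvN": xf[0..15] is the back-bank matrix with
   xf[0..3] as its first column; v is replaced by the product. */
void ftrv_model(const float xf[16], float v[4])
{
    float out[4];
    assert(xf != NULL && v != NULL);
    for (int i = 0; i < 4; ++i) {
        out[i] = xf[i]      * v[0] + xf[i + 4]  * v[1]
               + xf[i + 8]  * v[2] + xf[i + 12] * v[3];
    }
    for (int i = 0; i < 4; ++i) {
        v[i] = out[i];
    }
}

/* Mirrors the load pattern of the listing: A goes into XMTRX in memory
   order, each vector is gathered from B with a stride of four floats,
   and each result is stored contiguously into dest. */
void mul4x4_model(float dest[16], const float a[16], const float b[16])
{
    float tmp[16]; /* the asm loads everything before storing, so copy last */
    assert(dest != NULL && a != NULL && b != NULL);
    for (int k = 0; k < 4; ++k) {
        float v[4] = { b[k], b[4 + k], b[8 + k], b[12 + k] };
        ftrv_model(a, v);
        for (int i = 0; i < 4; ++i) {
            tmp[4 * k + i] = v[i];
        }
    }
    for (int m = 0; m < 16; ++m) {
        dest[m] = tmp[m];
    }
}
```

Under this reading, dest[4k+i] is the sum over j of A[4j+i] * B[4j+k]. Whether that is the product you want depends on how your Mat4x4 is laid out in memory, so it is worth checking the model against a known pair of matrices before relying on it.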