At 19:00 minutes, Bill Guschwan (@mistagogue), who was a tool dev at Sony, is talking about the DMA (bus) while double buffer swapping: "Load up the polygon transfer loop, staying off the bus so the graphics engine could use it." What does he mean? Load the vertex data into the CPU cache, right after a swap?
He means two things: 1) The cache trick is effective on parts of code that are consecutively stored. In other words, make a thick block of 4 KB with no calls and you're good to go for better speeds in general. 2) He also mentions parallelism with GTE and CPU working together, so you can snag into the code some CPU operations while the GTE is doing its job. The question is all about using GTE macros to fill polygon data rather than PSY-Q's functions, because the latter would break the 4 KB chain. In other words, don't bother with SDK functions and do all drawing at the lowest level and with your own code. Fun fact: LIBHMD, despite being Sony's code and supposedly optimized, runs like crap and barely bothers to use any parallelism and in some cases completely kills cache optimizations.
ha cool! thanks
Code:
// saves the current stack pointer and loads in the new one
__asm__ volatile ("sw $29,(savesp)");
__asm__ volatile ("la $29,0x1f8003f0");
// function call with no calls to other functions
cacheFunction(ArrayCache);
// restores the old stack pointer
__asm__ volatile ("lw $29,(savesp)");
It's been a while since I looked at the PSX low-level stuff, but from what I recall the GTE takes a pointer to a draw list at the start of each frame. What he seems to be saying is: keep your polygon-to-drawlist code branch-free, under 4k and in the I-cache, so that you don't have to reach out and touch the data bus to generate your drawlists on every frame. This keeps the data bus from stalling and the DMA hardware free to do other things. I'm curious, however, since the PSX was designed at a very low level to be a cooperatively multitasking system, how context switches affect what is in the I-cache. I would imagine that you would have a pretty high eviction rate, which would render the argument moot. If you are single threaded it shouldn't be a problem though.
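To make the "keep the loop inside the 4 KB I-cache" argument concrete, here is a small host-side model, not PSX code. It assumes a direct-mapped cache with 16-byte lines; only the 4 KB size comes from this thread, the rest is an assumption, and all names (`ICache`, `run_loop`, and so on) are made up for the sketch. The helper call is simplified to happen once at the end of each pass.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ICACHE_SIZE 4096u                     /* from the thread: 4 KB */
#define LINE_SIZE   16u                       /* assumption: 16-byte lines */
#define NUM_LINES   (ICACHE_SIZE / LINE_SIZE)

typedef struct {
    uint32_t tag[NUM_LINES];
    uint8_t  valid[NUM_LINES];
} ICache;

/* Fetch one instruction word; returns 1 on a miss (bus access), 0 on a hit. */
static int icache_fetch(ICache *c, uint32_t addr)
{
    uint32_t line = (addr / LINE_SIZE) % NUM_LINES;
    uint32_t tag  = addr / ICACHE_SIZE;
    if (c->valid[line] && c->tag[line] == tag)
        return 0;
    c->valid[line] = 1;
    c->tag[line]   = tag;
    return 1;
}

/* Fetch every word in [start, start + len) and count the misses. */
static unsigned fetch_range(ICache *c, uint32_t start, uint32_t len)
{
    unsigned misses = 0;
    for (uint32_t a = start; a < start + len; a += 4)
        misses += (unsigned)icache_fetch(c, a);
    return misses;
}

/* Run a loop body `passes` times, calling a helper once per pass
 * (helper_len == 0 means no call). Returns the total miss count. */
static unsigned run_loop(uint32_t body, uint32_t body_len,
                         uint32_t helper, uint32_t helper_len,
                         unsigned passes)
{
    ICache c;
    unsigned misses = 0;
    assert(body % 4 == 0 && helper % 4 == 0);
    memset(&c, 0, sizeof c);
    for (unsigned p = 0; p < passes; p++) {
        misses += fetch_range(&c, body, body_len);
        misses += fetch_range(&c, helper, helper_len);
    }
    return misses;
}
```

In this model a 2 KB loop body only misses on its first pass. A helper that sits right behind the body costs a few extra misses once, while a helper located exactly 4 KB away maps onto the same lines and causes misses on every pass, which is the kind of bus traffic the quote is about avoiding.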
What you posted is more like the usual trick to use the scratchpad as the stack, which speeds up most operations requiring the stack to store structures, such as matrices. The 4 KB cache always works on any code that lives in the block, no matter what the stack value is. In other words, if you have code that doesn't use the stack but just registers, it would work exactly the same.

Nah, the GTE is simply used for specific instructions, like matrix manipulation, linear interpolation, and coordinate-to-screen transformations. Draw lists are handled either at CPU level or via DMA, depending on whatever you need to compute. For example, you can use the DMA to build an OTag implementing reverse order, while on the CPU you'd have a "normal" sort/clear direction. In these cases, the GTE is never used for anything because its implementation was aimed at other purposes.

Not just the DMA, but everything that would require any form of parallelism. The GTE takes a number of cycles for each operation; there should be some official documents telling you exactly how many cycles a GTE operation takes to complete. In the meantime, you can use the CPU to do some other work in cycle steal mode because all GTE operations are non-blocking. All the operations must be single threaded, of course (the BIOS can implement a scheduler, extremely similar to Fibers). Usually parallelism between the coprocessors is limited to the current scope. Say you have MDEC operations pending: in the meantime you can execute some logic and wrap it around a poll loop for other stuff executing in the background, while the main operation is still doing its load of work.
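The difference between a reverse-linked and a "normal" ordering table can be sketched on a host machine. This is only a model: entries hold array indices instead of 24-bit RAM addresses, `OT_END` as the terminator value is an assumption, and every function name here is hypothetical rather than a PSY-Q call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OT_END 0xFFFFFFu  /* assumption: end-of-list marker */

/* Reverse order: entry i links to entry i - 1, so drawing starts at
 * the last entry and finishes at entry 0. */
static void ot_clear_reverse(uint32_t *ot, size_t n)
{
    assert(n > 0 && n < OT_END);
    ot[0] = OT_END;
    for (size_t i = 1; i < n; i++)
        ot[i] = (uint32_t)(i - 1);
}

/* "Normal" order: entry i links to entry i + 1. */
static void ot_clear_forward(uint32_t *ot, size_t n)
{
    assert(n > 0 && n < OT_END);
    for (size_t i = 0; i + 1 < n; i++)
        ot[i] = (uint32_t)(i + 1);
    ot[n - 1] = OT_END;
}

/* Follow the links from `head`, recording the visit order.
 * Returns how many entries were visited (at most `max`). */
static size_t ot_walk(const uint32_t *ot, uint32_t head,
                      uint32_t *order, size_t max)
{
    size_t n = 0;
    for (uint32_t cur = head; cur != OT_END && n < max; cur = ot[cur])
        order[n++] = cur;
    return n;
}
```

With four entries, walking the reverse table from index 3 visits 3, 2, 1, 0, while walking the forward table from index 0 visits 0, 1, 2, 3. Which end you treat as "far" then decides the draw order.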
You can even start loading up values for the next GTE instruction while the previous one is running. You should only do it with instructions that don't share inputs or outputs, but you can probably get away with it if you get the timing right. GTE only stalls the CPU if you execute an instruction or read a register while an instruction is running.
Thanks guys. Yes, that code is using the scratchpad as a stack, which is only 1 KB (data cache: is this program data & BSS? No function calls nor large loops). Thanks Gemini. I.e. if a 3D object is small enough, it can be used with GsSortObject4. The instruction cache is 4 KB: program 'text', right? It is used automatically but is interrupted (cache miss) by function calls (including Sony lib functions) and/or loops. So hardware-wise: the CPU with I- and D-cache. The GTE is really a maths co-processor. The GPU is a graphics drawer/rasterizer (with a 2 KB texture cache). DMA controls the bus between the three processors. So "Load up the polygon transfer loop, staying off the bus so the graphics engine could use it." in layman's terms means: fill the OT (GsSortObject4) in one go, working the GTE/CPU (preferably ordering objects which require the same textures together), while the GPU is drawing. I need to read: Everything You Have Always Wanted to Know about the Playstation But Were Afraid to Ask. Complicated subject. I appreciate the input; it'll take me a while to process it.
Pretty much, yeah. Sony docs suggest executing most CPU code during longer GTE operations. Usually these have a "_b" macro counterpart where it drops one or two nops. RTPx, NCLIP, NCCx, and NCDx are typical examples of these operations. DMPSX is also a good benchmark tool to understand if you're doing parallel operations correctly: it simply throws an error if you're using the GTE too often, which would result in an exception IIRC. Of course, you still need some logic checks to make the best use of optimizations, but once you figure out how it works it's a piece of cake.

Data cache doesn't really change much in terms of performance, depending on what you need. The only difference is the scratchpad reading almost 5 times faster than regular RAM. It's a godsend for critical code of all kinds requiring frequent memory reads/writes, like a sort algorithm.

Better avoid those. TMD is a terrible format, bloated like hell in terms of storage and code binaries. It's best to just come up with a stripped format supporting similar features; that way you can keep RAM usage at a bare minimum.

Yup, code is stored in text segments, but technically it doesn't make any difference, especially when you have overlays. It's just a compiler thing.

Technically the GPU DMA executes drawing while other code is being executed, but that usually comes when you sort OTag lists somewhere at the end of a main loop. The GTE parts are usually a step before that or somewhere after your logic.

Or just read the official docs that come with PSY-Q. Some are weird and awkward, while others contain precious info about internals and specs. For example, the overview document has a chapter dedicated to explaining how to optimize space on disk without the need of a 30 MB padding file appended at the end; loads of info in there, even some of the official formats explained in detail.
Joshua Walker's doc is probably too hardcore, and in some cases you don't need to know that much, just like devs back in the day of PSX development. Say, you don't need to know too much about the SPU missing register interaction; Sony's doc would just tell you "if you have problems, try this other approach instead", which does explain the problem but doesn't bother to go into much detail.
The data cache is the scratchpad. The LR33300 that the PS1 CPU is based on could use it either as cache or as scratchpad, but the modifications that Sony made disabled using it as a cache. There is a FIFO write cache, so you can get a speed-up if you spread out your writes. I've never seen an exception. On real hardware, AFAIK, if you talk to the GTE too quickly (i.e. execute an op and then read the results immediately) then the stalling mechanism just fails, as it hasn't registered that the opcode has started when you try to read. The GTE has a lot of edge conditions; all you can do is follow Sony's recommendations to avoid them. Some of the opcodes don't work right, so they aren't documented, etc.
Yeah, I mixed that up with GTE initialization. That's the only case where accessing GTE instructions would fire an exception, my bad. I wasn't aware about extra opcodes to operate the GTE.
The first 6 bits are an index into a microcode table of some sort. It looks like the microcode for each is a fixed size, because some of the big ones don't have a valid operation following them. There is one case, however, where the last stage of one operation is used as the operation: it's a matrix followed by a single (this indicates that the matrix operations don't use a loop; the microcode is simply duplicated). Apart from that one case, jumping into the middle of the matrix opcodes doesn't do anything particularly useful.

The shift flag (bit 19) is also active on some operations that aren't officially documented; in some cases there are bugs where flag calculations don't respect the shift flag, etc. This indicates that the flag calculation and the result calculation are independent from each other. It may be that the non-working shift flag variant wasn't deemed important and making it not work meant that the important variant was faster, or it may be a simple bug that they ignored rather than fixed.

There are a few other "bugs". Because the flag calculation and results are calculated independently, there are some edge conditions due to how they are calculated where a bit is or isn't set when it should be. They either hoped that nobody would notice them or maybe they didn't notice them until it was too late. Once the console was in final form, changing it would cause more problems than it fixed.

The GTE can't divide numbers properly; some of the results are horribly inaccurate. They probably did it for speed. I don't know how careful they were with the approximations (it's partly table based), and I remember someone saying that one of the entries was wrong, but it may have been wrong in some circumstances whatever they put in the table.

If an exception occurs when the next instruction is for the GTE then it gets executed; in normal circumstances the exception will be an interrupt.
This appears to be normal for coprocessors, but usually coprocessor opcodes are repeatable as their input and output registers are separate. This isn't the case for the GTE, where some registers are used as both inputs and outputs, so the BIOS interrupt handler tries to detect this and skip running the instruction when the exception finishes. On some revisions the code is wrong and therefore never actually detects the circumstance (I don't know if there is a chance of a false positive). The official library code detects that BIOS revision and patches the code in RAM, but there were games released before they realised, which are probably affected.

I believe this is the reason why you shouldn't put a GTE operation in a branch delay slot. If an interrupt fired, the GTE operation would start, the CPU would notice it's in a branch delay slot and back up the instruction pointer, and the BIOS would look and see that the next instruction isn't for the GTE, so it would run the branch and the GTE instruction would run a second time. The BIOS could emulate the branch, because the LSI chip that the CPU is based on keeps track of whether the branch would be taken or not, but I suspect the person writing the code didn't know that (an R3000 doesn't have this) or they thought it wasn't worth the effort and just telling people not to do it was safer.
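The delay-slot problem described above can be modelled in a few lines of host-side C. This is a sketch of the behaviour as described in this post, not the actual BIOS code: the test "primary opcode 0x12 with bit 25 set" for a GTE command is an assumption, and `is_gte_command` and `resume_pc` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a GTE command is a COP2 word (primary opcode 0x12)
 * with bit 25 set, i.e. (op & 0xFE000000) == 0x4A000000. */
static int is_gte_command(uint32_t op)
{
    return (op & 0xFE000000u) == 0x4A000000u;
}

/* Model of the handler's decision: `epc` is the address the CPU
 * reports for the interrupted instruction, `mem` holds the code
 * starting at address `base`. If the word at `epc` is a GTE command
 * it has already started, so the handler resumes after it. */
static uint32_t resume_pc(const uint32_t *mem, uint32_t base, uint32_t epc)
{
    assert(epc >= base && (epc - base) % 4 == 0);
    uint32_t op = mem[(epc - base) / 4];
    return is_gte_command(op) ? epc + 4 : epc;
}
```

For a plain sequence such as `{ 0x4AA00428, 0x00000000 }` the model resumes one word later, so the GTE command is not repeated. For `{ 0x10000002, 0x4AA00428 }` (a branch with the GTE command in its delay slot) the reported address is the branch, the check sees a non-GTE word, and execution resumes at the branch, so the command in the delay slot runs again.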
I did some research into GTE opcodes and apparently DMPSX has at least 4 of them that are listed nowhere in the inline headers. These are what I could dig from a disassembly of the full thing:
Code:
GTE_tbl[val_to_tbl(0x13FF)]= 0x4AA00428; // undocumented
GTE_tbl[val_to_tbl(0x143F)]= 0x4B70000C; // undocumented
GTE_tbl[val_to_tbl(0x147F)]= 0x4B90003D; // undocumented
GTE_tbl[val_to_tbl(0x14BF)]= 0x4BA0003E; // undocumented
Where val_to_tbl corresponds to this code:
Code:
static __inline u32 val_to_tbl(u32 val)
{
    return (val & 0x3FFC0) >> 6;
}
The parameter passed to it returns a simple index for GTE_tbl, which is an array used by the program to write real cop opcodes over the fake opcodes used by the inline assembly passed on to DMPSX. Does that ring any bells in regard to those undocumented commands?
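For anyone following along, the index arithmetic can be checked on a PC. The sketch below reuses `val_to_tbl` and the four entries quoted above; the table size (4096 slots, from the 12 bits the mask keeps) and the `gte_tbl_lookup` helper are assumptions for illustration, not something taken from DMPSX.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;

/* Index function as quoted in the post: keeps bits 6..17. */
static inline u32 val_to_tbl(u32 val)
{
    return (val & 0x3FFC0) >> 6;
}

/* Assumption: 12 index bits, so 4096 slots cover every index. */
#define GTE_TBL_SIZE 0x1000u
static u32 GTE_tbl[GTE_TBL_SIZE];

/* Fill in only the four entries quoted in the post. */
static void gte_tbl_fill_sample(void)
{
    GTE_tbl[val_to_tbl(0x13FF)] = 0x4AA00428;
    GTE_tbl[val_to_tbl(0x143F)] = 0x4B70000C;
    GTE_tbl[val_to_tbl(0x147F)] = 0x4B90003D;
    GTE_tbl[val_to_tbl(0x14BF)] = 0x4BA0003E;
}

/* Hypothetical lookup: maps a fake opcode to its replacement word,
 * or 0 if the slot was never filled. */
static u32 gte_tbl_lookup(u32 fake)
{
    u32 idx = val_to_tbl(fake);
    assert(idx < GTE_TBL_SIZE);
    return GTE_tbl[idx];
}
```

The four fake opcodes land on consecutive slots 0x4F to 0x52, and the low six bits of the fake opcode do not take part in the index at all.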
Some of those bits appear to be irrelevant to console hardware. The third digit (A/7/9/A) may be used by prototype hardware or a debugger. I believe this is roughly what the PS1 makes of that when it's running code.
Code:
switch((op>>26)&63)
{
case 0x12: // COP2
    switch((op>>21)&31)
    {
    case 0x00: // MFC
    case 0x02: // CFC
    case 0x04: // MTC
    case 0x06: // CTC
    case 0x08: // BC
    case 0x0c: // BC
        break;
    default: // GTE
    {
        int lm=(op&0x400);
        switch(op&63)
        {
        case 0x00: // RTPS (undocumented, probably just nop's in microcode and just takes longer)
        case 0x01: // RTPS
        case 0x06: // NCLIP
        case 0x0c: // OP
        case 0x10: // DPCS
        case 0x11: // INTPL
        case 0x12: // MVMVA
        case 0x13: // NCDS
        case 0x14: // CDP
        case 0x16: // NCDT
        case 0x1a: // DPCL (undocumented, the last part of NCDT is the same as DPCL, so you appear to be jumping into the middle of the microcode)
        case 0x1b: // NCCS
        case 0x1c: // CC
        case 0x1e: // NCS
        case 0x20: // NCT
        case 0x28: // SQR
        case 0x29: // DPCL
        case 0x2a: // DPCT
        case 0x2d: // AVSZ3
        case 0x2e: // AVSZ4
        case 0x30: // RTPT
        case 0x3d: // GPF
        case 0x3e: // GPL
        case 0x3f: // NCCT
            break;
        default: // weird behaviour that hasn't been reverse engineered yet but doesn't appear to do anything particularly useful.
            break;
        }
    }
    }
}
The only open source GTE to pass amidog's stress test (http://psx.amidog.se/doku.php?id=psx:download:gte) can be found at:
https://github.com/mamedev/mame/blob/master/src/devices/cpu/psx/gte.cpp
https://github.com/mamedev/mame/blob/master/src/devices/cpu/psx/gte.h
https://github.com/mamedev/mame/blob/master/src/devices/cpu/psx/psx.cpp
https://github.com/mamedev/mame/blob/master/src/devices/cpu/psx/psx.h
The emulation is only as good as the stress test. I don't know how comprehensive the stress test is, because it's closed source and I haven't looked at what opcodes it runs in a very long time. At some point I'd like to see an emulation that behaves more like the real GTE, by splitting the code up into chunks that are executed a cycle at a time.
It won't help games that follow the rules properly, but right now you can certainly write software that only works on real hardware or only on an emulator & I don't like that. I've never looked at DMPSX; it seemed odd to me to include an extra level of obfuscation. However, DMPSX is supposed to do some simple checks to prevent you doing things that will cause a problem for the GTE, and if you knew the opcodes you wouldn't run DMPSX. If you have to reverse engineer the opcodes before you can bypass DMPSX then you will probably use DMPSX. If I was writing a game using the official tools then I might use it, as long as DMPSX was a 32-bit application (as you can't run 16-bit apps on 64-bit Windows).
Well, DMPSX is nothing more than a small hack at object level after all. I'm not sure how practical it would be to alter binutils and gcc to fire GTE warnings or errors on r3000. DMPSX runs fine even on 64 bit Windows, so yeah, it's not that ancient at all. I was looking into it for two reasons: document a bit more the PSY-Q object/lib format and come up with a good implementation of DMPSX to be used on newer toolchains. Turns out it doesn't help that much with the object format, but at least the program itself is simple enough to make a working reproduction.

EDIT: Turns out those undocumented fake opcodes in DMPSX were just copies of other instructions.
Code:
GTE_tbl[val_to_tbl(0x13FF)]= 0x4AA00428; // not in macro [sqr0]
GTE_tbl[val_to_tbl(0x143F)]= 0x4B70000C; // not in macro [op0]
GTE_tbl[val_to_tbl(0x147F)]= 0x4B90003D; // not in macro [gpf0]
GTE_tbl[val_to_tbl(0x14BF)]= 0x4BA0003E; // not in macro [gpl0]
However, there are still a few more that aren't copies:
Code:
GTE_tbl[val_to_tbl(0x47F)] = 0x4A484012; // not used in macro
GTE_tbl[val_to_tbl(0x4BF)] = 0x4A48C012; // not used in macro
GTE_tbl[val_to_tbl(0x4FF)] = 0x4A494012; // not used in macro
GTE_tbl[val_to_tbl(0x53F)] = 0x4A49C012; // not used in macro
GTE_tbl[val_to_tbl(0x8BF)] = 0x4A4A4012; // not used in macro
GTE_tbl[val_to_tbl(0x8FF)] = 0x4A4AC012; // not used in macro
GTE_tbl[val_to_tbl(0x93F)] = 0x4A4B4012; // not used in macro
GTE_tbl[val_to_tbl(0x97F)] = 0x4A4BC012; // not used in macro
GTE_tbl[val_to_tbl(0xCFF)] = 0x4A4C4012; // not used in macro
GTE_tbl[val_to_tbl(0xD3F)] = 0x4A4CC012; // not used in macro
GTE_tbl[val_to_tbl(0xD7F)] = 0x4A4D4012; // not used in macro
GTE_tbl[val_to_tbl(0xDBF)] = 0x4A4DC012; // not used in macro
These are all mvmva, there are a lot of variants of that because it has more inputs than any other gte instruction. It's possible that DMPSX supports them all, but the macros don't. Or maybe there is a generic mvmva macro.
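A quick way to see which variants those words are is to pull the fields apart. In the sketch below the function number (low 6 bits), the lm bit (0x400) and the shift flag (bit 19) come from earlier posts in this thread; the positions used for mx, v and cv (bits 17-18, 15-16 and 13-14) are an assumption based on commonly published GTE encodings, and `GteCmd` / `gte_decode` are made-up names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned funct; /* op & 63: which GTE operation            */
    unsigned sf;    /* bit 19: shift flag                       */
    unsigned mx;    /* bits 17-18: matrix select (assumed)      */
    unsigned v;     /* bits 15-16: vector select (assumed)      */
    unsigned cv;    /* bits 13-14: translation select (assumed) */
    unsigned lm;    /* bit 10: the 0x400 flag                   */
} GteCmd;

/* Split a COP2 command word into its fields. */
static GteCmd gte_decode(uint32_t op)
{
    GteCmd c;
    assert(((op >> 26) & 63) == 0x12); /* must be a COP2 word */
    c.funct = op & 63;
    c.sf    = (op >> 19) & 1;
    c.mx    = (op >> 17) & 3;
    c.v     = (op >> 15) & 3;
    c.cv    = (op >> 13) & 3;
    c.lm    = (op >> 10) & 1;
    return c;
}
```

Under that assumed layout, the twelve words from the previous post all decode to function 0x12 with sf = 1, cv = 2 and lm = 0, and step through mx = 0..2 and v = 0..3, e.g. 0x4A484012 gives mx = 0, v = 0 and 0x4A4DC012 gives mx = 2, v = 3.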
Those are definitely mvmva variations, yet none of them correspond to the usual way fake opcodes are assembled or to any known encoded versions. Usually a generic mvmva with all parameters corresponds to 0x13bf|sf<<25|mx<<23|v<<21|cv<<19|lm<<18, so it's definitely something else. I should probably look in some older version of PSY-Q; they could be deprecated instructions internally kept for legacy.
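Spelled out as C, the generic fake opcode formula looks like this. The formula and `val_to_tbl` are taken from the posts above; the function name `fake_mvmva` and the range checks on the parameters are assumptions added for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Index function quoted earlier in the thread. */
static inline uint32_t val_to_tbl(uint32_t val)
{
    return (val & 0x3FFC0) >> 6;
}

/* Build the generic fake mvmva opcode:
 * 0x13bf | sf<<25 | mx<<23 | v<<21 | cv<<19 | lm<<18
 * Assumption: sf and lm are 1-bit fields, mx/v/cv are 2-bit fields. */
static uint32_t fake_mvmva(unsigned sf, unsigned mx, unsigned v,
                           unsigned cv, unsigned lm)
{
    assert(sf < 2 && mx < 4 && v < 4 && cv < 4 && lm < 2);
    return 0x13bfu
         | ((uint32_t)sf << 25)
         | ((uint32_t)mx << 23)
         | ((uint32_t)v  << 21)
         | ((uint32_t)cv << 19)
         | ((uint32_t)lm << 18);
}
```

Since all five parameters sit at bit 18 and above, and `val_to_tbl` only keeps bits 6 to 17, every parameter combination of the generic form maps to the same table index (0x4E). The twelve unexplained entries use other indices such as `val_to_tbl(0x47F)`, which fits the observation that they are not built the usual way.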