To give newer developers some idea of what can be done with DECI2, I thought I would share some recent bug-hunting cases. I usually don't gather data from my (long) debugging sessions because I am busy cursing at the code... but today I recorded some of it! It's a boring history lesson, but it may help if you're struggling to understand what a TOOL can be used for.

For debugging, I leave the debugging information intact (don't use ee-strip, compile with the -g option and don't pass -s to the linker). As DECI2 will clear the debugging information when the IOP is rebooted, I issued this command in dsidb before running HDLGameInstaller (which had its own IOP reboot process disabled):

Code:
mstart rom0:UDNL

This causes the IOP to be rebooted (and hence the DebugStation mode option takes effect), yet the EE debugging information is left intact.

On 2017/02/11, a prototype version of HDLGameInstaller would randomly hang. After leaving it to install games on a TOOL, this showed up:

Code:
# EEKERNEL[SignalSema] : Semaphore maxCount overflow.
# SemaphoreID:0 count:2 maxCount:1
*** Unexpected reply - type=BREAKR result=EXCEPTION
*** Target program stopped. Check the location by dr command.
dsedb S> dr
 at=80020000 v0-1=00000000,00000108 a0-3=00000000,00000185,00000004,00000044
 t0-7=00000000,a00269d0,00000040,a0026980, 20486f80,a0000000,00000004,0000001f
 s0-7=00040000,00040000,80025980,004a72e0, 00a04940,00030400,00000080,00030400
 t8=80025980 t9=00030400 k0=80016ed8 k1=ffff8000 gp=00396770 sp=00399f30 fp=00399fe0 ra=00202678
 lo=00000000 hi=00000001 sa=00000002
 PC=002b4e28 badvaddr=00000228 badpaddr=018a8400
 $cause = 0x00038424 [ CE0 EXC2=Debug IP7 IP2 EXC="Breakpoint" ]
 $status = 0x70030c13 [ Cu210 EDI EIE IM3 IM2 KSU=User EXL IE ]
<SignalSema>:
  0x002b4e20: 0x24030042 li $v1,0x42
  0x002b4e24: 0x0000000c syscall 0x0
->0x002b4e28: 0x0000000d break 0x0
  0x002b4e2c: 0x00000000 nop
<WaitSema>:
  0x002b4e30: 0x24030044 li $v1,0x44
  0x002b4e34: 0x0000000c syscall 0x0
  0x002b4e38: 0x03e00008 jr $ra

Oddly, semaphore 0 isn't used by the I/O thread (the thread in which the exception occurred). It appears that something had overwritten the IOState structure, which the I/O thread uses to store the IDs of its semaphores.

IOState structure declaration in C:

Code:
struct IOState {
    void *buffer;
    void *unpackbuf;
    struct BuffDesc *bd;
    unsigned short int WritePtr, ReadPtr;
    unsigned short int bufmax, nbufs;
    unsigned char state, command;
    unsigned int remaining;
    unsigned int opt;
    int id, CmdAckSema, inBufSema, outBufSema, ioFD;
};

Healthy IOState structure:

Code:
dsedb R> dw IOState
0x00399ac0: 0x00a04940 0x003b3940 0x004a72e0 0x00010001
0x00399ad0: 0x00030080 0x00000100 0x00000000 0x00000000
0x00399ae0: 0x0000001c 0x00000012 0x00000013 0x00000014
0x00399af0: 0x00000007 0x00000000 0x00000000 0x00000000
0x00399b00: 0x00000000 0x00000000 0x00000000 0x00000000
0x00399b10: 0x00000002 0x01000001 0x00000000 0x00000000
0x00399b20: 0x70070c00 0x00000000 0x00000000 0x00000000
0x00399b30: 0x00000000 0x00000000 0x00000000 0x00000000
0x00399b40: 0x00010000 0x00000000 0x9c2dc000 0xfc017c00
0x00399b50: 0x00000005 0x01000001 0x00000000 0x00000000
0x00399b60: 0x80020000 0xffffffff 0x03027559 0x02902140
0x00399b70: 0x0000000a 0x00000000 0x00000000 0x00000000
0x00399b80: 0x00000104 0x00000000 0x00000000 0x00000000
0x00399b90: 0x0000000a 0x00000000 0x00000000 0x00000000
0x00399ba0: 0x00000185 0x00000000 0x00000000 0x00000000
0x00399bb0: 0x00000004 0x00000000 0x00000000 0x00000000

Corrupted structure:

Code:
dsedb S> dw IOState
0x00399ac0: 0x003a0000 0x00000000 0x00000000 0x00000001
0x00399ad0: 0x00399d30 0x00000000 0x03027559 0x02902140
0x00399ae0: 0x20486f80 0x00000000 0x00000000 0x00000000
0x00399af0: 0x00000040 0x00000000 0x00000000 0x00000000
0x00399b00: 0x20486f80 0x00000000 0x00000000 0x00000000
0x00399b10: 0x00000040 0x00000000 0x00000000 0x00000000
0x00399b20: 0x00478fc0 0x00000000 0x00000000 0x00000000
0x00399b30: 0x0005cda0 0x00000000 0x00000000 0x00000000
0x00399b40: 0x00000010 0x00000000 0x00000000 0x00000000
0x00399b50: 0x00000002 0x01000001 0x00000000 0x00000000
0x00399b60: 0x80020000 0xffffffff 0x00000000 0x00000000
0x00399b70: 0x00000017 0x00000000 0x00000000 0x00000000
0x00399b80: 0x00000104 0x00000000 0x03027559 0x02902140
0x00399b90: 0x00000017 0x00000000 0x00000000 0x00000000
0x00399ba0: 0x00000185 0x00000000 0x00000000 0x00000000
0x00399bb0: 0x00000004 0x00000000 0x00000000 0x00000000

IOState exists at 0x00399ac0, while the stack of the I/O thread begins at 0x00399b00. It's worth noting that when HDLGameInstaller begins transfers over the network, the visible region starting from 0x00399b00 actually contains zeros. This means that nearly all of the stack (or perhaps even more than that) was used.

At first, I didn't see a pattern in the garbage values, until I noticed some values that looked like function addresses (e.g. 0x20486f80, which was the address of one of the SIFRPC functions). I verified that they were addresses with the di command (di 0x00486f80).

Because the I/O thread's stack exists right after the IOState structure, it was my one and only (valid) suspect. As putting a hardware breakpoint (with hbp) at the start of the IOState structure resulted in lots of unnecessary breaks, I put it at 0x00399b10 instead, close to the end of the I/O thread's stack.

Eventually, the kernel can be seen writing there:

Code:
dsedb S> dr
 at=00000003 v0-1=6c180002,b000c400 a0-3=00000185,00000185,00000004,00000044
 t0-7=00000000,a0025c50,00000040,a0025c00, 20486f80,a0000000,00000004,0000001f
 s0-7=00000040,20486f80,00000050,00000002, 0005cda0,00478fc0,00000000,00234420
 t8=80025980 t9=00030400 k0=00399b10 k1=ffff8000 gp=00396770 sp=8001d480 fp=00478fc0 ra=80003d04
 lo=00000001 hi=00000000 sa=00000003
 PC=80003e5c badvaddr=00000228 badpaddr=018a8400
 $cause = 0x00038020 [ CE0 EXC2=Debug IP7 EXC="SYSCALL" ]
 $status = 0x70030c04 [ Cu210 EDI EIE IM3 IM2 KSU=Kernel ERL ]
  0x80003e54: 0xe75f027c swc1 $fpr31,0x27c($k0)
  0x80003e58: 0x00000828 mfsa $at
->0x80003e5c: 0xaf410000 sw $at,0($k0)
  0x80003e60: 0x4441f800 cfc1 $at,$fcr31
  0x80003e64: 0xaf410004 sw $at,4($k0)
  0x80003e68: 0x3c018000 lui $at,0x8000
  0x80003e6c: 0x44810800 mtc1 $at,$fpr1

I put a breakpoint right after the start of fileXioWrite, which I believed to be one of the deepest parts of the I/O thread. The stack pointer was at 0x00399f00, which was still quite far from the end of the stack. But continuing down fileXioWrite, the numbers started to add up:

SifCallRpc: +0xA0 bytes
SifSendCmd: +0x10 bytes
_SifSendCmd: +0xC0 bytes
Total: 0x170

That would leave the stack pointer at 0x00399d90, within _SifSendCmd. It's still quite some distance from IOState, but the PCSX2 FPS2BIOS code shows that the EE kernel saves the context of each thread onto its own stack as threads are swapped in and out. 0x280 bytes are used for this, which makes the (currently known) deepest address of the stack 0x00399b10 after a context switch within _SifSendCmd.

There are still 16 bytes to the end of the I/O thread's stack, but I assumed that it's close enough for a stack overflow. Each register preserved on the EE takes up 16 bytes, so minor changes to the code could result in that happening.

There should also be some part of the I/O thread that goes deeper than this (since an overflow actually happened), but I didn't want to spend even more time trying to locate it.

*** SIFRPC call from the IOP gets deadlocked:

I don't have any terminal dumps for this case, but basically our homebrew PS2SDK used to lack the fix for iWakeupThread, which is used by the EE SIFRPC library to wake up the RPC server thread. It appears that iWakeupThread on the EE is bugged, whereby it will not increment the wakeup request counter of a thread if that thread is in RUN state. SONY had worked around this issue in their newer software.

I got a first glimpse of this issue when I left the EE debugging features on, which eventually warned me about an invalid thread state for iWakeupThread. It's puzzling, because RUN state wasn't documented to be an "invalid state". The telltale signs of the glitch were the RPC server thread on the IOP waiting for a response from the EE, and the EE RPC server thread entering (and staying in) WAIT state.

So I copied the workaround from SCE, which did the following:

1. If the current thread is not the running thread, call iWakeupThread.
2. If the current thread is the running thread, suspend it before calling iWakeupThread. Resume the thread afterwards. If the current thread is no longer the running thread, then there were other threads sharing the same priority as the running thread (and suspending/resuming the thread caused another thread to be preempted). Rotate the thread ready queue until the current thread once again becomes the running thread.

*** Running the HDD Browser on the T10000:

This may interest some of you, but I was doing it to debug FHDB. For some reason, it stopped being able to boot my copy of the HDD Browser. I modified its ATAD module a very long time ago, so I am sure it works...

Anyway, running my program that loads a decrypted copy of the HDD Browser from the HDD yields this ominous line via dsidb, while dsedb is stuck in a loop around SifGetReg:

Code:
loadmodule: fname rom0:SYSCLIB args 0 arg
loadmodule: id 31, ret 1
loadmodule: fname rom0:UDNL args 11 arg img0:
loadmodule: id 33, ret 1
UDNL returned 1 (not resident)!

It can respond!? How's that possible?

What happened is that the T10000's late ROM had UDNL's device blacklist replaced with a whitelist, so the HDD Browser's custom IOP reboot stops working because UDNL sees the "img0" device as an illegal device. Note that although only "img0:" is visible, the full argument is "img0: img1:". There is a NUL terminator between arguments for IOP modules.

To jump over this wall, I manually loaded UDNL:

Code:
dsidb R> mload rom0:UDNL img0:

...got its address:

Code:
dsidb R> mlist
Id Begin  End   Size (Text Data Bss) Ver Name
 1 830-   190f  10e0  1070 50   20   2.3 System_Memory_Manager
...
22 e7730- e958f 1e60  1cb0 1b0  0    0.0

...and set a breakpoint on that evil function, before starting the module in debug mode:

Code:
dsidb R> bp e7730
$BP3=0x000e7730 init=0x1 curr=0x1 # enabled, auto-init
dsidb R> mstart -d
*** Exception
 at=00020004 v0-1=0000003c,00000069 a0-3=007fee7a,000e9401,00000069,000e7700
 t0-7=00000018,00000002,00000002,00000002, 00000000,00000000,00000000,00000000
 s0-7=007fee60,007fee68,007fee64,00000001, 007fedd8,007feda8,00000420,00000000
 t8=00000000 t9=00000000 k0=000171d4 k1=00000000 gp=000f1580 sp=007fed98 fp=007fedf8 ra=000e79b8
 lo=00000000 hi=00000000
 PC=000e7730 bada=ffffffff
 $cr=0x00000024 [ CE0 Breakpoint ]
 $sr=0x00000404 [ IM0 IEp ]
  0x000e7728: 0xafa5006c sw $a1,0x6c($sp)
  0x000e772c: 0x08039f70 j 0x000e7dc0 # <+0x690>
->0x000e7730| 0x00803021 move $a2,$a0
<+0x04>: 0x000e7734: 0x24020020 li $v0,0x20
<+0x08>: 0x000e7738: 0x80c70000 lb $a3,0($a2)
<+0x0c>: 0x000e773c: 0x00000000 nop
<+0x10>: 0x000e7740: 0x10e2fffd beq $a3,$v0,0x000e7738 # <+0x08>
dsidb S>

It's a coincidence that the function exists at the very start of UDNL's text section. I cloned this module before, which is how I knew it exists there.

As there are multiple images specified, it is easier to disable this function. This gets it to return immediately, with an "OK" as the return value.

Code:
dsidb S> as $PC jr $ra
dsidb S> step
dsidb S> as $PC addu $v0, $zero, $zero
dsidb S> cont

The browser loads on my DTL-T10000, with it being identified as a SCPH-10000.

So easy, right? Now just sink in about 4 hours of trying to figure out why it did not work, and add in cursing and swearing. I missed the part about "img0: img1:" and spent about 2-3 hours figuring out why its MCSERV module was missing (since its sceMcInit function was getting stuck at binding).