DECI2 debugging adventures

sp193 · Feb 11, 2017

To give the newer developers out there some ideas of what can be done with DECI2, I thought that I would share some recent bug-hunting cases. I usually don't gather data from my (long) debugging sessions because I am busy cursing at the code... but today I wrote some of it down!

It's a boring history lesson, but I guess that it may help if you're struggling to understand what a TOOL can be used for.

For debugging, I left the debugging information intact (don't use ee-strip, compile with the -g option and don't pass -s to the linker). As DECI2 will clear debugging information when the IOP is rebooted, I issued this command in dsidb before running HDLGameInstaller (which had its IOP reboot process disabled):
Code:
mstart rom0:UDNL
This will cause the IOP to be rebooted (so the DebugStation mode option takes effect), yet the EE debugging information is left intact.

On 2017/02/11, a prototype version of HDLGameInstaller would randomly hang. After leaving it to install games on a TOOL, this was visible:
Code:
# EEKERNEL[SignalSema] : Semaphore maxCount overflow.
#   SemaphoreID:0 count:2 maxCount:1
         
*** Unexpected reply - type=BREAKR result=EXCEPTION
*** Target program stopped. Check the location by dr command.
dsedb S> dr
 at=80020000  v0-1=00000000,00000108  a0-3=00000000,00000185,00000004,00000044
 t0-7=00000000,a00269d0,00000040,a0026980, 20486f80,a0000000,00000004,0000001f
 s0-7=00040000,00040000,80025980,004a72e0, 00a04940,00030400,00000080,00030400
 t8=80025980 t9=00030400   k0=80016ed8 k1=ffff8000   gp=00396770 sp=00399f30
 fp=00399fe0 ra=00202678   lo=00000000 hi=00000001   sa=00000002 PC=002b4e28
 badvaddr=00000228 badpaddr=018a8400
 $cause   = 0x00038424 [ CE0 EXC2=Debug IP7 IP2 EXC="Breakpoint" ]
 $status  = 0x70030c13 [ Cu210 EDI EIE IM3 IM2 KSU=User EXL IE ]
   <SignalSema>:
  0x002b4e20: 0x24030042  li      $v1,0x42
  0x002b4e24: 0x0000000c  syscall 0x0
->0x002b4e28: 0x0000000d  break   0x0
  0x002b4e2c: 0x00000000  nop
   <WaitSema>:
  0x002b4e30: 0x24030044  li      $v1,0x44
  0x002b4e34: 0x0000000c  syscall 0x0
  0x002b4e38: 0x03e00008  jr      $ra
Oddly, semaphore 0 isn't used by the I/O thread (the thread in which the exception occurred).
It appears that something has overwritten the IOState structure, which the I/O thread uses to store the IDs of the semaphores that it uses.

IOState structure declaration in C:
Code:
struct IOState {
   void *buffer;
   void *unpackbuf;
   struct BuffDesc *bd;
   unsigned short int WritePtr, ReadPtr;
   unsigned short int bufmax, nbufs;
   unsigned char state, command;
   unsigned int remaining;
   unsigned int opt;
   int id, CmdAckSema, inBufSema, outBufSema, ioFD;
};
Healthy IOState structure:
Code:
dsedb R> dw IOState
 0x00399ac0: 0x00a04940 0x003b3940  0x004a72e0 0x00010001
 0x00399ad0: 0x00030080 0x00000100  0x00000000 0x00000000
 0x00399ae0: 0x0000001c 0x00000012  0x00000013 0x00000014
 0x00399af0: 0x00000007 0x00000000  0x00000000 0x00000000
 0x00399b00: 0x00000000 0x00000000  0x00000000 0x00000000
 0x00399b10: 0x00000002 0x01000001  0x00000000 0x00000000
 0x00399b20: 0x70070c00 0x00000000  0x00000000 0x00000000
 0x00399b30: 0x00000000 0x00000000  0x00000000 0x00000000
 0x00399b40: 0x00010000 0x00000000  0x9c2dc000 0xfc017c00
 0x00399b50: 0x00000005 0x01000001  0x00000000 0x00000000
 0x00399b60: 0x80020000 0xffffffff  0x03027559 0x02902140
 0x00399b70: 0x0000000a 0x00000000  0x00000000 0x00000000
 0x00399b80: 0x00000104 0x00000000  0x00000000 0x00000000
 0x00399b90: 0x0000000a 0x00000000  0x00000000 0x00000000
 0x00399ba0: 0x00000185 0x00000000  0x00000000 0x00000000
 0x00399bb0: 0x00000004 0x00000000  0x00000000 0x00000000
Corrupted structure:
Code:
dsedb S> dw IOState
 0x00399ac0: 0x003a0000 0x00000000  0x00000000 0x00000001
 0x00399ad0: 0x00399d30 0x00000000  0x03027559 0x02902140
 0x00399ae0: 0x20486f80 0x00000000  0x00000000 0x00000000
 0x00399af0: 0x00000040 0x00000000  0x00000000 0x00000000
 0x00399b00: 0x20486f80 0x00000000  0x00000000 0x00000000
 0x00399b10: 0x00000040 0x00000000  0x00000000 0x00000000
 0x00399b20: 0x00478fc0 0x00000000  0x00000000 0x00000000
 0x00399b30: 0x0005cda0 0x00000000  0x00000000 0x00000000
 0x00399b40: 0x00000010 0x00000000  0x00000000 0x00000000
 0x00399b50: 0x00000002 0x01000001  0x00000000 0x00000000
 0x00399b60: 0x80020000 0xffffffff  0x00000000 0x00000000
 0x00399b70: 0x00000017 0x00000000  0x00000000 0x00000000
 0x00399b80: 0x00000104 0x00000000  0x03027559 0x02902140
 0x00399b90: 0x00000017 0x00000000  0x00000000 0x00000000
 0x00399ba0: 0x00000185 0x00000000  0x00000000 0x00000000
 0x00399bb0: 0x00000004 0x00000000  0x00000000 0x00000000
IOState exists at 0x00399ac0, while the stack of the I/O thread begins at 0x00399b00. It's worth noting that when HDLGameInstaller begins transferring over the network, the visible region starting from 0x00399b00 actually contains zeros. This means that nearly all of the stack (or perhaps even more than that) was used.
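
To line the dump words up with the fields of IOState, a host-side mirror of the structure can help. This is only a sketch: `IOStateMirror`, `IOSTATE_BASE` and `field_addr` are names made up for illustration, and the pointers are replaced with `uint32_t` on the assumption that EE pointers are 32 bits wide and that the compiler pads the structure the usual way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side mirror of struct IOState.
   Assumption: EE pointers are 32 bits wide, so they are modelled as uint32_t
   to keep the layout the same on a 64-bit host. */
struct IOStateMirror {
    uint32_t buffer;
    uint32_t unpackbuf;
    uint32_t bd;
    uint16_t WritePtr, ReadPtr;
    uint16_t bufmax, nbufs;
    uint8_t state, command; /* followed by 2 bytes of padding */
    uint32_t remaining;
    uint32_t opt;
    int32_t id, CmdAckSema, inBufSema, outBufSema, ioFD;
};

/* Under the assumptions above, the structure ends well before 0x00399b00. */
static_assert(sizeof(struct IOStateMirror) == 52, "unexpected padding");

/* Address printed by "dw IOState" in the dumps. */
#define IOSTATE_BASE 0x00399ac0u

/* Target address of a field, given its offset within the mirror. */
static uint32_t field_addr(size_t offset)
{
    return IOSTATE_BASE + (uint32_t)offset;
}
```

With this layout, `field_addr(offsetof(struct IOStateMirror, CmdAckSema))` comes out as 0x00399ae4, and inBufSema and outBufSema follow at 0x00399ae8 and 0x00399aec. Those words hold 0x12, 0x13 and 0x14 in the healthy dump and 0 in the corrupted one, which fits the kernel complaining about SemaphoreID:0.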

At first, I didn't see a pattern to the garbage values, until I noticed some values that looked like function addresses (i.e. 0x20486f80, which was the address of one of the SIFRPC functions). I verified that they were addresses, with the di command (di 0x00486f80).
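
The address given to di has the top bits of 0x20486f80 removed. A small sketch of that conversion follows, assuming that 0x20000000 and 0x30000000 are the uncached and uncached-accelerated views of EE main RAM. `EE_SEGMENT_MASK` and `ee_strip_segment` are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: bits 29-28 only select the uncached (0x20000000) or
   uncached-accelerated (0x30000000) view of the same EE main RAM. */
#define EE_SEGMENT_MASK 0x30000000u

/* Hypothetical helper: map a user-space EE address to its cached alias. */
static uint32_t ee_strip_segment(uint32_t addr)
{
    /* Kernel-space addresses (0x80000000 and up) are out of scope here. */
    assert((addr & 0x80000000u) == 0);
    return addr & ~EE_SEGMENT_MASK;
}
```

`ee_strip_segment(0x20486f80u)` yields 0x00486f80, the address that was passed to di.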

Because the I/O thread's stack exists right after the IOState structure, it was my one and only (valid) suspect. As putting a hardware breakpoint (with hbp) at the start of the IOState structure resulted in lots of unnecessary breaks, I put it at 0x00399b10 instead - close to the end of the I/O thread's stack.

Eventually, the kernel can be seen writing there:
Code:
dsedb S> dr
 at=00000003  v0-1=6c180002,b000c400  a0-3=00000185,00000185,00000004,00000044
 t0-7=00000000,a0025c50,00000040,a0025c00, 20486f80,a0000000,00000004,0000001f
 s0-7=00000040,20486f80,00000050,00000002, 0005cda0,00478fc0,00000000,00234420
 t8=80025980 t9=00030400   k0=00399b10 k1=ffff8000   gp=00396770 sp=8001d480
 fp=00478fc0 ra=80003d04   lo=00000001 hi=00000000   sa=00000003 PC=80003e5c
 badvaddr=00000228 badpaddr=018a8400
 $cause   = 0x00038020 [ CE0 EXC2=Debug IP7 EXC="SYSCALL" ]
 $status  = 0x70030c04 [ Cu210 EDI EIE IM3 IM2 KSU=Kernel ERL ]
  0x80003e54: 0xe75f027c  swc1    $fpr31,0x27c($k0)
  0x80003e58: 0x00000828  mfsa    $at
->0x80003e5c: 0xaf410000  sw      $at,0($k0)
  0x80003e60: 0x4441f800  cfc1    $at,$fcr31
  0x80003e64: 0xaf410004  sw      $at,4($k0)
  0x80003e68: 0x3c018000  lui     $at,0x8000
  0x80003e6c: 0x44810800  mtc1    $at,$fpr1
I put a breakpoint right after the start of fileXioWrite, which I believed to be one of the deepest parts of the I/O thread. The stack was at 0x00399f00, which was still quite far from the end of the stack.
But continuing down fileXioWrite, the numbers started to add up:
SifCallRpc: +0xA0 bytes
SifSendCmd: +0x10 bytes
_SifSendCmd: +0xC0 bytes
Total: 0x170

That would leave the stack at 0x00399d90, within _SifSendCmd. It's still quite some distance from IOState, but the PCSX2 FPS2BIOS code shows that the EE kernel saves the context of each thread onto its own stack as threads are swapped in and out. 0x280 bytes are used for this process, which puts the (currently known) deepest address of the stack at 0x00399b10 after a context switch within _SifSendCmd.

There are still 16 bytes to the end of the I/O thread's stack, but I assumed that it's close enough for a stack overflow. Each register preserved on the EE takes up 16 bytes, so minor changes to the code could result in that happening. There should also be some part of the I/O thread that goes deeper than this (since an overflow actually happened), but I didn't want to spend even more time trying to locate it.
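
The arithmetic above can be re-checked with a few lines of C. This is just a sketch of the sums: the constants come from the numbers quoted in this post, while `sp_after_frames` and `headroom` are helper names made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_LIMIT       0x00399b00u /* lowest address of the I/O thread's stack */
#define CONTEXT_SAVE_SIZE 0x280u      /* bytes the kernel pushes on a context switch */

/* Stack pointer after descending through a chain of calls.
   The stack grows downwards, so each frame size is subtracted. */
static uint32_t sp_after_frames(uint32_t sp, const uint32_t *frame_sizes, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++)
        sp -= frame_sizes[i];
    return sp;
}

/* Bytes left between the stack pointer and the end of the stack. */
static uint32_t headroom(uint32_t sp, uint32_t limit)
{
    assert(sp >= limit); /* anything lower has already overflowed */
    return sp - limit;
}
```

Starting from 0x00399f00 with frames of 0xA0, 0x10 and 0xC0 bytes gives 0x00399d90. Subtracting `CONTEXT_SAVE_SIZE` gives 0x00399b10, and `headroom` then reports the 16 bytes mentioned above.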

***
SIFRPC call from the IOP gets deadlocked:

I don't have any dumps of the terminal for this case, but basically our homebrew PS2SDK used to lack the fix for iWakeupThread, which is used by the EE SIFRPC library to wake up the RPC server thread.
It appears that iWakeupThread on the EE is bugged, whereby it will not increment the wakeup request counter of a thread if it is in RUN state. SONY had worked around this issue in their newer software.

I got a first glimpse of this issue when I left the EE debugging features on, which eventually warned me about an invalid thread state for iWakeupThread. It's puzzling because RUN state wasn't documented to be an "invalid state".
The telltale sign of the glitch was the RPC server thread on the IOP waiting for a response from the EE, while the EE RPC server thread entered (and stayed in) WAIT state.

So I copied the workaround from SCE, which did the following:
1. If the current thread is not the running thread, call iWakeupThread.
2. If the current thread is the running thread, suspend it before calling iWakeupThread. Resume the thread afterwards. If the current thread is no longer the running thread, then there were other threads sharing the same priority as the running thread (and suspending/resuming the thread caused another thread to be preempted). Rotate the thread ready queue until the current thread once again becomes the running thread.
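
The idea behind the two steps can be shown with a toy model. This is not the EE kernel or the PS2SDK code: every name below (`toy_thread`, `toy_buggy_wakeup`, `toy_fixed_wakeup`) is invented, WAIT is treated as "sleeping", and the ready-queue rotation from step 2 is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Toy thread states; a simplification, not the real kernel's state set. */
enum toy_status { TOY_RUN, TOY_READY, TOY_WAIT, TOY_SUSPEND };

struct toy_thread {
    enum toy_status status;
    int wakeup_count; /* pending wakeup requests */
};

/* Models the bug described above: a wakeup request aimed at a thread in
   RUN state is dropped instead of being counted. Returns 0 if the request
   was honoured and -1 if it was lost. */
static int toy_buggy_wakeup(struct toy_thread *t)
{
    assert(t != NULL);
    switch (t->status) {
    case TOY_RUN:
        return -1;              /* request silently lost */
    case TOY_WAIT:
        t->status = TOY_READY;  /* sleeping thread is woken up */
        return 0;
    default:
        t->wakeup_count++;      /* request is remembered for later */
        return 0;
    }
}

/* Models the workaround: take the thread out of RUN state before
   requesting the wakeup, then put it back. */
static int toy_fixed_wakeup(struct toy_thread *t)
{
    int result;

    assert(t != NULL);
    if (t->status != TOY_RUN)
        return toy_buggy_wakeup(t);  /* step 1 */

    t->status = TOY_SUSPEND;         /* step 2: suspend... */
    result = toy_buggy_wakeup(t);    /* ...so the request gets counted... */
    t->status = TOY_RUN;             /* ...then resume (no queue rotation here) */
    return result;
}
```

In this model, a thread in RUN state keeps a wakeup_count of 0 after `toy_buggy_wakeup`, but ends up with 1 after `toy_fixed_wakeup`.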

sp193 · Jun 6, 2018

Running the HDD Browser on the T10000

This may interest some of you, but I was doing it to debug FHDB. For some reason, it stopped being able to boot my copy of the HDD Browser. I modified its ATAD module a very long time ago, so I am sure it works...

Anyway, running my program that loads a decrypted copy of the HDD Browser from the HDD yields this ominous line via dsidb, while dsedb is stuck in a loop around SifGetReg:
Code:
loadmodule: fname rom0:SYSCLIB args 0 arg
loadmodule: id 31, ret 1
loadmodule: fname rom0:UDNL args 11 arg img0:
loadmodule: id 33, ret 1
UDNL returned 1 (not resident)! It can return!? How's that possible? What happened is that the T10000's late ROM had UDNL's device blacklist replaced with a whitelist, so the HDD Browser's custom IOP reboot stops working because UDNL sees the "img0" device as an illegal device.

Note that although only "img0:" is visible, the full argument is "img0: img1:".
There is a NUL terminator between arguments for IOP modules, so printing the argument block as a string stops after the first one.
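
A quick sketch of why only the first argument shows up, and how the rest can be reached. `count_args` and `nth_arg` are hypothetical helpers written for this post; they are not part of any IOP module API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the n-th string in a block of NUL-separated arguments,
   or NULL if there are fewer than n + 1 of them. */
static const char *nth_arg(const char *args, size_t arglen, size_t n)
{
    size_t pos = 0;

    assert(args != NULL);
    while (pos < arglen) {
        const char *cur = args + pos;
        const char *nul = memchr(cur, '\0', arglen - pos);

        if (n == 0)
            return cur;
        n--;
        if (nul == NULL)
            break;                      /* last string had no terminator */
        pos = (size_t)(nul - args) + 1; /* skip past the separator */
    }
    return NULL;
}

/* Count the strings in the block. */
static size_t count_args(const char *args, size_t arglen)
{
    size_t count = 0;

    while (nth_arg(args, arglen, count) != NULL)
        count++;
    return count;
}
```

For a block such as `"img0:\0img1:"`, `strlen` reports 5 and a `%s` print shows only "img0:", while `count_args` finds two strings and `nth_arg(..., 1)` returns "img1:".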

To jump over this wall, I manually loaded UDNL:
Code:
dsidb R> mload rom0:UDNL img0:
...got its address:
Code:
dsidb R> mlist
 Id  Begin    End  Size (Text  Data   Bss) Ver  Name
  1    830-  190f  10e0  1070    50    20  2.3  System_Memory_Manager
  ...
 22  e7730- e958f  1e60  1cb0   1b0     0  0.0 
...and set a breakpoint on that evil function, before starting the module in debug mode:
Code:
dsidb R> bp e7730            
 $BP3=0x000e7730 init=0x1 curr=0x1 # enabled, auto-init
dsidb R> mstart -d
*** Exception
 at=00020004  v0-1=0000003c,00000069  a0-3=007fee7a,000e9401,00000069,000e7700
 t0-7=00000018,00000002,00000002,00000002, 00000000,00000000,00000000,00000000
 s0-7=007fee60,007fee68,007fee64,00000001, 007fedd8,007feda8,00000420,00000000
 t8=00000000 t9=00000000   k0=000171d4 k1=00000000   gp=000f1580 sp=007fed98
 fp=007fedf8 ra=000e79b8   lo=00000000 hi=00000000   PC=000e7730 bada=ffffffff
 $cr=0x00000024 [ CE0 Breakpoint ]
 $sr=0x00000404 [ IM0 IEp ]
  0x000e7728: 0xafa5006c  sw      $a1,0x6c($sp)
  0x000e772c: 0x08039f70  j       0x000e7dc0  # <+0x690>
->0x000e7730| 0x00803021  move    $a2,$a0
   <+0x04>:
  0x000e7734: 0x24020020  li      $v0,0x20
   <+0x08>:
  0x000e7738: 0x80c70000  lb      $a3,0($a2)
   <+0x0c>:
  0x000e773c: 0x00000000  nop
   <+0x10>:
  0x000e7740: 0x10e2fffd  beq     $a3,$v0,0x000e7738  # <+0x08>
dsidb S>
It's a coincidence that the function exists at the very start of UDNL's text section. I cloned this module before, which was how I knew it exists there.

As there are multiple images specified, it is easier to disable this function.
This gets it to return immediately, with an "OK" as the return value.
Code:
dsidb S> as $PC jr $ra
dsidb S> step
dsidb S> as $PC addu $v0, $zero, $zero
dsidb S> cont
The browser loads on my DTL-T10000, with it being identified as a SCPH-10000.
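
For reference, the two instructions written with the as command can be encoded by hand. The sketch below uses the standard MIPS R-type layout; `mips_rtype`, `mips_jr` and `mips_addu` are helper names invented here. My reading is that the second instruction lands in the delay slot of the jr, so $v0 is zeroed on the way out.

```c
#include <assert.h>
#include <stdint.h>

/* MIPS register numbers used below. */
enum { REG_ZERO = 0, REG_V0 = 2, REG_RA = 31 };

/* Encode an R-type (SPECIAL, opcode 0) instruction. */
static uint32_t mips_rtype(unsigned rs, unsigned rt, unsigned rd,
                           unsigned shamt, unsigned funct)
{
    assert(rs < 32 && rt < 32 && rd < 32 && shamt < 32 && funct < 64);
    return ((uint32_t)rs << 21) | ((uint32_t)rt << 16) |
           ((uint32_t)rd << 11) | ((uint32_t)shamt << 6) | (uint32_t)funct;
}

/* jr rs: jump to the address held in rs. */
static uint32_t mips_jr(unsigned rs)
{
    return mips_rtype(rs, 0, 0, 0, 0x08);
}

/* addu rd, rs, rt: rd = rs + rt, without an overflow trap. */
static uint32_t mips_addu(unsigned rd, unsigned rs, unsigned rt)
{
    return mips_rtype(rs, rt, rd, 0, 0x21);
}
```

`mips_jr(REG_RA)` gives 0x03e00008, the same word that follows WaitSema's syscall in the first dump, and `mips_addu(REG_V0, REG_ZERO, REG_ZERO)` gives 0x00001021.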

So easy, right? Now just sink in about 4 hours of trying to figure out why it did not work and add in cursing and swearing.
I missed the part about "img0: img1:" and spent about 2-3 hours figuring out why its MCSERV module was missing (since its sceMcInit function was getting stuck at binding).
