Divides on the ARM7TDMI

Discussion in 'Nintendo Game Development' started by Piglet, Jun 9, 2008.

  1. Piglet

    Piglet Spirited Member

    Joined:
    May 28, 2008
    Messages:
    175
    Likes Received:
    0
    The ARM7TDMI is a great processor. My only gripe is that there is no divide instruction. Now, The inner-loop of a divide in ARM7TDMI code is 4 instructions so even if unrolled then it's going to take a Loooonnnggg time. OK, I know most people doing 3D on the thing will point out that drawing is the biggest overhead (especially with textures) but I'm looking at CELP speech decoding and for my purposes, it IS most decidedly the slowest part.

    I mean, you can find the MSB and index your jump into the code (in fact, if you do this, maybe you can have a 3-instructions per bit method) but even allowing for this, it may take 96 cycles + setup overheads.

    I did consider using a log/antilog table because I don't need totally perfect accuracy. Maybe combining the two will be fast and accurate....

    input, anyone?

    Sean;-)
     
  2. Piglet

    Piglet Spirited Member

    Joined:
    May 28, 2008
    Messages:
    175
    Likes Received:
    0
    Here is the GNU implementation

    ; Enter with numbers in Ra and Rb.
    MOV Rcnt,#1 ; Bit to control the division.
    Div1

    CMP Rb,#0x80000000 ; Move Rb until greater than Ra.
    CMPCC Rb,Ra
    MOVCC Rb,Rb,ASL#1
    MOVCC Rcnt,Rcnt,ASL#1
    BCC Div1
    MOV Rc,#0

    Div2
    CMP Ra,Rb ; Test for possible subtraction.
    SUBCS Ra,Ra,Rb ; Subtract if ok,
    ADDCS Rc,Rc,Rcnt ; Put relevant bit into result
    MOVS Rcnt,Rcnt,LSR#1 ; Shift control bit
    MOVNE Rb,Rb,LSR#1 ; Halve unless finished.
    BNE Div2 ; Divide result in Rc, remainder in Ra.

    The Div1 loop can be replaced with a matrix of:

    CMP Ra,Rb,ASL (x)

    with branches

    The Div2 loop can similarly loose the CMP Ra,RB. Since it's unrolled and you branch into it at the right place, you can also loos MOVS Rcnt,Rcnt,LSR#1, the MOVNO Rb,Rb,LSR #1 and the BNE Div2

    Therefore you end up with a lot of:

    AND rd,rb,#1 LSL (x)
    ADD rc,rc,rd

    So, instead if shifting, you bit-test Rb and place the result into the NEW register Rd. You would need 31 of these for a 16.16/16.16 divide and some normalisation at the end. I imagine a divide would then take a fixed amount of time. About 80 cycles at a guess...
     
  3. Piglet

    Piglet Spirited Member

    Joined:
    May 28, 2008
    Messages:
    175
    Likes Received:
    0
    Sorry, one one point I was talking rubbish. The number of cycles is NOT fixed. 80 would be maximum since it branches into the unrolled and pared down Div2 'loop'. Later tonight I will get it working (on paper at least) so I will be able to give a proper cycle-count and such.

    Of course, since the multiply gives a 64 bit result (if you want), then for most applications, the reSPAMSPAMSPAMSPAMSPAMcal of the values can be used and then only multiplies are needed.

    I remember talking to a representative of the 3DFX 3-D card manufacturers who explained that rather than a Z buffer, they used a W buffer, which was 1/Z.... so you multiply UP rather than divide down to find the Z...

    This was about 14 years ago and I know I was impressed at the time...
     
  4. marshallh

    marshallh N64 Coder

    Joined:
    Mar 16, 2006
    Messages:
    661
    Likes Received:
    26
    The W-buffer is actually a very good idea and concept. I don't know why it wasn't adopted more widely, I suppose it was probably too late in the development of the modern graphics pipline.

    Direct3D supported it and tried to mainstream it, I used it in the software reference rasterizer mode and it totally eliminated z-fighting (rounding errors and loss of precision in the zbuffer causing objects to overlap)

    If I was doing this, I'd probably just use a lookup table and some rough linear interpolation.
     
  5. DevL

    DevL Robust Member

    Joined:
    Jul 9, 2006
    Messages:
    213
    Likes Received:
    0
    are we talking in DS context here ?
    Hint May be we can count on Co-Processor in divide...
     
  6. Piglet

    Piglet Spirited Member

    Joined:
    May 28, 2008
    Messages:
    175
    Likes Received:
    0
    I'm thinking of the GBA with it's ARM7TDMI... so no FPU.
     
  7. DevL

    DevL Robust Member

    Joined:
    Jul 9, 2006
    Messages:
    213
    Likes Received:
    0
  8. Piglet

    Piglet Spirited Member

    Joined:
    May 28, 2008
    Messages:
    175
    Likes Received:
    0
    Thanks for the link, it's most helpful. Can anyone comment on the source I produced? Can it be made faster, for example? I know it doesn't give a remainder, but a quick multuply and subtract will do that, one assumes...
     
  9. blasty

    blasty Member

    Joined:
    Jul 18, 2006
    Messages:
    5
    Likes Received:
    0
    There are also a bunch of SWI's for division if I recall correctly. Dunno how costy those are. And try to avoid floats at all costs, since there's no FPU.
     
sonicdude10
Draft saved Draft deleted
Insert every image as a...
  1.  0%

Share This Page