The ARM7TDMI is a great processor. My only gripe is that there is no divide instruction. Now, The inner-loop of a divide in ARM7TDMI code is 4 instructions so even if unrolled then it's going to take a Loooonnnggg time. OK, I know most people doing 3D on the thing will point out that drawing is the biggest overhead (especially with textures) but I'm looking at CELP speech decoding and for my purposes, it IS most decidedly the slowest part. I mean, you can find the MSB and index your jump into the code (in fact, if you do this, maybe you can have a 3-instructions per bit method) but even allowing for this, it may take 96 cycles + setup overheads. I did consider using a log/antilog table because I don't need totally perfect accuracy. Maybe combining the two will be fast and accurate.... input, anyone? Sean;-)
Here is the GNU implementation ; Enter with numbers in Ra and Rb. MOV Rcnt,#1 ; Bit to control the division. Div1 CMP Rb,#0x80000000 ; Move Rb until greater than Ra. CMPCC Rb,Ra MOVCC Rb,Rb,ASL#1 MOVCC Rcnt,Rcnt,ASL#1 BCC Div1 MOV Rc,#0 Div2 CMP Ra,Rb ; Test for possible subtraction. SUBCS Ra,Ra,Rb ; Subtract if ok, ADDCS Rc,Rc,Rcnt ; Put relevant bit into result MOVS Rcnt,Rcnt,LSR#1 ; Shift control bit MOVNE Rb,Rb,LSR#1 ; Halve unless finished. BNE Div2 ; Divide result in Rc, remainder in Ra. The Div1 loop can be replaced with a matrix of: CMP Ra,Rb,ASL (x) with branches The Div2 loop can similarly loose the CMP Ra,RB. Since it's unrolled and you branch into it at the right place, you can also loos MOVS Rcnt,Rcnt,LSR#1, the MOVNO Rb,Rb,LSR #1 and the BNE Div2 Therefore you end up with a lot of: AND rd,rb,#1 LSL (x) ADD rc,rc,rd So, instead if shifting, you bit-test Rb and place the result into the NEW register Rd. You would need 31 of these for a 16.16/16.16 divide and some normalisation at the end. I imagine a divide would then take a fixed amount of time. About 80 cycles at a guess...
Sorry, one one point I was talking rubbish. The number of cycles is NOT fixed. 80 would be maximum since it branches into the unrolled and pared down Div2 'loop'. Later tonight I will get it working (on paper at least) so I will be able to give a proper cycle-count and such. Of course, since the multiply gives a 64 bit result (if you want), then for most applications, the reSPAMSPAMSPAMSPAMSPAMcal of the values can be used and then only multiplies are needed. I remember talking to a representative of the 3DFX 3-D card manufacturers who explained that rather than a Z buffer, they used a W buffer, which was 1/Z.... so you multiply UP rather than divide down to find the Z... This was about 14 years ago and I know I was impressed at the time...
The W-buffer is actually a very good idea and concept. I don't know why it wasn't adopted more widely, I suppose it was probably too late in the development of the modern graphics pipline. Direct3D supported it and tried to mainstream it, I used it in the software reference rasterizer mode and it totally eliminated z-fighting (rounding errors and loss of precision in the zbuffer causing objects to overlap) If I was doing this, I'd probably just use a lookup table and some rough linear interpolation.
Thanks for the link, it's most helpful. Can anyone comment on the source I produced? Can it be made faster, for example? I know it doesn't give a remainder, but a quick multuply and subtract will do that, one assumes...
There are also a bunch of SWI's for division if I recall correctly. Dunno how costy those are. And try to avoid floats at all costs, since there's no FPU.