I am looking into compression schemes for use on handheld as well as home consoles. I'm really looking for something that doesn't break the bank on CPU usage for decode; encode time (as long as it's not months) is OK. I've looked into:

GSM (the European mobile phone standard): 13.2 kbit/s
CELP (lots of flavours): typically 1.4 kbit/s (US mobiles, I think)
LPC10 (Speak & Spell, anyone?): 0.6 kbit/s

I have heard (no pun intended) that CELP can be variable rate, among other variations. Does anyone know of an asymmetrical implementation of this model?
I know tepples over at gbadev/dsdev.org has experimented with audio compression on the GBA/NDS and made a GSM player and an ADPCM player for the GBA (http://www.pineight.com/gba/). Maybe he could help?
I found this paper containing some information but no program code; maybe it's enough for you? http://devzone.wsnw.net/downloads/pdf/00540893.pdf
Unless you have severely limited space, I would say go with ADPCM (any table) for the quality: it's simple to encode and decode, it's built into most consoles, and you probably should be using ADPCM already for the non-voice parts of your game. How many seconds of phonemes do you have?
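For reference, the table-driven ADPCM scheme is small enough to sketch in full. Here is a minimal IMA ADPCM decoder in C, assuming the standard IMA step/index tables; console-specific variants differ in details but follow the same shape.

```c
/* Minimal IMA ADPCM decoder sketch. Step/index tables are the
 * standard IMA ones; real console variants may differ slightly. */
#include <stdint.h>

static const int8_t ima_index_table[16] = {
    -1, -1, -1, -1, 2, 4, 6, 8,
    -1, -1, -1, -1, 2, 4, 6, 8,
};

static const int16_t ima_step_table[89] = {
        7,     8,     9,    10,    11,    12,    13,    14,
       16,    17,    19,    21,    23,    25,    28,    31,
       34,    37,    41,    45,    50,    55,    60,    66,
       73,    80,    88,    97,   107,   118,   130,   143,
      157,   173,   190,   209,   230,   253,   279,   307,
      337,   371,   408,   449,   494,   544,   598,   658,
      724,   796,   876,   963,  1060,  1166,  1282,  1411,
     1552,  1707,  1878,  2066,  2272,  2499,  2749,  3024,
     3327,  3660,  4026,  4428,  4871,  5358,  5894,  6484,
     7132,  7845,  8630,  9493, 10442, 11487, 12635, 13899,
    15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794,
    32767
};

typedef struct { int predictor; int index; } adpcm_state;

/* Decode one 4-bit code into a 16-bit PCM sample. */
static int16_t adpcm_decode_nibble(adpcm_state *s, unsigned nibble)
{
    int step = ima_step_table[s->index];
    int diff = step >> 3;                 /* rounding term */
    if (nibble & 4) diff += step;
    if (nibble & 2) diff += step >> 1;
    if (nibble & 1) diff += step >> 2;
    if (nibble & 8) diff = -diff;         /* sign bit */

    s->predictor += diff;
    if (s->predictor >  32767) s->predictor =  32767;
    if (s->predictor < -32768) s->predictor = -32768;

    s->index += ima_index_table[nibble & 15];
    if (s->index < 0)  s->index = 0;
    if (s->index > 88) s->index = 88;

    return (int16_t)s->predictor;
}
```

No divides, no multiplies: just shifts, adds and two table lookups per sample, which is why it is so cheap on these CPUs.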
There may not be that MUCH speech, but the European version, for example, will have English, French, German, Italian and possibly more, so each saving is multiplied four or more times. I know that CELP decompression using an 8-dimensional lookup table works on the GBC...
What about speech synthesis? Though I've never coded one myself, so I don't know how much work would be involved. And it could be costly on the CPU when I think about it... but it saves a lot of space, at least.
Hee hee, I don't think speech synthesis (which is really just LPC10 snippets (formants)) is needed... It's hard to get any emotion into synthesis.
My current thrust in testing

OK, so I'm going down the CELP route. Speech sampled at 8 kHz with a 20 ms frame size & 5 ms code selection. I will use 512 lookups (256 fixed, 256 dynamic?).

Frame size = 160 samples
Codebook index = 9 x 4 bits
Pitch delay = 8 x 4 bits
Pitch filter coefficient = 5 x 4 bits
Gain = 5 x 4 bits
LP coefficients = 10 x 5 bits

Four of the above per frame = 632 bits = 79 bytes. 79 x 50 (frames per second) = 3950 bytes per second, roughly 3.86 KByte/second.

A lot of work needs doing in training the fixed lookups. If anyone has some 'hot' information on training strategies, I would like to hear from them. When I get this prototyped on the PC, I will put up some examples so people can hear what it's like... I wonder if there would be a market for this.

The decode uses a lot of divide instructions, so I will use fixed point. Still, it's a serious issue, since the ARM7TDMI as used in the GBA doesn't have a divide instruction. Has anyone worked out an optimal divide for this processor? The condition codes avoid branching, but it's still going to eat quite a few cycles. I'm assuming the FASTEST divide is <shift, conditional subtract> repeated 32 times (so 64 cycles)? I'm not sure, but I wonder if (at the expense of space) I could use a rather bulky table of logarithms? The LP coefficients could be made smaller using DPCM or VQ, but the papers I read say this reduces quality an awful lot.
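The <shift, conditional subtract> loop mentioned above looks like this in C (a restoring divide, sketched here for clarity; on the ARM7TDMI each step can compile to a couple of conditional instructions, so the loop body needs no branches):

```c
#include <stdint.h>

/* Classic shift-and-conditional-subtract (restoring) divide, 32 steps.
 * Caller must ensure den != 0. If both quotient and remainder are
 * needed, this gives them in one pass. */
static uint32_t div_u32(uint32_t num, uint32_t den, uint32_t *rem)
{
    uint32_t quot = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((num >> i) & 1);   /* shift next bit in */
        if (r >= den) {                    /* conditional subtract */
            r -= den;
            quot |= 1u << i;
        }
    }
    if (rem) *rem = r;
    return quot;
}
```

If the numerator range is known to be small, the loop can start from fewer bits and save cycles, at which point the "bulky table of logarithms" may not be worth the space.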
Advances in CELP and a possible video codec...

I'm now struggling with the use of a mixed fixed/dynamic codebook. 256+256 seems the optimal split, but re-assigning the dynamic codes is a tricky one. Using data from previous frames is obvious, but maybe a bit could be spared to decide IF an entry should be used. But how to allocate? Simply removing the oldest is easiest, but not best. Maybe another 8 bits to decide? Extra bits, yes, but worth it if the overall bit rate can be lowered? Remember, compression time is not really an issue, but the decode needs to be FAST. Surely someone else has considered this type of compression for a game? I wonder, is CELP patented, or just certain concepts? If I could get this working on an ARM7TDMI @ 16.78 MHz (i.e. a GBA CPU) then I imagine it might be a commercial product.

I'm also considering a video codec based on the MP4 ideas. Has anyone worked on wavelets for video compression yet? Please, any input would be gladly received, I need people to bounce these ideas off... Thanks, Sean ;-)
Ogg has an experimental codec called Tarkin that is supposed to use wavelets, but according to this Wikipedia article it has been put on hold: http://en.wikipedia.org/wiki/Tarkin_(codec)#Ogg_codecs
I'm thinking of using reciprocals for a lot of the data so divides can be swapped for multiplies, which makes for a happier CPU-usage profile...
This is the basic principle, as I attempted to say before: I think I can store the reciprocals and use multiplies rather than divides.
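A minimal sketch of the reciprocal idea, assuming a Q16 fixed-point format (the format and function names are illustrative, not from any particular codec):

```c
#include <stdint.h>

/* Replace x / d with a multiply by a precomputed fixed-point
 * reciprocal. recip_q16(d) is computed once (e.g. stored in a table
 * indexed by d), then each runtime divide becomes one multiply and
 * one shift. */
static uint32_t recip_q16(uint32_t d)      /* build time, d != 0 */
{
    /* Round up so the truncated product still floors correctly
     * for small numerators. */
    return ((1u << 16) + d - 1) / d;
}

static uint32_t div_by_recip(uint32_t x, uint32_t recip)
{
    return (uint32_t)(((uint64_t)x * recip) >> 16);
}
```

The caveat: with 16 fractional bits the result is only guaranteed exact while the numerator stays roughly below 2^16/d, so the shift amount has to be chosen to match the actual coefficient range, and overflow of the intermediate product watched (hence the 64-bit widening here).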
MIPS for encoding & decoding speech codecs

CPU cycles per second for 8 kHz sampled sound. Compression % is relative to 16-bit PCM.

codec          encode    decode   compression
u-law:         42K       40K      50%
ADPCM:         407K      330K     75%
GSM:           2.0M      950K     89.7%
LPC:           2.5M      1.0M     96.3%
CELP 4.5K:     24-52M*   4.4M     96.5%
CELP 3.0K:     25-47M*   4.0M     97.7%
LPC-10:        6.4M      3.5M     98.1%
CELP 2.3K:     24-45M*   3.8M     98.2%
OpenLPC 1.8K:  2.9M      1.8M     98.6%
OpenLPC 1.4K:  2.9M      1.9M     98.9%

*Note on CELP encoding: CELP uses a codebook of 256 speech patterns. The CELP encoding figures listed cover codebook searches from 32 up to the full 256 entries.

I intend to use a 512-entry codebook with 256 fixed & 256 dynamic entries. That will require a LOT of CPU horsepower to encode, but as the table shows, only 4.4 MIPS to decode. I imagine I can speed up the decode somewhat at the expense of a little of the compression ratio.

For the DS, I would like to allow real-time encoding as well. With such a low bit rate, you could have an 8-player game with everyone speaking at once (which would be nice), but obviously the encoder would then have to run on the ARM7. With 3D fill hardware, what do people run out of first: CPU power or draw cycles?

Oh, and codebook size improves quality but increases encoding MIPS; that's not a problem for simple decompression, so I may well go for a 1024-entry codebook. I could organize the real-time encoder to use only 128 or 256 entries by placing the most general codes at the beginning of the table. Stochastically learned vectors seem like the way to go, but I need the right start points for the training. I guess I will have to run a lot of different samples through it and average...
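The asymmetry in the table comes from the codebook search on the encode side. A simplified exhaustive search, minimising mean-squared error only (a real CELP analysis-by-synthesis loop also filters each codeword and optimises the gain; SUBFRAME and the function name are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

#define SUBFRAME 40   /* 5 ms at 8 kHz, matching the frame layout above */

/* Exhaustive codebook search: encode cost is linear in the number of
 * entries searched (hence the 24-52 MIPS spread for 32-256 entries),
 * while decode only does a single table lookup. */
static int codebook_search(const int16_t target[SUBFRAME],
                           const int16_t *codebook, int entries)
{
    int best = 0;
    int64_t best_err = INT64_MAX;
    for (int i = 0; i < entries; i++) {
        const int16_t *cw = codebook + (size_t)i * SUBFRAME;
        int64_t err = 0;
        for (int j = 0; j < SUBFRAME; j++) {
            int32_t d = target[j] - cw[j];
            err += (int64_t)d * d;
        }
        if (err < best_err) { best_err = err; best = i; }
    }
    return best;
}
```

Placing the most general codes first, as suggested above, means a real-time encoder can simply pass a smaller `entries` value and still index into the same table.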
Of course, since this is coming from an error-free source (the ROM), I can remove the (15,11) Hamming code used by CELP. It doesn't save much CPU time, but it reduces the size of the data somewhat...
I am in the midst of recoding the US standard CELP to work in fixed point. I'm thinking I will use a 1024-entry fixed codebook, generated stochastically. I think a codebook per speaker might work well. It adds some bulk, but if you are getting speech at 600 bytes per second and you put a LOT into it, then it will be a drop in the ocean. I hope I'm not reinventing the wheel here... Oh, I think I can get rid of a lot of the divides (if not all) by replacing them with cross-multiplication. Must watch for overflows...
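The cross-multiplication trick, sketched: wherever the code only compares two ratios (e.g. picking the better of two gain candidates), compare the cross products instead of dividing. Widening to 64 bits is the overflow guard mentioned above; denominators are assumed positive here.

```c
#include <stdint.h>

/* Is num0/den0 > num1/den1?  Equivalent to num0*den1 > num1*den0
 * when den0, den1 > 0 -- no divide needed. The 64-bit intermediates
 * prevent the 32x32 products from overflowing. */
static int ratio_gt(int32_t num0, int32_t den0,
                    int32_t num1, int32_t den1)
{
    return (int64_t)num0 * den1 > (int64_t)num1 * den0;
}
```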
It's going well...

Well, I have a fixed-point version of the CELP code working. I settled on a 1024-entry fixed table, so it takes a long time to compress (about 5 seconds per 1 second of audio on a 2 GHz PC), but the decode should be fine on a GBA.

As my first demo, I'm going to sample William S. Burroughs reading his book 'Junky', which comes on 3 CDs. It will use 6 MBytes, so the extra space will go on some images, spot FX and a transcript of the book. It might be a good way to learn English. If I provide the transcript, would people volunteer to convert it into French & German (a must), but ideally also Dutch, Danish, Spanish, Italian, Russian & so on? With a simple Huffman code, the text will average about 2.5 bits per character.
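That bits-per-character figure is easy to estimate from a frequency count: for an optimal Huffman code, the total output size in bits equals the sum of the merged weights across all merge steps, so no tree needs building. A small sketch (the function name is illustrative; the O(n^2) minimum selection is fine for a character-sized alphabet):

```c
/* Total bits an optimal Huffman code needs for symbols with the
 * given frequencies. Each merge of the two smallest weights adds
 * one bit to every symbol beneath it, so summing the merged weights
 * gives the total. Destroys the freq[] array. */
static long huffman_total_bits(long freq[], int n)
{
    long total = 0;
    while (n > 1) {
        int a = 0, b = 1;                 /* indices of two smallest */
        if (freq[b] < freq[a]) { a = 1; b = 0; }
        for (int i = 2; i < n; i++) {
            if (freq[i] < freq[a])      { b = a; a = i; }
            else if (freq[i] < freq[b]) { b = i; }
        }
        long merged = freq[a] + freq[b];
        total += merged;
        freq[a] = merged;                 /* keep merged node... */
        freq[b] = freq[n - 1];            /* ...and drop the other */
        n--;
    }
    return total;
}
```

Dividing the result by the total character count gives the average bits per character for a given transcript.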
Cool. You shouldn't worry about the encoding taking a long time, rather about the decoding in-game taking a long time.