I am looking into compression schemes for use on handheld as well as home consoles. I'm really looking for something that doesn't break the bank on CPU usage for decode; encode time (as long as it's not months) is OK. I've looked into:

GSM (the European mobile phone standard): 13.2 kbit/s
CELP (lots of flavours): typically 1.4 kbit/s (US mobiles, I think)
LPC10 (Speak & Spell, anyone?): 0.6 kbit/s

I have heard (no pun intended) that CELP can be variable rate, among other variations. Does anyone know of an asymmetrical implementation of this model?
I know tepples over at gbadev/dsdev.org has experimented with audio compression on the GBA/NDS and made a GSM player and an ADPCM player for the GBA (http://www.pineight.com/gba/). Maybe he could help?
I found this paper containing some information but no program code; maybe it's enough for you? http://devzone.wsnw.net/downloads/pdf/00540893.pdf
Unless you have severely limited space, I would say go with ADPCM (any table) for the quality: it's simple to encode and decode, it's built into most consoles, and you probably should be using ADPCM already for the non-voice parts of your game. How many seconds of phonemes do you have?
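For reference, the table-driven ADPCM scheme is small enough to sketch in full. Here is a minimal IMA ADPCM decoder in C, assuming the standard IMA step/index tables; console-specific variants differ in details but follow the same shape.

```c
/* Minimal IMA ADPCM decoder sketch. Step/index tables are the
 * standard IMA ones; real console variants may differ slightly. */
#include <stdint.h>

static const int8_t ima_index_table[16] = {
    -1, -1, -1, -1, 2, 4, 6, 8,
    -1, -1, -1, -1, 2, 4, 6, 8,
};

static const int16_t ima_step_table[89] = {
        7,     8,     9,    10,    11,    12,    13,    14,
       16,    17,    19,    21,    23,    25,    28,    31,
       34,    37,    41,    45,    50,    55,    60,    66,
       73,    80,    88,    97,   107,   118,   130,   143,
      157,   173,   190,   209,   230,   253,   279,   307,
      337,   371,   408,   449,   494,   544,   598,   658,
      724,   796,   876,   963,  1060,  1166,  1282,  1411,
     1552,  1707,  1878,  2066,  2272,  2499,  2749,  3024,
     3327,  3660,  4026,  4428,  4871,  5358,  5894,  6484,
     7132,  7845,  8630,  9493, 10442, 11487, 12635, 13899,
    15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794,
    32767
};

typedef struct { int predictor; int index; } adpcm_state;

/* Decode one 4-bit code into a 16-bit PCM sample. */
static int16_t adpcm_decode_nibble(adpcm_state *s, unsigned nibble)
{
    int step = ima_step_table[s->index];
    int diff = step >> 3;                 /* rounding term */
    if (nibble & 4) diff += step;
    if (nibble & 2) diff += step >> 1;
    if (nibble & 1) diff += step >> 2;
    if (nibble & 8) diff = -diff;         /* sign bit */

    s->predictor += diff;
    if (s->predictor >  32767) s->predictor =  32767;
    if (s->predictor < -32768) s->predictor = -32768;

    s->index += ima_index_table[nibble & 15];
    if (s->index < 0)  s->index = 0;
    if (s->index > 88) s->index = 88;

    return (int16_t)s->predictor;
}
```

No divides, no multiplies: just shifts, adds and two table lookups per sample, which is why it is so cheap on these CPUs.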
There may not be that MUCH speech, but the European version, for example, will have English, French, German, Italian and possibly more, so each saving is multiplied four or more times. I know that CELP decompression using an 8-dimensional lookup table works on the GBC...
What about speech synthesis? Though I've never coded one myself, so I don't know how much work would be involved. And it could be costly on the CPU when I think about it... but it saves a lot of space, at least.
Hee hee, I don't think speech synthesis (which is really just LPC10 snippets (formants)) is needed... It's hard to get any emotion into synthesis.
My current thrust in testing

OK, so I'm going down the CELP route. Speech sampled at 8 kHz with a 20 ms frame size & 5 ms code selection. I will use 512 lookups (256 fixed, 256 dynamic?).

Frame size = 160 samples
Codebook index = 9 x 4 bits
Pitch delay = 8 x 4 bits
Pitch filter coefficient = 5 x 4 bits
Gain = 5 x 4 bits
LP coefficients = 10 x 5 bits

Four of the above per frame = 632 bits = 79 bytes. 79 x 50 (frames per second) = 3950 bytes per second, roughly 3.86 KByte/second.

A lot of work needs doing in training the fixed lookups. If anyone has some 'hot' information on training strategies, I would like to hear from them. When I get this prototyped on the PC, I will put up some examples so people can hear what it's like... I wonder if there would be a market for this.

The decode uses a lot of divide instructions, so I will use fixed point. Still, it's a serious issue, since the ARM7TDMI as used in the GBA doesn't have a divide instruction. Has anyone worked out an optimal divide for this processor? The condition codes avoid branching, but it's still going to eat quite a few cycles. I'm assuming the FASTEST divide is <shift, conditional subtract> repeated 32 times (so 64 cycles)? I'm not sure, but I wonder if (at the expense of space) I could use a rather bulky table of logarithms? The LP coefficients could be made smaller using DPCM or VQ, but the papers I read say this reduces quality an awful lot.
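The <shift, conditional subtract> loop mentioned above looks like this in C (a restoring divide, sketched here for clarity; on the ARM7TDMI each step can compile to a couple of conditional instructions, so the loop body needs no branches):

```c
#include <stdint.h>

/* Classic shift-and-conditional-subtract (restoring) divide, 32 steps.
 * Caller must ensure den != 0. If both quotient and remainder are
 * needed, this gives them in one pass. */
static uint32_t div_u32(uint32_t num, uint32_t den, uint32_t *rem)
{
    uint32_t quot = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((num >> i) & 1);   /* shift next bit in */
        if (r >= den) {                    /* conditional subtract */
            r -= den;
            quot |= 1u << i;
        }
    }
    if (rem) *rem = r;
    return quot;
}
```

If the numerator range is known to be small, the loop can start from fewer bits and save cycles, at which point the "bulky table of logarithms" may not be worth the space.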
Advances in CELP and a possible video codec...

I'm now struggling with the use of a mixed fixed/dynamic codebook. 256+256 seems the optimal split, but re-assigning the dynamic codes is a tricky one. Using data from previous frames is obvious, but maybe a bit could be spared to decide IF an entry should be used. But how to allocate? Simply removing the oldest is easiest, but not best. Maybe another 8 bits to decide? Extra bits, yes, but worth it if the overall bit rate can be lowered? Remember, compression time is not really an issue, but the decode needs to be FAST. Surely someone else has considered this type of compression for a game? I wonder, is CELP patented, or just certain concepts? If I could get this working on an ARM7TDMI @ 16.78 MHz (i.e. a GBA CPU) then I imagine it might be a commercial product.

I'm also considering a video codec based on the MP4 ideas. Has anyone worked on wavelets for video compression yet? Please, any input would be gladly received, I need people to bounce these ideas off... Thanks, Sean ;-)
Ogg has an experimental codec called Tarkin that is supposed to use wavelets, but according to this Wikipedia article it has been put on hold: http://en.wikipedia.org/wiki/Tarkin_(codec)#Ogg_codecs
I'm thinking of using reciprocals for a lot of the data so divides can be swapped for multiplies, which makes for a happier CPU-usage profile...
This is the basic principle, as I attempted to say before: I think I can store the reciprocals and use multiplies rather than divides.
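A minimal sketch of the reciprocal idea, assuming a Q16 fixed-point format (the format and function names are illustrative, not from any particular codec):

```c
#include <stdint.h>

/* Replace x / d with a multiply by a precomputed fixed-point
 * reciprocal. recip_q16(d) is computed once (e.g. stored in a table
 * indexed by d), then each runtime divide becomes one multiply and
 * one shift. */
static uint32_t recip_q16(uint32_t d)      /* build time, d != 0 */
{
    /* Round up so the truncated product still floors correctly
     * for small numerators. */
    return ((1u << 16) + d - 1) / d;
}

static uint32_t div_by_recip(uint32_t x, uint32_t recip)
{
    return (uint32_t)(((uint64_t)x * recip) >> 16);
}
```

The caveat: with 16 fractional bits the result is only guaranteed exact while the numerator stays roughly below 2^16/d, so the shift amount has to be chosen to match the actual coefficient range, and overflow of the intermediate product watched (hence the 64-bit widening here).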
MIPS for encoding & decoding speech codecs

CPU cycles per second for 8 kHz sampled sound. Compression % is relative to 16-bit PCM.

codec          encode    decode   compression
u-law:         42K       40K      50%
ADPCM:         407K      330K     75%
GSM:           2.0M      950K     89.7%
LPC:           2.5M      1.0M     96.3%
CELP 4.5K:     24-52M*   4.4M     96.5%
CELP 3.0K:     25-47M*   4.0M     97.7%
LPC-10:        6.4M      3.5M     98.1%
CELP 2.3K:     24-45M*   3.8M     98.2%
OpenLPC 1.8K:  2.9M      1.8M     98.6%
OpenLPC 1.4K:  2.9M      1.9M     98.9%

*Note on CELP encoding: CELP uses a codebook of 256 speech patterns. The CELP encoding figures listed cover codebook searches from 32 up to the full 256 entries.

I intend to use a 512-entry codebook with 256 fixed & 256 dynamic entries. That will require a LOT of CPU horsepower to encode, but as the table shows, only 4.4 MIPS to decode. I imagine I can speed up the decode somewhat at the expense of a little of the compression ratio.

For the DS, I would like to allow real-time encoding as well. With such a low bit rate, you could have an 8-player game with everyone speaking at once (which would be nice), but obviously the encoder would then have to run on the ARM7. With 3D fill hardware, what do people run out of first: CPU power or draw cycles?

Oh, and codebook size improves quality but increases encoding MIPS; that's not a problem for simple decompression, so I may well go for a 1024-entry codebook. I could organize the real-time encoder to use only 128 or 256 entries by placing the most general codes at the beginning of the table. Stochastically learned vectors seem like the way to go, but I need the right start points for the training. I guess I will have to run a lot of different samples through it and average...
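The asymmetry in the table comes from the codebook search on the encode side. A simplified exhaustive search, minimising mean-squared error only (a real CELP analysis-by-synthesis loop also filters each codeword and optimises the gain; SUBFRAME and the function name are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

#define SUBFRAME 40   /* 5 ms at 8 kHz, matching the frame layout above */

/* Exhaustive codebook search: encode cost is linear in the number of
 * entries searched (hence the 24-52 MIPS spread for 32-256 entries),
 * while decode only does a single table lookup. */
static int codebook_search(const int16_t target[SUBFRAME],
                           const int16_t *codebook, int entries)
{
    int best = 0;
    int64_t best_err = INT64_MAX;
    for (int i = 0; i < entries; i++) {
        const int16_t *cw = codebook + (size_t)i * SUBFRAME;
        int64_t err = 0;
        for (int j = 0; j < SUBFRAME; j++) {
            int32_t d = target[j] - cw[j];
            err += (int64_t)d * d;
        }
        if (err < best_err) { best_err = err; best = i; }
    }
    return best;
}
```

Placing the most general codes first, as suggested above, means a real-time encoder can simply pass a smaller `entries` value and still index into the same table.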
Of course, since this is coming from an error-free source (the ROM), I can remove the (15,11) Hamming code used by CELP. It doesn't save much CPU time, but it reduces the size of the data somewhat...
I am in the midst of recoding the US standard CELP to work in fixed point. I'm thinking I will use a 1024-entry fixed codebook, generated stochastically. I think a codebook per speaker might work well. It adds some bulk, but if you are getting speech at 600 bytes per second and you put a LOT into it, then it will be a drop in the ocean. I hope I'm not reinventing the wheel here... Oh, I think I can get rid of a lot of the divides (if not all) by replacing them with cross-multiplication. Must watch for overflows...
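The cross-multiplication trick, sketched: wherever the code only compares two ratios (e.g. picking the better of two gain candidates), compare the cross products instead of dividing. Widening to 64 bits is the overflow guard mentioned above; denominators are assumed positive here.

```c
#include <stdint.h>

/* Is num0/den0 > num1/den1?  Equivalent to num0*den1 > num1*den0
 * when den0, den1 > 0 -- no divide needed. The 64-bit intermediates
 * prevent the 32x32 products from overflowing. */
static int ratio_gt(int32_t num0, int32_t den0,
                    int32_t num1, int32_t den1)
{
    return (int64_t)num0 * den1 > (int64_t)num1 * den0;
}
```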
It's going well...

Well, I have a fixed-point version of the CELP code working. I settled on a 1024-entry fixed table, so it takes a long time to compress (about 5 seconds per 1 second of audio on a 2 GHz PC), but the decode should be fine on a GBA.

As my first demo, I'm going to sample William S. Burroughs reading his book 'Junky', which comes on 3 CDs. It will use 6 MBytes, so the extra space will go on some images, spot FX and a transcript of the book. It might be a good way to learn English. If I provide the transcript, would people volunteer to convert it into French & German (a must), but ideally also Dutch, Danish, Spanish, Italian, Russian & so on? With a simple Huffman code, the text will average about 2.5 bits per character.
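That bits-per-character figure is easy to estimate from a frequency count: for an optimal Huffman code, the total output size in bits equals the sum of the merged weights across all merge steps, so no tree needs building. A small sketch (the function name is illustrative; the O(n^2) minimum selection is fine for a character-sized alphabet):

```c
/* Total bits an optimal Huffman code needs for symbols with the
 * given frequencies. Each merge of the two smallest weights adds
 * one bit to every symbol beneath it, so summing the merged weights
 * gives the total. Destroys the freq[] array. */
static long huffman_total_bits(long freq[], int n)
{
    long total = 0;
    while (n > 1) {
        int a = 0, b = 1;                 /* indices of two smallest */
        if (freq[b] < freq[a]) { a = 1; b = 0; }
        for (int i = 2; i < n; i++) {
            if (freq[i] < freq[a])      { b = a; a = i; }
            else if (freq[i] < freq[b]) { b = i; }
        }
        long merged = freq[a] + freq[b];
        total += merged;
        freq[a] = merged;                 /* keep merged node... */
        freq[b] = freq[n - 1];            /* ...and drop the other */
        n--;
    }
    return total;
}
```

Dividing the result by the total character count gives the average bits per character for a given transcript.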
Cool. You shouldn't worry about the encoding taking a long time, rather about the decoding in-game taking a long time.