Asymmetrical Speech Compression

Discussion in 'Game Development General Discussion' started by Piglet, Jun 4, 2008.

  1. Piglet

    Piglet Spirited Member

    Joined:
    May 28, 2008
    Messages:
    175
    Likes Received:
    0
    I am looking into compression schemes for use on handheld as well as home consoles. I'm really looking for something that doesn't break the bank on CPU usage for decode; a slow encode is OK (as long as it doesn't take months).

    I've looked into

    GSM (European Mobile Phone Standard) which is 13.2 Kbit per second
    CELP (lots of flavours) typically 1.4 KBit per second (US Mobiles, I think)
    LPC10 (Speak & Spell anyone?) 0.6 KBit per second

    I have heard (no pun intended) that CELP can be variable rate and can be changed in other ways. Does anyone know of an asymmetrical implementation of this model?
     
  2. babu

    babu Mamihlapinatapai

    Joined:
    Apr 15, 2005
    Messages:
    2,945
    Likes Received:
    3
    I know tepples over at gbadev/dsdev.org has done some experiments with audio compression on the GBA/NDS and made a GSM player and an ADPCM player for the GBA (http://www.pineight.com/gba/). Maybe he could help?
     
    Last edited: Jun 4, 2008
  3. devzone

    devzone Robust Member

    Joined:
    May 20, 2008
    Messages:
    248
    Likes Received:
    1
  4. Calpis

    Calpis Champion of the Forum

    Joined:
    Mar 13, 2004
    Messages:
    5,906
    Likes Received:
    21
    Unless you have severely limited space I would say go with ADPCM (any table) for the quality, because it's simple to encode and decode, because it's built into most consoles and because you probably should be using ADPCM already for non-voice parts of your game. How many seconds of phoneme do you have?
     
    Last edited: Jun 4, 2008
  5. Piglet

    Piglet Spirited Member

    Joined:
    May 28, 2008
    Messages:
    175
    Likes Received:
    0
    There may not be that MUCH speech, but for example the European version will have:

    English
    French
    German
    Italian

    possibly some more, so each saving is multiplied by 4 or more. I know that CELP decompression using an 8-dimensional lookup table works on GBC...
     
  6. babu

    babu Mamihlapinatapai

    Joined:
    Apr 15, 2005
    Messages:
    2,945
    Likes Received:
    3
    What about speech synthesis? :D
    Though I've never coded one myself, so I don't know how much work would be involved. And it could probably be costly on the CPU when I think about it.. but it saves a lot of space at least ;)
     
    Last edited: Jun 5, 2008
  7. Piglet

    Piglet Spirited Member

    Joined:
    May 28, 2008
    Messages:
    175
    Likes Received:
    0
    Hee hee, I don't think speech synthesis (which is actually LPC10 snippets (formants)) is needed... It's hard to get any emotion in synthesis.
     
  8. Calpis

    Calpis Champion of the Forum

    Joined:
    Mar 13, 2004
    Messages:
    5,906
    Likes Received:
    21
    IIRC Tokimeki Memorial 2 had good speech synthesis.
     
  9. Piglet

    Piglet Spirited Member

    Joined:
    May 28, 2008
    Messages:
    175
    Likes Received:
    0
    My current thrust in testing

    OK, so I'm going down the CELP route. Speech sampled at 8KHz with 20ms frame size & 5ms code selection. I will use 512 lookups (256 fixed, 256 dynamic?)

    Frame Size = 160 bytes

    codebook index = 9x4 bits
    pitch delay = 8x4 bits
    pitch filter co-efficient = 5x4 bits
    gain = 5x4 bits
    LP co-efficient = 10x5 bits

    Four of the above per frame = 79 bytes.
    79x50 (frames per second) = 3950 bytes per second.

    = 3.86 KByte/second (31,600 bits per second).
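
A quick check of the arithmetic above, taking the field sizes exactly as written. The dictionary layout and the function names are illustrative assumptions, not part of any CELP standard. Note that 3950 bytes per second is 31,600 bits per second. If the "x4" factors already count the four 5 ms sub-frames of a 20 ms frame (an assumption, the post does not say), then one set of 158 bits would cover a whole frame and the rate would be 7,900 bits per second instead.

```python
# Field sizes copied from the post as (a, b) pairs; only the product a*b is used.
FIELDS = {
    "codebook index": (9, 4),
    "pitch delay": (8, 4),
    "pitch filter coefficient": (5, 4),
    "gain": (5, 4),
    "LP coefficient": (10, 5),
}
SETS_PER_FRAME = 4       # "Four of the above per frame"
FRAMES_PER_SECOND = 50   # 20 ms frames


def bits_per_set(fields=FIELDS):
    """Total bits in one set of parameters."""
    return sum(a * b for a, b in fields.values())


def bytes_per_frame():
    """Bytes per 20 ms frame when four sets are stored per frame."""
    return bits_per_set() * SETS_PER_FRAME // 8


def bytes_per_second():
    return bytes_per_frame() * FRAMES_PER_SECOND


def bits_per_second():
    return bits_per_set() * SETS_PER_FRAME * FRAMES_PER_SECOND


if __name__ == "__main__":
    print(bits_per_set(), bytes_per_frame(), bytes_per_second(), bits_per_second())
```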

    A lot of work needs doing in training the fixed lookups. If anyone has some 'hot' information on training strategies, I would like to hear from them. When I get this prototyped on the PC, I will put up some examples so people can hear what it's like...

    Wonder if there would be a market for this. The decode uses a lot of divides so I will use fixed-point. Still, it's a serious issue since the ARM7TDMI as used in the GBA doesn't have a divide instruction. Has anyone worked out an optimal divide for this processor? The condition codes avoid branching, but it's still going to eat quite a few cycles. I'm assuming the FASTEST divide is <shift, conditional subtract> repeated 32 times (so 64 cycles)? I'm not sure, but I wonder if (at the expense of space) I could use a rather bulky table of logarithms?
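
The shift and conditional-subtract divide described above can be modelled as follows. This is a minimal sketch of the textbook restoring division for unsigned 32-bit values, written in Python to show the logic only; the name `udiv32` is made up here, and the sketch says nothing about actual cycle counts on the ARM7TDMI.

```python
def udiv32(numerator, denominator):
    """Unsigned 32-bit divide by shift and conditional subtract.

    Returns (quotient, remainder). One loop pass per quotient bit,
    which mirrors the 32 repeated steps mentioned in the post.
    """
    if denominator == 0:
        raise ZeroDivisionError("division by zero")
    quotient = 0
    remainder = 0
    for bit in range(31, -1, -1):
        # Shift the next numerator bit into the running remainder.
        remainder = (remainder << 1) | ((numerator >> bit) & 1)
        # Conditional subtract: on ARM this would be a compare plus
        # a conditionally executed subtract, with no branch.
        if remainder >= denominator:
            remainder -= denominator
            quotient |= 1 << bit
    return quotient, remainder


if __name__ == "__main__":
    print(udiv32(1000, 7))
```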

    The LP coefficients could be made smaller using DPCM or VQ, but the papers I read say that this reduces quality an awful lot.
     
  10. Piglet

    Piglet Spirited Member

    Joined:
    May 28, 2008
    Messages:
    175
    Likes Received:
    0
    Advances in CELP and possible video-codec...

    I'm now struggling with the use of a mixed fixed/dynamic codebook. 256+256 seems the optimal split, but re-assigning the dynamic codes is a tricky one. Using data from previous frames is obvious, but maybe a bit could be spared to decide IF an entry should be used. But how to allocate? Simply removing the oldest is the easiest, but not the best. Maybe another 8 bits to decide? Extra bits, yes, but if the overall bit-rate can be lowered?
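
The simplest policy mentioned above, dropping the oldest dynamic entry, could look like the sketch below. The class name `MixedCodebook` and its methods are hypothetical, and the vectors are placeholders. Encoder and decoder would both have to apply the same `remember` calls in the same order so that their tables stay in step.

```python
from collections import deque


class MixedCodebook:
    """Fixed entries first, then a window of recently used dynamic entries."""

    def __init__(self, fixed_vectors, dynamic_size=256):
        self.fixed = list(fixed_vectors)
        # A bounded deque discards the oldest entry once it is full.
        self.dynamic = deque(maxlen=dynamic_size)

    def lookup(self, index):
        """Indices below len(fixed) hit the fixed part, the rest the dynamic part."""
        if index < len(self.fixed):
            return self.fixed[index]
        return self.dynamic[index - len(self.fixed)]

    def remember(self, vector):
        """Add a vector from a previous frame, evicting the oldest if needed."""
        self.dynamic.append(tuple(vector))


if __name__ == "__main__":
    cb = MixedCodebook([(0,), (1,)], dynamic_size=2)
    for v in [(5,), (6,), (7,)]:
        cb.remember(v)
    print(cb.lookup(2), cb.lookup(3))
```

One consequence of this policy is that dynamic indices shift whenever an entry is evicted, which is part of why oldest-first is easy but not necessarily the best choice.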

    Remember, compression time is not really an issue, but the decode needs to be FAST.

    Surely someone else has considered this type of compression for a game?

    I wonder, is CELP patented, or just certain concepts? If I could get this working on an ARM7TDMI @ 16.78MHz (i.e. a GBA CPU) then I imagine it might be a commercial product.

    I'm also considering a video codec based on the MP4 ideas. Has anyone worked on wavelets for video-compression yet?

    Please, any input would be gladly received. I need people to bounce these ideas off...

    Thanks,
    Sean;-)
     
  11. babu

    babu Mamihlapinatapai

    Joined:
    Apr 15, 2005
    Messages:
    2,945
    Likes Received:
    3
  12. Piglet

    Piglet Spirited Member

    Joined:
    May 28, 2008
    Messages:
    175
    Likes Received:
    0
    I'm thinking of using reSPAMSPAMSPAMSPAMSPAMcals for a lot of the data so divides can be swapped for multiplies, which makes for a happier CPU-usage profile...
     
  13. Piglet

    Piglet Spirited Member

    Joined:
    May 28, 2008
    Messages:
    175
    Likes Received:
    0
    [​IMG]

    This is the basic principle. Like I ATTEMPTED to say before, I think maybe I can store the r e c i p r o c a l s and use multiplies rather than divides.
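
The idea of swapping divides for multiplies by stored values can be sketched as below. It assumes 16-bit unsigned operands and a 32-bit scale factor; the names `make_reciprocal_table` and `div_by_reciprocal` are illustrative. On a real target the product would need a 32x32 to 64-bit multiply, and the entry for a divisor of 1 (2^32) does not fit in 32 bits, so that case would need special handling.

```python
SHIFT = 32  # scale factor 2**32, an assumption chosen for 16-bit operands


def make_reciprocal_table(max_divisor):
    """table[d] = ceil(2**SHIFT / d); index 0 is unused."""
    return [0] + [((1 << SHIFT) + d - 1) // d for d in range(1, max_divisor + 1)]


def div_by_reciprocal(x, d, table):
    """Approximate x // d as one multiply and one shift.

    With the ceiling rounding used above, the result matches x // d
    for 0 <= x < 2**16 and 1 <= d < 2**16.
    """
    return (x * table[d]) >> SHIFT


if __name__ == "__main__":
    table = make_reciprocal_table(1023)
    print(div_by_reciprocal(1000, 7, table))
```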
     
    Last edited: Jun 9, 2008
  14. devzone

    devzone Robust Member

    Joined:
    May 20, 2008
    Messages:
    248
    Likes Received:
    1
    admin: why is "c i p r o" marked as spam here ?
     
    Last edited: Jun 9, 2008
  15. Piglet

    Piglet Spirited Member

    Joined:
    May 28, 2008
    Messages:
    175
    Likes Received:
    0
    MIPS for coding & decoding speech codecs

    CPU cycles per second for 8KHz sample rate sound.
    Compression % is compared to 16 bit PCM.

                      encode     decode   compression
    u-law:            42K        40K      50%
    ADPCM:            407K       330K     75%
    GSM:              2.0M       950K     89.7%
    LPC:              2.5M       1.0M     96.3%
    CELP 4.5K:        24-52M*    4.4M     96.5%
    CELP 3.0K:        25-47M*    4.0M     97.7%
    LPC-10:           6.4M       3.5M     98.1%
    CELP 2.3K:        24-45M*    3.8M     98.2%
    OpenLPC 1.8K:     2.9M       1.8M     98.6%
    OpenLPC 1.4K:     2.9M       1.9M     98.9%

    *Note on CELP encoding: CELP uses a codebook
    of 256 speech patterns. The CELP encoding
    performance listed shows figures from a codebook
    search of 32 up to the full 256 entries.
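
The compression column can be reproduced from the bit rates. The sketch below assumes the "K" labels in the table are kbit per second (1 kbit = 1000 bits) and takes the reference as 16-bit PCM at 8 kHz, i.e. 128 kbit per second; the function name is made up for this example.

```python
SAMPLE_RATE_HZ = 8000
PCM_BITS_PER_SAMPLE = 16


def compression_percent(kbit_per_second):
    """Space saved relative to 16-bit PCM at 8 kHz, in percent."""
    pcm_bits_per_second = SAMPLE_RATE_HZ * PCM_BITS_PER_SAMPLE  # 128000
    return 100.0 * (1.0 - kbit_per_second * 1000.0 / pcm_bits_per_second)


if __name__ == "__main__":
    # Rates for u-law (8 bits/sample), ADPCM (4 bits/sample), GSM and CELP.
    for name, rate in [("u-law", 64), ("ADPCM", 32), ("GSM", 13.2), ("CELP 4.5K", 4.5)]:
        print(f"{name}: {compression_percent(rate):.1f}%")
```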

    I intend to use a 512 entry codebook with 256 fixed & 256 dynamic
    entries. That will require a LOT of CPU horsepower to encode but
    as the example shows, 4.4MIPS to decode. I imagine that I can
    speed up the decode somewhat at the expense of a little of the
    compression ratio.

    For the DS, I would like to allow real-time encoding as well. With
    such a low bit-rate, you could have an 8-player game with everyone
    speaking at once (which would be nice). Obviously, the game would
    then have to run on the ARM7. But with 3D fill hardware, what do
    people run out of first: CPU power or draw-cycles?

    Oh, and codebook size improves quality but increases coding MIPS, which is not a problem for simple decompression. I may well go for a 1024 entry codebook. I could organize the real-time encoder to use only the first 128 or 256 entries by placing the most general codes at the beginning of the table. Stochastic learned vectors seem like the way to go, but I need the right start-points for the training. I guess I will have to run a lot of different samples through it and average...
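
Searching only the front of the table could be sketched like this. It uses plain squared error between the target and each entry, which is a simplification: a real CELP encoder compares synthesized output with a gain term. The function name and the `search_limit` parameter are hypothetical.

```python
def nearest_code(target, codebook, search_limit=None):
    """Return the index of the closest entry among the first search_limit entries.

    search_limit=None searches the whole codebook (offline encoder);
    a smaller value trades quality for speed (real-time encoder).
    """
    entries = codebook if search_limit is None else codebook[:search_limit]
    best_index = 0
    best_error = None
    for index, vector in enumerate(entries):
        error = sum((t - v) ** 2 for t, v in zip(target, vector))
        if best_error is None or error < best_error:
            best_index, best_error = index, error
    return best_index


if __name__ == "__main__":
    codebook = [(0, 0), (10, 10), (3, 4), (3, 3)]
    print(nearest_code((3, 3), codebook))                  # full search
    print(nearest_code((3, 3), codebook, search_limit=3))  # front of table only
```

Because the index returned by the limited search is still a valid index into the full table, the decoder does not need to know which limit the encoder used.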
     
    Last edited: Jun 14, 2008
  16. Piglet

    Piglet Spirited Member

    Joined:
    May 28, 2008
    Messages:
    175
    Likes Received:
    0
    Of course, since this is going to be coming from an error-proof source (the ROM), I can remove the Hamming code (15,11) used by CELP. Doesn't save much CPU time, but it reduces the size of the data somewhat...
     
  17. Piglet

    Piglet Spirited Member

    Joined:
    May 28, 2008
    Messages:
    175
    Likes Received:
    0
    I am in the midst of recoding the US standard CELP to work in fixed point. I'm thinking I will use a 1024 entry fixed codebook. The codebook will be generated stochastically. I think that a codebook per speaker might work well. It adds some bulk, but if you are getting speech at 600 bytes per second and you put a LOT into it then it will be a drop in the ocean.

    I hope I'm not reinventing the wheel here...

    Oh, I think I can get rid of a lot of divides (if not all) by replacing them with cross multiplication. Must watch for overflows....
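
One common way to avoid divides by cross multiplication is when only a comparison of two ratios is needed, for example when picking the best candidate by a score of the form numerator/denominator. The sketch below assumes all denominators are positive; the function name is illustrative. Python integers do not overflow, so the overflow concern mentioned above would have to be handled separately in fixed-point code (for example by scaling operands so each product fits the register width).

```python
def best_ratio_index(numerators, denominators):
    """Index of the largest numerators[i] / denominators[i], without dividing.

    Relies on: a/b > c/d  <=>  a*d > c*b  when b > 0 and d > 0.
    """
    best = 0
    for i in range(1, len(numerators)):
        if numerators[i] * denominators[best] > numerators[best] * denominators[i]:
            best = i
    return best


if __name__ == "__main__":
    # Ratios are 0.5, 0.75 and 0.625, so index 1 wins.
    print(best_ratio_index([1, 3, 5], [2, 4, 8]))
```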
     
  18. ASSEMbler

    ASSEMbler Administrator Staff Member

    Joined:
    Mar 13, 2004
    Messages:
    19,394
    Likes Received:
    995
    It's the name of a pill spammers loved to sell here. It sucks, but we used to get so much spam here.
     
  19. Piglet

    Piglet Spirited Member

    Joined:
    May 28, 2008
    Messages:
    175
    Likes Received:
    0
    It's going well....

    Well, I have a fixed-point version of the CELP code working. I settled on a 1024 entry fixed table, so it takes a long time to compress (like 5 seconds per 1 second on a 2GHz PC) but the decode should be fine on a GBA.

    As my first demo, I'm going to sample William S. Burroughs reading his book 'Junky' which comes on 3 CDs. It will use 6MBytes so the extra space will go on some images, spot-FX and a transcript of the book.

    It might be a good way to learn English. If I provide the transcript, would people volunteer to convert it into French & German (a must), but ideally also Dutch, Danish, Spanish, Italian, Russian & so on?

    With a simple Huffman-code, the text will average about 2.5 bits per character.
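
The average code length for a given text can be measured directly. This is a minimal sketch of a per-character (order-0) Huffman code built with the standard library; the function names are made up here, the text is assumed to be non-empty, and the result depends entirely on the text that is fed in.

```python
import heapq
from collections import Counter


def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code over text."""
    freq = Counter(text)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per character.
        return {next(iter(freq)): 1}
    # Heap items: (count, tie-breaker, symbols in this subtree).
    heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        # Every symbol under the merged node gets one bit deeper.
        for sym in s1 + s2:
            lengths[sym] += 1
        heapq.heappush(heap, (c1 + c2, tie, s1 + s2))
        tie += 1
    return lengths


def average_bits_per_char(text):
    """Average Huffman code length per character of a non-empty text."""
    lengths = huffman_code_lengths(text)
    freq = Counter(text)
    return sum(freq[s] * lengths[s] for s in freq) / len(text)


if __name__ == "__main__":
    print(average_bits_per_char("aaaabbc"))
```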
     
  20. babu

    babu Mamihlapinatapai

    Joined:
    Apr 15, 2005
    Messages:
    2,945
    Likes Received:
    3
    Cool.
    You shouldn't worry about the encoding taking a long time; rather that than the decoding in-game taking a long time :)
     