Text compression

Jeremy Leach · Jun 30, 2015

Hi,
I haven't posted here for what seems like a very long time , but I've just added a little document on text compression in miscellaneous projects if anyone is interested. I've knocked it up from thinking it through for my own purposes making a synthesizer controller using PICAXEs and having to cram song names into 256 bytes. The project is ongoing but doing nicely so far and I'll probably add examples of my text compression code at a later stage. It'll be on an 18M2.

Jeremy

eggdweather · Jun 30, 2015

What sort of compression levels are you achieving? I can see how 2:1 would be quite easy to do and 3:1 slightly harder but achievable.

techElder · Jun 30, 2015

That's a very interesting document. Your work here will perhaps allow more verbose messaging in my projects.

Jeremy Leach · Jun 30, 2015

eggdweather said:
What sort of compression levels are you achieving? I can see how 2:1 would be quite easy to do and 3:1 slightly harder but achievable.

Hi, well in theory 3:1 for method2 but haven't actually got to loading it up and trying it yet, but I can't see why it shouldn't give around that. I'll report back in probably a few weeks now.

lbenson · Jun 30, 2015

Note: You might want to rename your project heading from "Test" to "Text".

eggdweather · Jun 30, 2015

Out of interest Morse code (often thought to be a random assignment of dots and dashes) is based on a similar theory of frequency of occurrence of characters e.g. the letters 'e' and 't' are dit and dah respectively and then code length is increased by frequency of occurrence based on Shannon's entropy of the English language, not bad for a 1844 theory!

I recommend you have a look at a Morse decomposition tree, as it may help in the coding of your compression algorithm and Shannon's theory as you should be able to get to ~5:1 using that. For example considering how English words are constructed like my post here, I can see (vowels) that I could codify regular sequences constructs like 'the' search for that sequence then codifying it, then 'then' become nibble + character like 'th-vowel' would enable 'tha..' or 'the..' to be coded. It would need more syntax analysis in your code but it starts to increase compression. Even noise words like 'and' or 'the' or 'are', 'of' etc can be tokenised / codified adding more compression.

Jeremy Leach · Jul 1, 2015

It was 1989 when I last looked at Shannon's theory ! I'll take a read up about Morse, thanks eggdweather. However I expect it'll end up being a trade-off as in a lot of things PICAXE. I need to have snappy response on my display and effectively be able to 'scroll' through a list of songs. So no time to wait for complex decoding of 5:1 compressed strings of text. Only time and experiments will tell though and I'm not at that stage yet. I'm lucky because I do have quite a bit of spare program space on my 18M2 at the moment though and could have quite a large table for the decoding side, so it's definitely tempting.

techElder · Jul 1, 2015

I'm lucky because I do have quite a bit of spare program space on my 18M2 at the moment

Hey, there's always the 40X2 ... jus' sayin'

Jeremy Leach · Jul 1, 2015

Board space has made me go for medium chips Tex.

edmunds · Jul 1, 2015

Jeremy Leach said:
Board space has made me go for medium chips Tex.

In my experience, this surface mount picaxe is not difficult to hand-solder and adding a download circuit to your project is anyway a good idea, since taking the chip in and out of the socket will kill some leg sooner or later.

If you don't need too many inputs and outputs, you could even solder thin wires to the pins you need and epoxy them to the bottom of the chip and then solder the wires to the breadboard. If it is a one-off project. Just sayin'

.

With 40x2 you have 4x4096 bytes of program memory that can store a lot of strings just as they are

. And you have 1024 bytes big scratchpad and 256bytes of eeprom, of course. Plenty of space. You might find interesting ideas in this thread.

Maybe you need no compression at all. While I do find the compressing idea interesting and am very much looking forward for some code on how you propose to implement it. For me that is the tricky part

.

Good luck,

Edmunds

Buzby · Jul 1, 2015

Very impressive results, definately 'pushing the envelope' of the PICAXE.

A further idea might be to try analysing the complete song list to make an array of unique words.
This might be small enough to give another level of worthwhile compression.

mrburnette · Jul 12, 2015

eggdweather said:
Out of interest Morse code (often thought to be a random assignment of dots and dashes) is based on a similar theory of frequency of occurrence of characters e.g. the letters 'e' and 't' are dit and dah respectively and then code length is increased by frequency of occurrence based on Shannon's entropy of the English language, not bad for a 1844 theory!

I recommend you have a look at a Morse decomposition tree, as it may help in the coding of your compression algorithm and Shannon's theory as you should be able to get to ~5:1 using that. <...>

My gut feeling is that an average across a large dataset will yield about 2.5 ... similar to the old DOS version of PKZIP by Phil Katz.

Having worked extensively with the Magic Morse algorithm, the decomposition tree analogy is interesting but I feel that it is not truly applicable here. ZIP is a public domain and well-understood product, originally written in the days when CPU-intensive coding was still a concern.

Ray

Jeremy Leach · Jul 20, 2015

Hi,
After my initial enthusiasm for getting deep into this interesting subject I must say I've had a U-turn and redesigned to use an i2c EEPROM and abandon this compression idea ! Others had suggested it and I must admit that for the relatively tiny amount of PCB 'real estate' needed by an 8 pin EEPROM chip it offers quite a vast ocean of storage (in PICAXE terms). It eventually made sense to me because it allows some good enhancements to my project. That's not to say that other projects couldn't benefit though, especially if space was at a premium.

Text compression

Jeremy Leach

Senior Member

eggdweather

Senior Member

techElder

Well-known member

Jeremy Leach

Senior Member

lbenson

Senior Member

eggdweather

Senior Member

Jeremy Leach

Senior Member

techElder

Well-known member

Jeremy Leach

Senior Member

edmunds

Senior Member

Buzby

Senior Member

mrburnette

Senior Member

Jeremy Leach

Senior Member