Cet assombrissement est soumis au nom de Paris.pm canal assombri.

COMPRESSION MUSINGS

So this year the rules have changed. 7 months of work turned into ashes
because I didn't believe one my fellow Paris.pm obfuscator who told me
that they would...

"512 bytes! And they want me to be creative?!", where my first words.
All my creativity had gone into two other obfuscations, meant to
compete in a 1024 characters category.

"Then I'll compress my beauties, and stuff them with a decompactor in 512
bytes!" Ah. Little did I know about compression algorithms, except for
Huffman, which you must have already judged. So I digged deeper and
found information about this LZW algorithm.

In the end, I decided to submit the compressors/decompressors, rather than
the original obfuscations (but who knows?).

SO WHAT DOES IT DO?

This program compresses the data on STDIN or on the first file on the
command-line using a LZW-like algorithm. If the command-line parameter -z
is given along with an integer, the dictionary indexes will be stored onr
(8 + that integer) bits. For example, if called with -z 4, the dictionary
will hold at most 4096 entries, and every symbol in the compressed file will
be stored on 12 bits. Choose -z carefully, if you want to actually decrease
the size of small files. By default, -z is 4.

COMPRESSING THE COMPRESSOR

Since the compressor was to big for 512 characters, I had to compress it.
Since the decompressor (see my next entry) was too big too, I could not
stuff the decompressor and the compressed compressor in the same file,
with a command-line parameter to decide wether to compress or uncompress.

So most of the compressor code is a single-quoted string, where '$0' was
replaced by '@' and 'pack' by 'Q'. The last substitution puts the code
together again (with the help of the king's comitee-designed horses), and
then it eval'ed.

YES, BUT HOW?

The algorithm is used is described with much clarity in the comp.compression
FAQ, as well as in a Dr Dobbs's Journal article from October, 1989 by Mark
Nelson. See the following URLs for more details:
 * http://www.faqs.org/faqs/compression-faq/
 * http://www.dogma.net/markn/articles/lzw/lzw.htm

If you run perl -MO=Deparse on the result of the substitution, the loop
appears quite clearly.

There are not much tricks here, appart from the compressing s/// (where the
/e switch uses a short (compressed?) if-then-else to decide what was really
matched. OK, there's a goto. But the program behaves a little differently at
the beginning of the compressing loop. So there.

The data is read in larger chunks as the dictionary grows only because I was
desesperate for spare bytes (still trying to stuff compressor and decompressor
in the same obfuscated half-kilobyte file) not to put 2048 or 4096 in each
read().

The output is in $o, which is printed out regularly (but always in multiples
of 8, as s/(.{8})*/ says well). The first output byte holds the value of
command-line parameter -z. This is used by the decompressor to size its
dictionary adequately. The last byte is eventually 0 padded, but this is not
a problem, as you will see in my next entry.

I used 16 bits unsigned shorts in "VAX" (little-endian) order, so that I could
be sure of the numbers that are manipulated by the program.

FAMOUS LAST WORDS

It is only by chance that the string "b$d" appears in this entry: these
entries were written on a Linux workstation (though shielded by an OpenBSD
firewall) and no pun was intended.

AND WHAT DOES IT NOT DO?

There are a few modifications to the original algorithm that I intended
to include, but couldn't because of lack of time (and nasty bugs):
 * Data could be cut on any number of bits, not just 8 (for bytes)
   I think some files are better compress if you look at them as if
   they are composed of groups of 4 bits.
 * The dictionnary indexes would be stored on as little bits as possible:
   if your symbols are 1 byte long (the default), the indexes in
   the dictionnary will be stored on 8 bits for a start (the first
   byte). When the number of items in the dictionnary grows to the next
   power of 2, the indexes are stored on one more bit.
 * The decompressing would have worked the same, it reading symbols 8 bits
   at a time, and grows the symbols size as soon as the indexes become too
   big (it actually very simple to do, since the two dictionnaries are
   constructed in the same manner and order)

Maybe next year... but this time I won't prepare myself too soon.
