Using shared dictionary during compression

time to read 5 min | 852 words

After the process of creating the actual dictionary, it is time for us to actually make use of it, isn’t it?

Again, I’m following the code from FemtoZip here, and I’m going to explain how it actually uses the dictionary for compression. The magic is happening in the PrefixHash class. Let us see the calling code:

var dic = Encoding.UTF8.GetBytes("asonerryson@eterson','.mil'ame':'{'id':','country':'P','email':','country':'");

var text = Encoding.UTF8.GetBytes("{'id':11,'name':'Anna West','country':'Nepal','email':'awest@twinte.gov'}");

var prefixHash = new PrefixHash(dic, true);

var bestMatch = prefixHash.GetBestMatch(0, text);
Console.WriteLine(Encoding.UTF8.GetString(dic, bestMatch.BestMatchIndex, bestMatch.BestMatchLength));

The output of this code is: {‘id’:

How does this work?

When we create the prefixHash, it generate the following table by hashing every 4 bytes and storing the relevant position in them.

hash[  5] =  58;
hash[ 11] =  71;
hash[ 13] =  22;
hash[ 14] =  11;
hash[ 17] =  23;
hash[ 30] =  16;
hash[ 34] =  61;
hash[ 35] =   5;
hash[ 37] =  70;
hash[ 41] =  65;
hash[ 45] =  54;
hash[ 49] =  66;
hash[ 57] =  36;
hash[ 58] =  29;
hash[ 63] =  57;
hash[ 65] =  60;
hash[ 66] =  56;
hash[ 67] =   2;
hash[ 72] =   0;
hash[ 73] =  28;
hash[ 78] =  62;
hash[ 79] =  35;
hash[ 80] =  51;
hash[ 87] =  55;
hash[ 89] =  15;
hash[ 91] =   9;
hash[ 96] =   7;
hash[ 99] =  67;
hash[100] =  52;
hash[105] =  21;
hash[108] =  25;
hash[109] =  69;
hash[110] =  68;
hash[111] =  64;
hash[118] =  17;
hash[120] =   4;
hash[125] =  33;
hash[126] =   3;
hash[127] =  26;
hash[130] =  18;
hash[131] =  31;
hash[132] =  59;

The hash of {‘id (the first 4 bytes) is 125. And as you can see, that maps to position 33. That means that there is a likelihood that there in position 33 there is a the value {‘id. What we do then is check, and continue to run through the code as long as we have a match.

That is how we can figure out that there is a 6 character match starting at position 33. The actual code is more involved, of course, and we need to figure out if there might be another match, elsewhere in the dictionary, that might serve better. Another issue when we need to actually compress is that while we can use the dictionary for compression, it is also actually possible to use the plain text we are compressing as another dictionary as well. Which is what FemtoZip is doing.

Basically, the logic goes like this. Check the current position for an entry in the dictionary, then check if we already had this value in the plain text we have seen so far. Select the largest match, then output that.

Here is actually using it all:

var dic = Encoding.UTF8.GetBytes("asonerryson@eterson','.mil'ame':'{'id':','country':'P','email':','country':'");

var text = Encoding.UTF8.GetBytes("{'id':11,'name':'Anna Nepal','country':'Nepal','email':'awest@twinte.gov'}");

var substringPacker = new SubstringPacker(dic);
substringPacker.Pack(text, new DebugPackerOutput(), Console.Out);

The output is meant to be human readable, and the compressed text is:

<-43,6>11,'n<-60,6>Anna Nepal<-40,13><-18,8><-68,8>awest@twinte.gov'}

Note that we saved a total of 41 characters due to compression (assuming we don’t count the cost of actually encoding this.

Now, what about those references? They look very strange with those negative numbers. The reason those numbers are negative is actually quite simple. They aren’t dictionary entries, like you would think. Instead, they are back references. In other words, the first <-43,6> call is actually saying: go backward 43 bytes, then copy 6 bytes.

But we just started reading the compressed text, where do we go backward to? The answer is that we go backward into the dictionary. So all the references in the text are always relative to our current position. Let us resolve this compressed string one step at a time.

<-43,6> means go back 43 bytes into the dictionary and copy 6 bytes, giving us a string of:

{‘id’:

Then we have “11,n” literal that we append to the string:

{‘id’:11,’n

Now we need to go 60 bytes back (from the current end of the string) and copy 6 bytes giving us:

{‘id’:11,’name’:’

The literal “Anna Nepal” giving us:

{‘id’:11,’name’:’Anna Nepal

Then we have to go 40 characters back, and copy 13 bytes:

{‘id’:11,’name’:’Anna Nepal’,’country’:’

Now this is fun, we have to go 18 chars back, and for the first time, we aren’t hitting the dictionary, we are using the actual string that we uncompressed to generate the rest of the string:

{‘id’:11,’name’:’Anna Nepal’,’country’:’Nepal’,

Another backward reference, 68 steps and copying 8 bytes (again to the dictionary):

{‘id’:11,’name’:’Anna Nepal’,’country’:’Nepal’,’email’:’

The literal awest@twinte.gov’} completes the picture, giving us the full text:

{‘id’:11,’name’:’Anna Nepal’,’country’:’Nepal’,’email’:’awest@twinte.gov’}

And that is how FemtoZip works. And that is pretty neat.

The actual implementation is doing Huffman compression as well, but I’ll touch on that in a later post.