Shared dictionary generation

time to read 3 min | 457 words

As I said, generating a shared dictionary turned out to be a bit more complex than I thought it would be. I hoped to be able to just use a prefix tree and get the highest scoring entries, but that doesn’t fly. I turned to femtozip to see how they do that, and it became both easier and harder at the same time.

They are doing this using a suffix array and LCP. I decided to port this to C# so I can play with this more easily. We start with the following code:

 var dic = new DictionaryOptimizer();

 dic.Add("{'id':1,'name':'Ryan Peterson','country':'Northern Mariana Islands','email':'rpeterson@youspan.mil'");
 dic.Add("{'id':2,'name':'Judith Mason','country':'Puerto Rico','email':'jmason@quatz.com'");
 dic.Add("{'id':3,'name':'Kenneth Berry','country':'Pakistan','email':'kberry@wordtune.mil'");

 var optimize = dic.Optimize(512);

This gives me an initial corpus to work with. And let us dig in and figure out what is going how it works. Note that I use a very small sample to reduce the amount of stuff we have to go through.

The first thing that FemtoZip is doing is to concat all of those entries together and generate a suffix array. A suffix array is all the suffixes from the combined string, and part of it, for the string above, is:

ariana Islands','email':'rpeterson@yousp
ason','country':'Puerto Rico','email':'j
ason@quatz.com'{'id':3,'name':'Kenneth B
atz.com'{'id':3,'name':'Kenneth Berry','
berry@wordtune.mil'
co','email':'jmason@quatz.com'{'id':3,'n
com'{'id':3,'name':'Kenneth Berry','coun
country':'Northern Mariana Islands','ema
country':'Pakistan','email':'kberry@word
country':'Puerto Rico','email':'jmason@q
d':1,'name':'Ryan Peterson','country':'N
d':2,'name':'Judith Mason','country':'Pu
d':3,'name':'Kenneth Berry','country':'P
dith Mason','country':'Puerto Rico','ema
ds','email':'rpeterson@youspan.mil'{'id'
dtune.mil'
e':'Judith Mason','country':'Puerto Rico
e':'Kenneth Berry','country':'Pakistan',
e':'Ryan Peterson','country':'Northern M
e.mil'
email':'jmason@quatz.com'{'id':3,'name':
email':'kberry@wordtune.mil'
email':'rpeterson@youspan.mil'{'id':2,'n
enneth Berry','country':'Pakistan','emai

The idea is to generate all the suffixes from the string, then sort them. Then use LCP (longest common prefix) to see what is the shared prefix between any two consecutive entries.

Together, we can use that to generate a list of all the common substrings. Then we start ranking them by how often they appear. Afterward, it is a matter of selecting the most frequent items that are the largest, so our dictionary entries will show be as useful as possible.

That gives us a list of potential entries:

','country':'
,'country':'
'country':'
','email':'
country':'
,'email':'
ountry':'
,'name':'
'email':'
untry':'
email':'
'name':'
ntry':'
name':'
mail':'
son','country':'
on','country':'
n','country':'
','country':'P
,'country':'P
{'id':
try':'
ame':'
ail':'
'country':'P
country':'P
ountry':'P
untry':'P
ntry':'P
ry':'
me':'
il':'
'id':
try':'P
ry':'P
y':'P
.mil'
y':'
n','
l':'
id':
e':'
eterson
terson
son@
mil'
':'P
erson
rson
erry
ason

One thing you can note here is that there are a lot of repeated strings. country appears in a lot of permutations, so we need to clear this up as well, remove all the entries that are overlapping, and then pack this into a final dictionary.

The dictionary resulting from the code above is:

asonerryson@eterson','.mil'ame':'{'id':','country':'P','email':','country':'

This contains all the repeated strings that have been deemed valuable enough to put into the dictionary.

On my next post, I’ll talk on how to make use of this dictionary to actually handle compression.