In my last post I discussed how to break up undivided text strings into words in order to be able to score trial decryptions. In that post I mentioned blanking out words that have been found in order not to double count letters. I’ll provide an example to clarify: suppose your trial decryption is “HBSTUMBLEDTXERE.” Using the length-based approach I recommended, your algorithm would first find the word STUMBLED when it gets to length 8. What you should do is then convert the test decryption to HB——–TXERE for the next loop. The final score would be 8, or, if you have ERE in your word list, 11. If you leave the trial decryption in the original form, the algorithm would find TUMBLED, STUMBLE, TUMBLE, BLED, and LED as it tested shorter word lengths. The same letters would be counted multiple times and artificially increase the score. What you’re really looking for is the percentage of the decryption that forms words.
I suggested the score should be the number of letters that are found to be in words and that’s a simple way that works. But I believe a test decryption with longer words in it is more likely to be on the right track, and the score should reflect that. So you may wish to give extra points for longer words. Perhaps add L*(L-1) or L*(L-2) where L is the length of the word found would work better. You may want to deduct unmatched letters. But you have balance this against the situation where your bonus rewards inaccurate parsing too much. For example HULALIGHT would parse as HUL ALIGHT. HOLYLIGHT would split after the Y. The second form is more likely right. With simple letter count, the second one scores higher. But what if there’s a bonus for length? HUL ALIGHT might score higher depending on how big a bonus.
Now compare HULALIGHT to HALFLIGHT and to HOLYLIGHT. Without a length bonus these score the same. How do we fix that so the best one scores highest? It’s possible to try alternate ways of parsing the word divisions and then scoring higher for the result that is most common, as I’ll explain below. I’ve found this to be much too time-consuming to be practical in hill-climbing. Another way I do use for cipher types with word divisions such as Aristocrats, Key Phrase, and Tridigital, is to score word pairs together based on the frequency of the pairing. I’ve downloaded 2-grams from the Google N-gram site and processed the files to delete irrelevant data, adjust the frequency numbers to more manageable size and spread, and alphabetize them. I use these files to score the word pairs. You might be tempted just to use the frequency of the individual words rather than the pairs, as it would be faster to look up. That might work with the above example since HALF is more common than HULA or Holy. So is HALF LIGHT the most common combo. But take TWENTY T??. The most frequent combo is TWENTY TWO but TWENTY THE would score higher based on individual word frequency. Tetragram scoring can take care of some of this and for simple substitution types cost in time is rarely worth two-word scoring. But for some very tough ones, like very short Headline puzzle headlines, it’s the only way to crack them, and in a Key Phrase I’ve found it essential.
I mentioned above that trying alternate means of parsing is not practical for hillclimbing. But another application of this technique is not used for solving at all. When I solve a cipher and the result is without word divisions, I like to submit the result in a more readable form for the Solutions Editor. That means I insert spaces at the word boundaries. My program to do this uses four steps. The first step consists of dividing up the text using the technique I described in my first post, but either method will work for step 1. The second step is to score the word pairs from the N-gram data as described in the last paragraph. These two steps are virtually instantaneous. Then the program goes back and checks for common parsing errors I’ve compiled in a list, such as changing “i snow” to “is now.” This is still instantaneous. The final step is to go back through and try moving one and then two letters to the left and the right at every space and testing the resultant word pairs. For example, “butt he” would be tested against “but the” and “bu tthe.” “But the” would outscore the others and be the output form. All this testing and then altering the output is very slow. For one already solved text the forty seconds or so it takes is acceptable but it can’t be used in hillclimbing.