Computer cipher solving – Lesson 3: scoring decryptions

A hillclimbing program needs to know when one decryption is better than another; so do other types of attacks. Suppose your program swaps two letters of the key and decrypts, but the result looks like gibberish to the naked eye, just as the previous decryption did under the old key. Which key do you keep? There are several methods. I mentioned locating the crib in the previous lesson, but the crib and other recognizable text are likely to appear only near the end of a successful hill climb, i.e. as you near the top. A better approach to start with is to use statistics.

There are various statistical measures that can be used. They include the Index of Coincidence (IC), the Digraphic Index of Coincidence (DIC), the Normor score, word list scoring (i.e. counting how many words appear in the trial decryption, or what percentage of the letters fall inside the words that appear), and others. But the most useful measure I have found is n-gram (or n-graph) frequency scoring. An n-gram is a sequence of n adjacent characters, for some small integer n. I have successfully used digram and trigram scoring in the past, especially when I was using Turbo Pascal on a 16-bit machine and the language could not hold a full tetragram array. There were workarounds, but the big breakthrough in efficiency and effectiveness came with the advent of 32-bit (and now 64-bit) machines capable of holding frequency data for tetragrams, i.e. a 27x27x27x27 data structure. I include spaces as well as the 26-letter alphabet in my data, although most others I know in the ACA do not, so their arrays are a bit smaller. I have not found that 5- or 6-gram frequency data performs any better than tetragrams, so that seems to be the sweet spot.
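Of the simpler measures above, the Index of Coincidence is easy to compute: it is the probability that two randomly chosen letters of the text are the same. As a rough sketch (the thresholds in the comment are standard published values, not from this lesson):

```python
from collections import Counter

def index_of_coincidence(text: str) -> float:
    """Probability that two randomly chosen letters of `text` match.
    Typical English plaintext scores near 0.066; uniformly random
    letter sequences score near 0.038."""
    letters = [c for c in text.upper() if c.isalpha()]
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    # Sum over each letter of the ways to pick that letter twice,
    # divided by the ways to pick any two letters.
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))
```

A text of all one letter scores exactly 1.0, while a text with no repeated letters scores 0.0, which brackets the useful range.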

The basic idea is to examine the decryption in overlapping 4-letter (or 4-character) windows, adding points to the score based on the frequency of each tetragram in normal English. For example, if your trial decryption were “frumqxing…” you would look up the score for “frum”, then add the scores of “rumq”, “umqx”, etc. until done. Although these are not words, some of them have a non-zero frequency in English and thus contribute some score. For example, “frum” appears in the phrase “cupofrum” and the word “frumpy.” So how do you know what score to give each tetragram? There are tables of data on the Internet if you look hard enough, but I recommend collecting your own. To translate raw frequency data into points, you will have to decide on your own method. Some people add the logarithms of the frequencies; adding the raw frequencies tends to overweight the most common tetragrams. Since I do my cipher solving almost exclusively on ACA ciphers, I collect the data from my own set of solved ACA ciphers and give each observed tetragram a score of 1 to 9. I don’t have a fixed algorithm for where the breakpoints fall; I just picked ones that seemed to divide the set into useful-sized chunks. Such data is available for other languages, too. Obviously the source of the data should match the type of text you expect, not only in the language used, but in whether it is technical, military, dialogue, etc.
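The two halves of this process, building a frequency table from a training corpus and then scoring a trial decryption with it, can be sketched as follows. This is only an illustration of the sliding-window idea: the log-of-counts conversion is one of the scoring methods mentioned above, not necessarily the 1-to-9 bucket scheme the author uses.

```python
import math
from collections import Counter

def build_table(corpus: str) -> dict[str, float]:
    """Count every overlapping tetragram in a training corpus and
    convert raw counts to log scores, which dampens the weight of
    the most frequent tetragrams."""
    counts = Counter(corpus[i:i + 4] for i in range(len(corpus) - 3))
    return {tg: math.log(c + 1) for tg, c in counts.items()}

def tetragram_score(text: str, table: dict[str, float]) -> float:
    """Sum the score of each overlapping 4-character window of `text`.
    Tetragrams never seen in the corpus contribute nothing."""
    return sum(table.get(text[i:i + 4], 0.0) for i in range(len(text) - 3))
```

In practice the corpus would be large (the author uses a collection of solved ACA ciphers), and the table would include spaces if your corpus keeps them, matching the 27-symbol alphabet described above.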

These methods are not mutually exclusive. When scoring a decryption you can use tetragram frequencies but also add some points if the crib appears, or you can boost the score of the tetragrams that appear in the crib. I’ll discuss use of the crib in scoring more in a later lesson. Use of the IC or DIC in combination with n-gram scoring sometimes is helpful, too.
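Combining measures is straightforward: the flat-bonus variant described above might look like the sketch below, where the bonus value and the table format are illustrative choices, not prescribed by the lesson.

```python
def combined_score(decryption: str, table: dict[str, float],
                   crib: str = "", crib_bonus: float = 50.0) -> float:
    """Tetragram frequency score, plus a flat bonus if the crib
    appears anywhere in the trial decryption. The bonus size is a
    tuning parameter chosen here arbitrarily."""
    score = sum(table.get(decryption[i:i + 4], 0.0)
                for i in range(len(decryption) - 3))
    if crib and crib in decryption:
        score += crib_bonus
    return score
```

The alternative mentioned above, boosting the table entries for tetragrams that occur inside the crib, keeps the reward gradual rather than all-or-nothing, which can guide a hill climb before the full crib has emerged.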
