Using word patterns to aid in solving cryptograms and other ciphers is very common and doesn’t require computers. Pattern words appearing in ciphertext like EDQDQD or VXQGDH might or might not require a dictionary search, but a computer will soon find that BANANA and ROCOCO are the only two common English words matching the first pattern of letters. However, there are thousands of words that match the second word, which has no repeating letters. If they appear together as a phrase, though, it is possible to cross-correlate the two words, using identical ciphertext letters in the two words to reduce that second list. It turns out there are only sixteen two-word combos in my basic word list that match the phrase pattern, and BANANA SUNDAE appears to be the only one that makes sense.
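The cross-correlation step can be sketched in a few lines of Python. This is only an illustration, not the program behind the counts above: the function names `pattern` and `phrase_matches` are made up for the example, and the five-word list is a toy stand-in for a real word list. The trick is to compare the pattern of the whole phrase with spaces removed, so that letters shared between the two words must line up.

```python
from itertools import product

def pattern(text):
    """Letter pattern of a string: first letter -> A, next new letter -> B, etc."""
    seen = {}
    for ch in text:
        # A letter not seen before gets the next unused pattern letter.
        seen.setdefault(ch, chr(ord("A") + len(seen)))
    return "".join(seen[ch] for ch in text)

def phrase_matches(cipher_words, word_list):
    """Word combinations whose joint pattern matches the ciphertext phrase."""
    target = pattern("".join(cipher_words))
    # First narrow each position to words with the right individual pattern.
    candidates = [
        [w for w in word_list if pattern(w) == pattern(c)]
        for c in cipher_words
    ]
    # Then keep only combinations whose shared letters agree across words.
    return [
        combo for combo in product(*candidates)
        if pattern("".join(combo)) == target
    ]

# Toy word list, assumed for illustration only.
words = ["BANANA", "ROCOCO", "SUNDAE", "CIPHER", "SPLINT"]
print(phrase_matches(["EDQDQD", "VXQGDH"], words))
# prints [('BANANA', 'SUNDAE')]
```

With a full word list the first filter leaves thousands of candidates for the second word, and it is the joint-pattern check that cuts the combinations down.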
What about when word divisions are not known? The problem becomes much harder. I decided to try to use patterns in English to aid in solving ciphers where word spacing is unknown. First it is necessary to have a standard format for patterns. I use what I believe is the most common method, which is to assign A to the first letter in a text string and to any other identical letters in the string, then B, C, etc. Thus EDQDQD has the pattern ABCBCB. BANANA and ROCOCO both have that pattern, although I am working with strings, not words. I next chose a string length to use. I experimented and settled on eight letters. I call these 8-grams. Any shorter and there were too few unique common patterns; any longer and there were too many. I then tabulated the frequency of different 8-gram patterns by removing spaces and punctuation from over a dozen books and speeches in English downloaded from Project Gutenberg. The resulting list of unique patterns had 2981 entries. By far the most common was ABCDEFGH, i.e. strings with no repeated letters, which represented about 18% of the total. The next most common was ABCDEFGA, followed by the 8-grams with a single letter repeat separated by four or five letters, such as ABCDEFAG and ABCDEFGC, all of which had a frequency of about 2%.
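The tabulation can be sketched as follows. This is a minimal illustration under my own assumptions, not the original program: the names `letters_only` and `pattern_counts` are invented for the example, and I assume that only the letters A to Z are kept and that every overlapping 8-letter window is counted.

```python
from collections import Counter

def pattern(text):
    """Letter pattern of a string: first letter -> A, next new letter -> B, etc."""
    seen = {}
    for ch in text:
        seen.setdefault(ch, chr(ord("A") + len(seen)))
    return "".join(seen[ch] for ch in text)

def letters_only(text):
    """Uppercase the text and drop everything except A-Z (assumed cleanup rule)."""
    return "".join(ch for ch in text.upper() if "A" <= ch <= "Z")

def pattern_counts(text, n=8):
    """Count the pattern of every overlapping n-gram in the cleaned text."""
    s = letters_only(text)
    return Counter(pattern(s[i:i + n]) for i in range(len(s) - n + 1))

counts = pattern_counts("The quick brown fox")
for pat, count in counts.most_common():
    print(pat, count)
```

Running `most_common()` on the counts from a large body of text gives the ranked list of patterns, most frequent first.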
My first idea was to use these frequencies to aid in solving ciphers where there has been a simple substitution combined with a transposition of some kind. If one could decrypt the transposition using patterns, it would then be a simple matter to solve the resulting intermediate ciphertext as a simple substitution cipher (in American Cryptogram Association or ACA terms, a Patristocrat). The only ACA cipher that uses a combination of simple substitution and transposition is the Bazeries, but it could be done with other types such as combining a columnar or route cipher with simple substitution. To accomplish this I had to devise a measure of how closely normal English adhered to these pattern frequencies and do the same for scrambled English text. I experimented with this and settled on using the 50 most frequent patterns as the basis. I assigned a score to a string of text as follows: for each ciphertext 8-gram in the text, find its pattern, search the list of the fifty most common patterns and, if a match is found, add the rank of the match to a running total. Thus if the pattern is the third most frequent one, add 3. If no match is found in the 50, add the number 50 to the total. Divide the sum by the number of 8-grams in the text (i.e. the length of the text minus seven). The resulting number is the score for that text, so a lower score means the text is dominated by the most common patterns. For convenience I’ll call that the pat8 score. I found that normal English text averages about 25. When I tested 30 plaintext segments derived mostly from BION’s list of 10,000 book excerpts I found the median to be 25.25 and the range was from 19.95 to 37.52. The score did not correlate closely with the text length, but it did with index of coincidence (IC). Generally the higher the IC, the higher the pat8 score. The plaintext that scored 37.52 had an IC over 0.09 (average for English is 0.067).
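Here is a minimal sketch of the scoring rule just described. The function name `pat8_score` and its parameters are my own labels for the example, and `ranked_patterns` is assumed to be a list of patterns ordered from most to least frequent. The two-entry list in the usage lines is a toy stand-in, not the real list of 50.

```python
def pattern(text):
    """Letter pattern of a string: first letter -> A, next new letter -> B, etc."""
    seen = {}
    for ch in text:
        seen.setdefault(ch, chr(ord("A") + len(seen)))
    return "".join(seen[ch] for ch in text)

def letters_only(text):
    """Uppercase the text and drop everything except A-Z (assumed cleanup rule)."""
    return "".join(ch for ch in text.upper() if "A" <= ch <= "Z")

def pat8_score(text, ranked_patterns, n=8, top=50):
    """Average rank of each n-gram pattern; patterns outside the top list count as `top`."""
    rank = {p: i + 1 for i, p in enumerate(ranked_patterns[:top])}
    s = letters_only(text)
    grams = len(s) - n + 1  # length of the text minus seven when n is 8
    if grams <= 0:
        raise ValueError("text is shorter than one n-gram")
    total = sum(rank.get(pattern(s[i:i + n]), top) for i in range(grams))
    return total / grams

# Toy ranking, assumed for illustration only.
toy_ranking = ["ABCDEFGH", "ABCDEFGA"]
print(pat8_score("ABCDEFGHH", toy_ranking))
# prints 25.5: one 8-gram at rank 1, one unlisted 8-gram counted as 50
```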
The next step was to scramble these texts and see if the pat8 score differed. I used random keywords to encipher these 30 texts using columnar transposition. The resulting pat8 scores had a median of 32.69, which is 28% higher than plaintext. This seemed promising for my purposes. There was more variation in the plaintext (standard deviation of 4.04) than in the scrambled text (SD of 2.82). Of the 30 tests, only once did the ciphertext score lower than the original text, and in one case they scored the same. In all the others, the plaintext had a lower score than the ciphertext. I also tried scrambling the plaintexts using the Myszkowski cipher, but there was no apparent difference between the columnar and the Myszkowski. The correlation between IC and pat8 held for scrambled text as well.
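The scrambling step can be sketched like this. It is an illustration under stated assumptions rather than the code used for the 30 tests: I assume an incomplete (unpadded) columnar transposition with repeated key letters taken left to right, and `random_keyword` simply draws random letters, which may differ from how the keywords were actually chosen. Both function names are invented for the example.

```python
import random
import string

def columnar_encipher(plaintext, keyword):
    """Incomplete columnar transposition: write in rows, read columns in key order."""
    ncols = len(keyword)
    # Columns are read in alphabetical order of the key letters, ties left to right.
    order = sorted(range(ncols), key=lambda i: (keyword[i], i))
    # plaintext[c::ncols] is every letter that falls in column c.
    return "".join(plaintext[c::ncols] for c in order)

def random_keyword(length, rng):
    """A keyword of random uppercase letters (assumed keyword source)."""
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(length))

print(columnar_encipher("WEAREDISCOVEREDFLEEATONCE", "ZEBRAS"))
# prints EVLNACDTESEAROFODEECWIREE

rng = random.Random(2024)  # fixed seed so a run can be repeated
key = random_keyword(7, rng)
print(key, columnar_encipher("WEAREDISCOVEREDFLEEATONCE", key))
```

Because a transposition only rearranges letters, the letter counts and therefore the IC are unchanged, so any difference in pat8 score between a plaintext and its scrambled version comes from letter order alone.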
Using this test I tried to solve some columnar ciphers, scoring the trial solutions only with the pat8 score. This was a total failure. Although the transposed text scores were mostly quite a bit higher than plaintext, there were always some keys that produced ciphertext with a lower pat8 score than the plaintext. I had the same result with Myszkowski ciphers. I abandoned the idea of using the test as a solving aid. I invite others to try experimenting with this methodology to see if a useful solving aid based on patterns can be devised. If you would like my list of patterns and their frequencies, contact me in the comments or using the contact form link in the top menu.
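For anyone who wants to experiment, the kind of search described here can be sketched as below. This is my own simplified illustration, not the solver I used: it assumes an incomplete columnar transposition, tries every column order for a given width (only practical for short keys), and takes the scoring function as a parameter, with lower scores treated as better. The names `columnar_decipher` and `best_keys` are invented for the example, and the `toy_score` function is a placeholder where a pattern-based score would go.

```python
from itertools import permutations

def columnar_decipher(ciphertext, order):
    """Undo an incomplete columnar transposition; `order` lists columns as read off."""
    ncols = len(order)
    nrows, extra = divmod(len(ciphertext), ncols)
    cols = [""] * ncols
    pos = 0
    for c in order:
        # The first `extra` columns of the grid hold one more letter than the rest.
        length = nrows + (1 if c < extra else 0)
        cols[c] = ciphertext[pos:pos + length]
        pos += length
    # Read the grid back row by row, skipping cells past the end of short columns.
    return "".join(
        cols[c][r]
        for r in range(nrows + 1)
        for c in range(ncols)
        if r < len(cols[c])
    )

def best_keys(ciphertext, ncols, score, keep=5):
    """Try every column order and return the `keep` lowest-scoring (score, order) pairs."""
    trials = (
        (score(columnar_decipher(ciphertext, order)), order)
        for order in permutations(range(ncols))
    )
    return sorted(trials)[:keep]

# Placeholder score: 0 for the known answer, 1 otherwise (illustration only).
def toy_score(text):
    return 0 if text == "WEAREDISCOVEREDFLEEATONCE" else 1

print(best_keys("EVLNACDTESEAROFODEECWIREE", 6, toy_score, keep=1))
# prints [(0, (4, 2, 1, 3, 5, 0))]
```

The failure I ran into shows up in the last step: with a pattern-based score in place of the placeholder, the true key was not reliably the one with the lowest score.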
Despite this setback, I then hoped that the test might be useful in diagnosing an unknown cipher type. I will discuss my experiments and results in my next post.