Word Length Patterns

For my hobby of cipher solving I decided to try a new technique, one I call word length patterns. The concept is simple: take the pattern of word lengths at the beginning of the cipher and compare it to known sentences with the same pattern to get an idea of the probable words that fit. I restricted it to the beginning of sentences.

To use this method, one must be able to determine where the word breaks are. That is the case for typical cryptograms, i.e. aristocrats in the jargon of the American Cryptogram Association (ACA). However, aristocrats are quite easily solved by other methods, and the ones that are too tough for conventional methods no doubt also have atypical word lengths. There are other cipher types, though, not so easily solved, where word breaks are shown or can be easily determined. These include the Ragbaby, Tridigital, Sequence Tramp, and CONDI.

To use the method to help solve, simply make note of the word lengths for the first three to eight words and write out the numbers in order, separated by hyphens. For example, “The quick brown fox jumps over the…” would produce the indexing sequence 3-5-5-3-5-4-3. Then look up that same pattern (or at least the first three or four numbers) in a reference source to see which set of words most commonly, or most likely, produces it.
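As a rough illustration of writing out the indexing sequence, here is a minimal Python sketch. The function name length_pattern and the choice to count only alphabetic characters (so that trailing punctuation does not inflate a word's length) are my own assumptions, not part of any standard tool.

```python
def length_pattern(text, max_words=8):
    """Return the hyphen-separated word lengths of the first few words.

    Assumption: only alphabetic characters count toward a word's length,
    so punctuation attached to a word is ignored.
    """
    lengths = []
    for token in text.split():
        n = sum(ch.isalpha() for ch in token)
        if n:  # skip tokens with no letters at all
            lengths.append(str(n))
        if len(lengths) == max_words:
            break
    return "-".join(lengths)


print(length_pattern("The quick brown fox jumps over the…"))  # 3-5-5-3-5-4-3
```

For a cipher such as the Tridigital, where the text is digits rather than letters, the letter test would have to be changed to count whatever symbols the cipher uses.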

Clearly, this method requires a reference source that includes sentences similar to the one you are trying to solve. I wrote a program to analyze dozens of books I downloaded from gutenberg.org. These were almost all novels from the 19th and 20th centuries, which limits their usefulness, but they are an easily obtained large block of English sentence data. I processed these books, taking only sentence beginnings and only sentences that had at least four words, to compile a database of patterns. There were just over 141,000 sentences in the data. I provide a link to that file at the end of this post in case you want to download it. When I searched that file for the above pattern, there were 25 instances beginning 3-5-5-3-5. The vast majority began with the word “the” but none continued with the word “quick.” In fact, the 25 second words were all different. In short, there was no clear winner for that pattern. I found this unsurprising: it is an artificial sentence (viz. a pangram) created to test typewriter keyboards, so it would have been quite odd for it, or one much like it, to appear in an old novel.
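The program itself is not reproduced here, but the processing described above can be sketched roughly as follows. Everything in this sketch is an assumption for illustration: the names words_of and build_index, the naive sentence splitter (break after ., ! or ?), and the choice to key each sentence on its first five word lengths. The sample text is made up and is not taken from the Gutenberg data.

```python
import re
from collections import Counter, defaultdict


def words_of(sentence):
    """Lowercased words of a sentence, keeping only alphabetic characters."""
    cleaned = ("".join(ch for ch in tok if ch.isalpha()) for tok in sentence.split())
    return [w.lower() for w in cleaned if w]


def build_index(text, prefix_len=5, min_words=4):
    """Map each length pattern to a Counter of the sentence openings that produce it."""
    index = defaultdict(Counter)
    # Naive sentence split: break on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = words_of(sentence)
        if len(words) < min_words:
            continue  # keep only sentences with at least four words
        opening = words[:prefix_len]
        key = "-".join(str(len(w)) for w in opening)
        index[key][" ".join(opening)] += 1
    return index


# Made-up sample text, only to show the shape of the result.
sample = ("At the foot of the hill stood a house. "
          "At the edge of the wood he stopped. "
          "Do you mean to say so? No.")
index = build_index(sample)
for opening, count in index["2-3-4-2-3"].most_common():
    print(count, opening)
```

A real run over whole novels would need a more careful sentence splitter, since abbreviations such as “Mr.” would be treated as sentence ends by this one.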

I then examined the data to see which patterns were the most common. The most common pattern for the first five words was 2-3-4-2-3. There were 158 sentences that began with this pattern. The most common exact wording was “do you mean to say” at 12 instances, but even more common was the template “at the ???? of the” where the center word could be any of several 4-letter words, such as foot, base, edge, head, gate, etc. This could be useful in some cases, I believe. I then repeated the process, first restricting output to cases where a 5-letter word appeared somewhere in the first five words, and then again requiring a 6-letter word. The results were 2-3-5-2-3 and 2-3-6-2-3, identical to the previous pattern except for the center word. As in the earlier case, the most common words fitting these patterns were either “at the ????? of the” or “in the ????? of the,” especially “in the midst of the” for the 5-letter pattern, and “center” or “centre” as the third word for the 6-letter pattern.
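Spotting a template like “at the ???? of the” amounts to masking one word position and counting what remains. The sketch below shows one way that could be done; the helper name template and the toy list of openings are hypothetical, and the counts are not the real figures from my data.

```python
from collections import Counter


def template(opening, slot):
    """Replace the word at position `slot` with question marks of the same length."""
    words = opening.split()
    words[slot] = "?" * len(words[slot])
    return " ".join(words)


# Toy openings that all share the pattern 2-3-4-2-3 (illustrative only).
toy_openings = [
    "at the foot of the",
    "at the base of the",
    "at the edge of the",
    "do you mean to say",
    "do you mean to say",
]

# Mask the center word (index 2) and count the resulting templates.
counts = Counter(template(o, 2) for o in toy_openings)
for tmpl, n in counts.most_common():
    print(n, tmpl)
```

In this toy list no single “at the … of the” opening occurs more than once, yet the masked template outranks the exact phrase “do you mean to say,” which is the same effect described above.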

I don’t consider this method a success, but neither do I feel it is totally useless. For example, I ran the same processing on my collection of over 6,000 solutions to ACA ciphers and searched for the pattern 3-10-7. There were four sentences beginning with that pattern. Three of the four had “difference between” as the second and third words. Two of those began with “the,” the other with “one.” I believe that if I were to encounter a CONDI, Ragbaby, etc., with this particular pattern, this data would prove useful. Selecting the right reference data is obviously key.

If you want a copy of the Gutenberg novel data, click here.