Computer cipher solving – Lesson 5½ : cribs revisited

I thought it might be useful to expand a bit on the use of cribs. In particular, I’d like to go into more detail on what I called Length scoring back in Lesson 5, hence the 5½ in the title of this post. Here’s the original paragraph on that, reposted for convenience:

Length scoring: I’ve found this to be a quite effective improvement to tetragram scoring, although they can be used together. Like tetragram scoring it has the advantage of not requiring any additional programming on individual ciphertexts, but unlike tetragram scoring, it does use up a bit of extra run time. It solves the problem I just mentioned in the previous paragraph. What I do is run the crib down the decryption and in each spot count the number of letters that are in the same place in both crib and decrypt. In the example above hisbeard and hixbeaqd have six letters in common. I then take the highest-scoring instance for the length of a decryption, 6 in this example. I typically take that number, subtract 3 (assuming it is at least 3),  and square the result, then add that to my score. In this example it would add 9 points (6-3 squared) to the score, the equivalent of a high-scoring tetragram. I use this method mostly on cipher types that have longer cribs. It has a good ability to hold hillclimbers close when they get close. It works well with a wide variety of cipher types, but not as well on transposition types or combination tramp/sub types like Bazeries or Myszkowskis. Those types may have the crib letters in close proximity to each other, but not in the right order, or with an extra letter or two between. I’ve considered writing something that will give extra points for those situations, but I haven’t been industrious enough to do that yet.

I think it’s worthwhile to follow a more typical example than the one I used above. Let’s take AC-1159 in the MA2017 issue, a 6×6 Seriated Playfair. The crib is SELSEWHERETOESTAB. Clearly we can safely extend that to SELSEWHERETOESTABLISH, a crib of length 21. This method works better with longer cribs. Seriated Playfairs are not an ideal type for this method. As long as you have the correct seriation period, a trial decrypt that is getting close to the correct solution will usually have some crib letters in their correct relative positions. However, this cipher type inserts extra X’s to avoid doubles, so the crib and the correct decryption may not line up exactly. Since I happen to know the crib does fit exactly in this case, I will use it. Now the point of this crib method is to identify a trial solution that has a section that “looks like” the crib, i.e. is more like the crib than random chance would dictate, and then boost the score of that trial decrypt by an amount that grows with the degree it departs from random chance (and is thus likely to be generated by the crib).

First we need to establish what random chance would dictate, since we don’t want to boost the score of a trial decrypt that shows some similarity to the crib here and there by chance. Since the index of coincidence in English is around 7%, random chance would dictate that if you compare the crib to any trial decrypt that is close to English in its index of coincidence and letter frequencies, about 7% of the crib letters are going to match the decrypt letters. For this 21-letter crib, that’s about one or two letters (0.07 × 21 ≈ 1.5). Of course this is only an average. Some placements will hit three, four or even more letters by random chance, while in other cases there will be no matches. Bear in mind that, assuming we haven’t placed the crib by other means, we are not testing the crib in just one spot. We are running the crib through the entire trial decrypt and using the highest-scoring spot. We don’t care how well the crib fits lots of different spots, but whether there is one spot where it really strongly seems to fit. Since the length of this con’s ct is 190 characters, that’s 190-21+1, or 170, comparisons. The question thus arises: given random variation, what can we expect the maximum number of letter matches to be by random chance in 170 trials? We need that to establish a baseline number.

There’s no doubt a way to do this using the index of coincidence, lengths, and known probability formulas, but for me it’s easier just to write a program that tests this. My somewhat limited testing indicates that for a ct and crib of this length, random chance will produce a best fit for the crib of 5 or 6 letters even if the crib is totally unrelated to the correct plaintext. So a positive result is really only indicated if your test shows seven or more letters that match the crib in the best spot, and even seven is within the range of normal. The shorter the crib and the shorter the trial decrypt being tested, the smaller that number will be. Since most cribs and ACA cons are shorter than this example, and I don’t have a chart or formula that applies to all crib and ct lengths, I normally use 3 as my baseline. Even though testing shows 6 is probably a better number to use for this con, let’s examine it using my normal 3.
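For illustration, here is a minimal Python sketch of that kind of test. It is not the program behind the numbers above. It assumes the "random" decrypts can be modelled as letters drawn independently from approximate English letter frequencies, which is a simplification, and the names best_fit, random_text and baseline_distribution are made up for this example.

```python
import random

# Approximate English letter frequencies in percent (rounded, illustrative values).
ENGLISH_FREQ = {
    "e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.7, "s": 6.3,
    "h": 6.1, "r": 6.0, "d": 4.3, "l": 4.0, "c": 2.8, "u": 2.8, "m": 2.4,
    "w": 2.4, "f": 2.2, "g": 2.0, "y": 2.0, "p": 1.9, "b": 1.5, "v": 1.0,
    "k": 0.8, "j": 0.15, "x": 0.15, "q": 0.1, "z": 0.07,
}


def best_fit(crib, text):
    """Highest number of same-position letter matches over every crib placement."""
    n = len(crib)
    return max(
        sum(c == t for c, t in zip(crib, text[i:i + n]))
        for i in range(len(text) - n + 1)
    )


def random_text(length, rng):
    """Assumption: unrelated text is modelled as independent frequency-weighted letters."""
    letters = list(ENGLISH_FREQ)
    weights = list(ENGLISH_FREQ.values())
    return "".join(rng.choices(letters, weights=weights, k=length))


def baseline_distribution(crib, text_length, trials=500, seed=1):
    """Count how often each best-fit value occurs against unrelated random text."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(trials):
        score = best_fit(crib, random_text(text_length, rng))
        counts[score] = counts.get(score, 0) + 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # 21-letter crib run through random 190-letter texts, as in the example above.
    for score, count in baseline_distribution("selsewheretoestablish", 190).items():
        print(score, count)
```

The printed table shows how many of the random texts produced each best-fit value. Whichever value dominates is a reasonable candidate for the baseline at that crib and ct length, and rerunning with other lengths gives a rough chart for shorter cons.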

The way I use this to score a trial decrypt is with a routine called CribFit that runs the crib along the trial decrypt and, in each possible spot, counts the letters where crib and decrypt match. I then take the maximum count for that decrypt. Let’s assume it comes from this placement: “qelmnuharptorrtingise”, which produces 9 letter matches with the crib.

selsewheretoestablish
qelmnuharptorrtingise
-xx---x-x-xx--x---xx-

Subtract the baseline number of 3, and square the difference. Here 9-3=6, so that would add 6×6 or 36 points to my score, a significant enough number to influence the hill-climbing function. So even though “qelmnuharptorrtingise” does not look to the eye like a good crib fit, the computer recognizes it as one. If 10 letters matched, the score would increase by 49 points, and 11 would produce 64. As you can see, the decrypt score really starts to climb as more of a long crib appears in the decrypt.
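The scoring just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual CribFit routine: the names crib_fit and crib_bonus, the returned offset and match pattern, and the choice to keep the earliest placement on ties are all made up for this example.

```python
def crib_fit(crib, decrypt):
    """Slide the crib along the decrypt and return (matches, offset, pattern)
    for the placement with the most same-position letter matches.
    Ties keep the earliest placement (an arbitrary choice for this sketch)."""
    n = len(crib)
    best = (0, 0, "-" * n)
    for offset in range(len(decrypt) - n + 1):
        window = decrypt[offset:offset + n]
        # 'x' marks a position where crib and decrypt agree, '-' where they differ.
        pattern = "".join("x" if c == d else "-" for c, d in zip(crib, window))
        count = pattern.count("x")
        if count > best[0]:
            best = (count, offset, pattern)
    return best


def crib_bonus(matches, baseline=3):
    """Square of the matches above the baseline; nothing at or below it."""
    return (matches - baseline) ** 2 if matches > baseline else 0


if __name__ == "__main__":
    count, offset, pattern = crib_fit("selsewheretoestablish", "qelmnuharptorrtingise")
    print(pattern, count, crib_bonus(count))  # -xx---x-x-xx--x---xx- 9 36
```

In a hillclimber the bonus would simply be added to whatever tetragram score the trial decrypt already has, and the baseline argument can be raised to something like 6 for a long crib and ct such as this one.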
