Cipher analysis – the Condex

A few years ago I developed a statistical test called the Normor test that measures how closely the letter frequencies of a cipher resemble normal English letter frequencies. This turned out to be quite useful as a diagnostic tool for identifying the type of an unknown cipher. One shortcoming of that test is that it does not distinguish between transposition types, since they all have the same Normor score as the plaintext. It occurred to me that something similar could be devised that measures contact data, that is, which letters appear immediately before and after each letter, to see how closely a cipher’s contacts resemble normal contacts. This might be useful in distinguishing between transposition types or even other types.

First I had to write a program that tabulated contact data in a usable form. This proved to be a bit of a programming challenge for me, but I succeeded in writing a program that put the data in a form similar to the chart appearing on page 220 of Elementary Cryptanalysis (Elcy) by Helen Fouché Gaines. I used the program to produce the following chart from dozens of novels, speeches, and other English-language materials downloaded from Gutenberg.org.

MRSHE A NTRLS
DSOEA B EOLUA
SNAEI C OAHET
LIAEN D EIOAT
VLTRH E RSNAD
NSIEO F OITRA
UEAIN G EHAOR
GWSCT H EAIOT
SHLRT I NSTCL
YTSEN J UOAEI
ONRAC K EISNA
OIELA L EILAO
UIAEO M EAOIP
UEOAI N TDGEO
HSNRT O NRUFM
MAOSE P EORAL
ANISE Q UAIEH
TUAOE R EOIAS
URAIE S TAEIO
EIANS T HOIEA
RTBSO U SRTNL
ROIAE V EIAOY
DSTEO W HIAEO
SAOIE X TPIEA
ETARL Y OSTAI
OEZAI Z EAIZO

This differs slightly from the Elcy chart in that I limit the contacts to five on each side, but it should be more representative since it is based on much more text. Use the central letter and look outward to see the letters that most frequently immediately precede (left side) or follow (right side) that central letter. For example, the letter that most often contacts Y on the left is L. The second most frequent one is R. Similarly, on the right the most frequent contact is O, then S.

I use this table as my normal English standard. I then ran the program on some sample ciphers. Since they are typically too short to fill both sides of the table, I pad the empty positions with periods. Here’s a columnar cipher and the resulting table:

srwhogteratwiabrndhgpiainishewslalleuniiobysonoooteiftaosslhnaietnesemtnkmfosutiaetasthoihsrtitafuhrenoeeteegfesooshahttrenpdtlvhidurrbsnossnoeseqarebdmgssmetef

nlhti a ibefh
.roea b drsy.
..... c .....
.pnib d hmtu.
soert e tsenb
migea f eotu.
.omhe g fpst.
lidas h oaegi
hetna i adefh
..... j .....
....n k m....
.tlas l aehlv
.sked m efgt.
ihtse n oiade
ashon o soebg
...ng p di...
....e q a....
hebas r eabnr
baseo s sehln
gfdae t eainh
.sfed u hnrt.
....l v h....
..tre w his..
..... x .....
....b y s....
..... z .....
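
A chart like the one above can be tabulated in a few lines. The following is only a minimal Python sketch, not my actual program: the names `contact_chart` and `format_row` are made up for illustration, and it assumes that non-letters are dropped, that the text does not wrap around from the last letter to the first, and that ties are broken alphabetically from the inside out. Under those assumptions it reproduces rows of the table above, such as the B and W rows.

```python
from collections import Counter
import string

CIPHER = (
    "srwhogteratwiabrndhgpiainishewslalleuniiobysonoooteiftaos"
    "slhnaietnesemtnkmfosutiaetasthoihsrtitafuhrenoeeteegfesoo"
    "shahttrenpdtlvhidurrbsnossnoeseqarebdmgssmetef"
)

def contact_chart(text, width=5):
    """Return {letter: (left, right)} for a-z.

    left and right hold up to `width` of the most frequent preceding and
    following letters, most frequent first (i.e. inside-out order).
    """
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    before = {c: Counter() for c in string.ascii_lowercase}
    after = {c: Counter() for c in string.ascii_lowercase}
    # Assumption: no wrap-around, so the first letter has no left contact
    # and the last letter has no right contact.
    for prev, nxt in zip(letters, letters[1:]):
        after[prev][nxt] += 1
        before[nxt][prev] += 1

    def top(counter):
        # Highest count first; ties broken alphabetically.
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        return "".join(ch for ch, _ in ranked[:width])

    return {c: (top(before[c]), top(after[c])) for c in string.ascii_lowercase}

def format_row(letter, left, right, width=5):
    """Lay out one row as in the charts: left side reversed, period padding."""
    return f"{left[::-1].rjust(width, '.')} {letter} {right.ljust(width, '.')}"

if __name__ == "__main__":
    chart = contact_chart(CIPHER)
    for letter in string.ascii_lowercase:
        print(format_row(letter, *chart[letter]))
```

Storing each side in inside-out order keeps "position 1" at index 0 on both sides, which makes any later position comparison simpler.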

When two letters share the same frequency they are listed in alphabetical order from the inside out. This contact chart could be useful in solving many ciphers, such as cryptograms, by hand, but my aim was to measure how closely this set of contacts matches the standard above. After some experimentation I found the best way to do this was to go row by row through the cipher’s chart. For each letter that appears to the left or right of the central letter, I take the difference between its position in the cipher’s chart and its position on the same side of the same row in the normal chart, and keep a running total. A letter that doesn’t appear on that side of the normal row adds 5.

For example, for row B, the letter A is the most frequent left contact in both charts, so the difference in positions is 0. On the right side, D is the most frequent contact in the cipher but doesn’t appear in the normal chart, so it adds 5. For the K row, N is in position 1 on the left in the cipher but in position 4 in the normal chart, so the difference of 3 is added.

When all 26 rows are totaled, I divide by the total number of letters appearing on the left and right sides of the cipher’s chart (ignoring periods) to arrive at an average position difference. I call this number the Condex, for Contact Index. If the cipher contacts exactly matched the normal chart, the total (and average) would be 0. If none of them appeared at all in the normal chart, the average would be 5. In short, the higher the score, the less normal.
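
The scoring just described can be sketched as follows. This is a hedged illustration in Python rather than my Windows program: `parse_chart` and `condex` are hypothetical names, the absolute value of the position difference is an assumption on my part, and only the B and K rows from the two charts above are used so the arithmetic can be checked by hand.

```python
def parse_chart(rows):
    """Parse rows like 'DSOEA B EOLUA' into {letter: (left, right)}.

    Both sides are stored inside-out (most frequent contact first),
    lowercased, with the period padding removed.
    """
    chart = {}
    for row in rows.strip().splitlines():
        left, letter, right = row.split()
        chart[letter.lower()] = (
            left[::-1].replace(".", "").lower(),
            right.replace(".", "").lower(),
        )
    return chart

def condex(cipher, normal, miss_penalty=5):
    """Average position difference between a cipher chart and the normal chart."""
    total = 0
    count = 0
    for letter, sides in cipher.items():
        normal_sides = normal.get(letter, ("", ""))
        # Compare left with left and right with right.
        for contacts, normal_contacts in zip(sides, normal_sides):
            for pos, ch in enumerate(contacts):
                normal_pos = normal_contacts.find(ch)
                if normal_pos < 0:
                    total += miss_penalty          # not in the normal row
                else:
                    total += abs(pos - normal_pos)  # assumed: absolute difference
                count += 1
    return total / count if count else 0.0

if __name__ == "__main__":
    normal = parse_chart("DSOEA B EOLUA\nONRAC K EISNA")
    cipher = parse_chart(".roea b drsy.\n....n k m....")
    # Row B: left a, e, o match (0 each), r is missing (5); right d, r, s, y
    # are all missing (20). Row K: left n is 1 vs 4 (3); right m is missing (5).
    print(condex(cipher, normal))  # (25 + 8) / 10 = 3.3
```

A full score would pass all 26 rows of each chart to `parse_chart`; comparing the normal chart with itself gives 0, as expected.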

I found that English plaintext averaged in the low 2 range, i.e. 2.0 to 2.25. I tested paragraphs of some novels and the highest average score was 2.487, with a single high of 3.06 and a low of 1.74. My file of ACA solutions averaged higher, 2.79, but bear in mind that it contains very non-standard constructions like the Patristocrat specials and Playfair solutions with X’s between the doubled letters. When I tested several transposition cipher types (hundreds to thousands of each type) I found they averaged in the mid- to high 3’s. In order from low to high they were Amsco, Myszkowski, Columnar, and Swagman. The average scores and ranges of the latter three were nearly identical, but the Amsco was noticeably lower, which makes sense since the typical Amsco ciphertext consists of about 2/3 normal digraphs. It averaged 3.45. Amscos were the only ciphers I tested that had scores below 3, going as low as 2.8. The lowest among the others was one Swagman con at 3.15. Thus the Condex could be helpful in identifying an unknown Amsco. However, I must note that there are other, easier ways to do that, such as counting common digraphs.

For non-transposition types the scores were much higher: the averages, the maximums, and the minimums alike. I tested the following types: Bifid, Two-Square, Foursquare, Fractionated Morse, Quagmire, Bazeries, and Vigenère. I used Bion’s 2-square/4-square data for those types and generated my own for the others. The differences in ranges of scores were so slight as to be meaningless. The averages ranged from 4.12 to 4.29. The Two-Square had the biggest variation, and some of its lowest scores dipped down into the mid-3’s. The Condex might be useful in distinguishing between transposition and substitution or fractionation types, but that, too, is more easily and accurately done with the Normor or other tests.

The algorithm is too computation-heavy to be used in any iterative solving process like hill-climbing, and I don’t see how it would help there anyway. Although I don’t see any future for the Condex as a type diagnostic tool, the tool is at least useful for some hand-solving and might prove useful for tabulating data for foreign languages. Anyone who wants to experiment with it can contact me and I’ll provide my Windows executable program. There’s a contact link in the top menu.

These results are valid only for text lengths in the typical range for ACA ciphers. I used a minimum of 100 letters for my testing; below that, the scores become almost random, even for plaintext. The maximum length was probably around 300 letters. For very large samples of English, on the other hand, the score will drop virtually to zero.
