# Mathematical formula needed

Today’s blog post is not about an unsolved cryptogram. Instead, I need the help from my readers to solve a mathematical problem. The solution might be helpful to break encryptions.

Breaking a simple cipher usually works best if one counts the letter frequencies or guesses words. Readers of this blog have used these techniques hundreds of times to solve encrypted postcards, diaries, letters and other texts.

### Letter contacts

However, there are cases, in which frequency analysis and word guessing alone don’t work. For example, the cigaret case cryptogram

… looks like a simple cipher (with a simple cipher I mean monoalphabetic substitution, also known as MASC), but all efforts to break it have failed. The same is true for the seal cryptogram.

Of course, it is possible that these two cryptograms have withstood all breaking attempts because they are no MASCs at all. Nevertheless, I wouldn’t reject the MASC hypothesis before a few additional examinations have been made.

As the most common codebreaking methods don’t work for these two cryptograms, it makes sense to look out for less common ones. The book Cryptanalysis (1939) by Helen Fouché Gaines introduces several of these.

Two of these tools are based on so-called “contacts”. The contacts of a letter in a certain text are the letters that are located directly before or after it appears (without a space in between).

If we look at the following message …

`RED AND GOLD ARE ROYAL COLOURS`

… the letter A has the following contacts: N, R, Y, L. The contacts of D are E and L. E has the contacts R and D (duplicates are not counted).

The number of contacts (without counting duplicates) a letter has in a certain text is named the variety of contacts. In the example above A has a variety of contacts of 4. The veariety of contacts of E is 2.

In an ordinary English text (the same holds for all other common languages) some letters are known to have a low variety of contacts. For instance, the Q is especially often contacted by the U. Other letters tend to have a greater variety of contacts. Usually, vowels have a higher variety of contacts than consonants.

As Fouché Gaines writes, in an (English) text of about 100 letters, most vowels “step up”. This means that the variety of contacts is higher than the number of appearances (in our example message the A steps up, as it appears three times and has four contacts). Consonants usually step down in a text of this length.

All these facts about contacts can be very helpful for breaking difficult MASCs. I can therefore only recommend reading the respective chapters (IX and X) of Fouché Gaines’ book. It is available online.

### Formula needed

What Fouché Gaines does not describe is a measure for the contact variety. Counting the contacts alone is not enough, as this number is dependent from the length of the text. A number like contacts per 100 letters is useless, as the number of contacts is not proportional to the text length.

The information that a letter steps up in a cryptogram of 100 letters is worthless for a longer text (in a very long textevery letter steps down, as the number of appearances grows linearly with the text length, while the variety of contacts never exceeds 26).

So, what we need is a formula for the contact variety of a letter in a certain text. The result of this formula must be (statistically) independent from the text length. It must return a high value, if the variety of contacts is high, otherwise it must return a low value.

Once we have such a formula, we can apply it in different ways:

• We can calculate the contact variety measure for all letters in a typical English text.
• We can calculate the contact variety measure of left and right contacts in a typical English text separately.
• We can calculate the contact variety measures for texts in other languages.

All these values can be compared with the variety measures of an encrypted text. This might help to break it.

How can a formula for the contact variety look like? Any suggestions are welcome.

Further reading: Bigram substitution: An old and simple encryption algorithm that is hard to break

## Kommentare (28)

1. #1 Dampier
22. Dezember 2017

Very well explained, thank you. Sounds exciting to me.
Sorry I can’t provide a formula 😉

2. #2 Dampier
22. Dezember 2017

Wouldn’t it be more an algorithm than a formula in the end?

3. #3 Klaus Schmeh
22. Dezember 2017

@Dampier:
>Wouldn’t it be more an algorithm
>than a formula in the end?
Both is possible. The formula or the algorithm should return a number that rates the contact variety of a certain letter.

4. #4 Thomas
22. Dezember 2017

If the ciphertext is shorter than the unicity distance (which is 28 for a MASC) even the contact analysis won’t work. The general problem is that statistical methods aren’t reliable if there isn’t enough text to work on.

5. #5 David Wilson
22. Dezember 2017

Shouldn’t it be linear?

For example, there are 13 E letters in 100 plaintext letters. They should still have 26 letters next to them (not necessarily unique).

There should be 26 E letters in 200 plaintext letters (twice as many). Shouldn’t they have 52 letters next to them?

The problem is that you’re trying to find unique letters, which is not a linear formula.

6. #6 Thomas
22. Dezember 2017

@David Wilson
The variety of contacts cannot exceed 25.

7. #7 Klaus Schmeh
22. Dezember 2017

@David Wilson
>Shouldn’t they have 52 letters next to them?
This is correct if we count the duplicates. However, this doesn’t help us, as every letter in a text has about the same number of neighbors if we include duplicates. It’s not the total number of contacts, but the variety that counts.

8. #8 TWO
22. Dezember 2017

Very impressive Klaus.

But the creator of such a text can anticipate your method and artificially poison the entropy to mislead you.

In my humble opinion you need to include the formula for average word length (1) too.

How many words are there on average in a English text that is 100 letters long?

(1) and in longer text you have to include average sentence , paragraph, chapter length too.

9. #9 Klaus Schmeh
22. Dezember 2017

@TWO
>But the creator of such a text can anticipate
>your method and artificially poison the entropy
This is correct, but I don’t think that the creators of the cigaret case cryptogram or the seal cryptogram thought of this.

10. #10 David Oranchak
http://zodiackillerciphers.com
22. Dezember 2017

1) Count up unique contacts per letter
2) Add up all those counts
3) Divide the sum by total length of the text

Then you would have an average contact variety per letter.

Or, maybe it is better to take the sum, divide by the alphabet size, THEN divide by text length.

11. #11 David Wilson
22. Dezember 2017

Perhaps instead of contacts, what about digraphs? Isn’t that relatively the same concept?

12. #12 Kai
22. Dezember 2017

Is it not possible to solve such MASCs with plain computation power? At least if there are whitespaces it should be easy: For each crypto word you insert a list of all possible english word explanations with same length. Then you connect two explanations if they are compatible (explain the same letters). Now you search for a maximum number of words that are compatible to each other. Should be easy for a modern computer.

13. #13 Klaus Schmeh
22. Dezember 2017

@Kai:
Sounds like an interesting method. I have never seen it implemented anywhere.

14. #14 Jim
23. Dezember 2017

@Kai That’s how I approach the short MASCs at http://contestcen.com/crypt.htm . There’s often a tradeoff between the number of words that can match a pattern versus including the obscure words the puzzle-setters select.

15. #15 David Oranchak
http://zodiackillerciphers.com
23. Dezember 2017

@Kai and @Klaus:

The method @Kai suggests has a good implementation in Edwin Olson’s paper, “Robust dictionary attack of short simple substitution ciphers”.

http://ai2-s2-pdfs.s3.amazonaws.com/0e0c/c38b65009dfe32b0bdc04a6c34a600c3fe33.pdf

There’s an online implementation here:

https://quipqiup.com/

Seems to work fine with English language ciphers as long as there aren’t many misspellings and other oddities. Would be nice to expand the implementation to include other languages.

Richard Heathfield made a really nice interactive command-line tool which has a similar idea:

16. #16 George Lasry
23. Dezember 2017

Measuring variety of contacts – per each letter – could be done in ways that are similar to measuring the ‘variety’ of monograms, that is, either:
– index of coincidence of the letters following (or preceding) the given letter.
– entropy of the letters following (or preceding) the given letter

This should be done on a corpus of the language for reference, as well as on the cryptogram.

Then the numbers can be compared for guessing letters assignments, same as is done with simple monogram frequncies, to start solving the cryptogram. But this would be useful only for initial guesses. To make further progress, other methods are required. For example, using the cross-entropy of the variety of contacts (which is equivalent to the cross entropy of conditional bigrams)

17. #17 George Lasry
23. Dezember 2017

both the index of coincindence and cross entropy measures are not dependent on the cryptogram length, although their variance (error range) siginificantly increases with shorter cryptograms, for which they are less usefully. More robust methods such as trigrams or even pentagrams would be more useful for shorter cryptograms.

18. #18 Thomas
23. Dezember 2017

@George Lasry
But don’t you think that that statistical methods like a variety of contact analysis will reach their limits if the ciphertext is very short? I think in this case, if any, only a pattern based approach could yield satisfactory results.

19. #19 Rossignol
Paris, France
23. Dezember 2017

If the crypto is very short, we can try to insert spaces to delimit words and
launch a dictionary attack for each case.

For a crypto of say 25 characters, there are
24 ways to insert one space (2 words)
276 ways to insert 2 spaces (3 words)
2024 ways to insert 3 spaces(4 words)
10626 ways to insert 4 spaces(5 words)

This method is time consuming.

But its main drawback is that it gives a lot of solutions and finding the
good one is like looking for a needle in a haystack.

20. #20 Tobi
23. Dezember 2017

I believe the best algorythm would be:

1. Get the number of times every other letter appears as contact letter to another (so you know for example that D is preceded twice by N or something like that)

2. Divide the contact frequency of each letter to each letter by the number of times the contacted letter appears times 2.

That should give you a definitive footprint for every letter about hiw likely any other letter is to appear as it’s contact letter.

21. #21 Tobi
23. Dezember 2017

Though it is probably better to handle pre-contacts and post-contacts differently.

22. #22 George Lasry
23. Dezember 2017

@thomas
Indeed. Not only it is meaningless for short cryptograms, but its value is limited in only providing initial guesses, unless you have a VERY long cryptogram, say, several thousand.

@tobi
Agree, you would definitely get more information from both, vs. from just pre or post.

23. #23 Klaus Schmeh
23. Dezember 2017

The contact-based method described by Fouché Gaines (it used so-called consonant lines) works suprisingly well. I used it to analyse a 87 letter cryptogram. It successfully separated vowels from consonants and correctly identified the letters H and N (based on contact information only). In my opinion, the consonant line methods is an interesting alternative to other MASC breaking tools. Applying it on difficult cases (like the Dorabella Cryptogram) makes sense.

24. #24 Thomas Ernst
Pittsburgh
24. Dezember 2017

If it hadn’t been for the “Dorabella”-mention, I wouldn’t comment at all. ‘Cuz I don’t know much about algorithms. But to anyone out there still trying to solve the “Dorabella”: it’s a fake, fraud – what-have-you. It is a make-belief-cipher. – Life is too short. Don’t waste it on a) the VMS, b) the “Dorabella”. And have a Merry Christmas without either!

25. #25 Elmar Vogt
24. Dezember 2017

I don’t really understand the benefits of the “contact method”. I’d presume that, like frequency analysis, the shorter the cryptogram gets, the more “blurry” and “degenerate” the contact results would be — after all, they relay on frequency counts as well, so they should be of limited use for short messages.
Wouldn’t it be more promising to introduce “hard” criteria into the equation (like “q” is always followed by “u”, “c” is almost always followed by “h” or “k” in German), since these do *not* degrade for short cryptograms?
Or using additional frequency aids (word-initial and word-terminal frequencies for letters, as opposed to mid-word frequencies, letter pair frequencies, etc.)?

26. #26 Klaus Schmeh
24. Dezember 2017

@Elmar:
>I don’t really understand the benefits
>of the “contact method”.
As I said, the contact-based method described by Fouché Gaines works surprisingly well for a 87 letter message.

>Wouldn’t it be more promising
>to introduce “hard” criteria into the equation
The method in question uses two facts about the English language:
– The letter N is rarely contacted by a consonant on the left. If it is, it’s usually an R (e.g. CORN).
– The letter H is rarely contacted by a consonant on the right. If it is, it’s usually an R (e.g. THREE).
In a 87 letter text it is definitely possible to identify the N and the H this way.

>word-initial and word-terminal frequencies
Word initial and word terminal frequencies are certainly helpful if the spaces between the words are visible. However, this is not always the case (e.g., for the Dorabella Cryptogram).

27. #27 Klaus Schmeh
24. Dezember 2017