# Solve this bigram challenge and set a new world record

Blog reader Norbert Biermann has recently solved a bigram substitution ciphertext consisting of 1346 letters – the shortest one ever broken. Here’s a 1000-letter ciphertext of the same kind.

The bigram substitution is a manual encryption method with a history of over 500 years. A bigram (also known as a digraph) is a pair of letters, such as CG, HE, JS or QW. The number of bigrams in the Latin alphabet is 26×26=676, ranging from AA to ZZ. A bigram substitution replaces each letter pair with another one (or with a symbol or with a number between 1 and 676). In order to use a bigram substitution, we need a substitution table with 676 entries.

### Porta’s bigram substitution

The oldest bigram substitution I am aware of is described in the book De Furtivis Literarum Notis written by 16th century cryptologist Giambattista della Porta. Porta uses a 20 letter alphabet. He therefore needs a substitution table with 400 entries. Here it is:

As can be seen, Porta substituted each letter pair with a symbol. He had to be quite inventive to come up with 400 different symbols. For instance, the bigram IA is replaced with a symbol that looks like an X. The bigram VO is substituted with something resembling an O.

### Vigenère’s bigram substitution

Blaise de Vigenère invented a bigram substitution, too. Here’s his table:

Vigenère replaces each bigram with a single letter or a letter followed by a dot, colon or semicolon. E.g., LM is substituted with “r.”.

### RSHA bigram substitution

The following bigram substitution, which is described in David Kahn’s book The Codebreakers, was used by the Nazi authority Reichssicherheitshauptamt (RSHA):

### Two challenges

As far as I can tell, hill cimbing is the best approach to attack a bigram substitution. The best way to implement the fitness function of a bigram hill climber is probably to use hexagram (letter six-tuple) frequencies or some similar means.

However, hill climbing will only be successful if there is enough material to analyze, i.e., if the ciphertext is long enough. But how much ciphertext is necessary to break a bigram substitution? Not much has been published about this question in the literature. Three years ago, I decided to go a first step in finding the answer. For this purpose, I took two messages – one with 2500 and one with 5000 letters – and encrypted them. Subsequently, I published the two ciphertexts as challenges on my blog.

As usual, my readers solved both challenges within a few days. However, this time things proved a little more difficult than in most cases. Blog reader Norbert Biermann found the solution of the 5000 letters version – still with a few mistakes – using hill climbing. Thomas Ernst published a few interesting word pattern considerations. Then Norbert provided a second, more sophisticated hillclimbing result, which was almost error-free. Finally, Armin Krauß published the correct solution.

After the solution of the 5000 letter challenge had proven quite difficult, I expected that the 2500 letter ciphertext would not be solved so soon. However, I was wrong. Only a few days later, Norbert Biermann published the correct solution of the 2500 letter challenge, which he again had found with his hill climber. To my knowledge, this success represented the world record in breaking bigram substitutions.

### The bigram 1346 challenge

Two years after Norbert’s record, I published another bigram challenge on this blog. This time, I took an English text constisting of 1346 letters as plaintext, calling it Bigram 1346 challenge. Contrary to last time, I didn’t replace bigrams with numbers but with other bigrams. For this reason, the ciphertext consisted of letters, which had to be read pair-wise. Here’s the challenge:

UNGOZIHIJGSLGVWPIVGJSOKEFMAHSDBDGLUBUNZIWPIEBIUNKFVOUN
BDSLPPHELVAQBAHEBIFJMHKVFLHXQQEFSLQQBDAQRIBVBIBYGJMOSOZ
BSDUXZINXUNEQVKUGYHUNVOWPSGSMGEFLFKRUHELFPHGVUXFGHRJ
FPNXUNTUUKKIZBKFABEQNHRFWLKCYHDJHJOPBZRLAHQVFTHETGRQRJ
TDAYDTXTVDBDKFEFZKSDHETUFVIQBIYABDEXZIKCHXRUKQRLGECJAQA
OBKZIOBTEFMFRZNZACLWDWAUNEBBISLMREQKWRJRCUGHERGJMXON
WGJHIPBEYGDZOHXIXKFOXFLKVRUDWAOBIDLSRRICSICJGKFZBBUMRFM
GQKFYBXOHETGHEOAMUEMWLAYRJWPKIGXUOSKZIHIJGSLHIGIBLFUXX
UKUQPHGEGWHIOPZIBDLVBUKQRJSMUGFPWLWPSGPHFIKVYXCJLVULK
VQSZIBDLVBUUISRRUGJAIYXXGLKQXFRPBUOJIBTGHGDTGRCICUNVOWP
SGDTLIEMTAVOUNVUQQKCRTLHBDAIUXFGOAKPBTKVFLAHVWRHWAUG
KCXGUNFGBIXAGJHEHXGLBIRNGDOEBPUQSGBDIKACVORUBLKVVLZIHIJ
GSLWPIEUNCEPHFOMKFVMHUGNOPKBGLKCGCLHEXFAYUOMKTAGDZO
HEKFLVBWKVPLGXBPHEGIXOTJWURUUNCLSOFMKVWGFMFPIKGJLLJYO
GDWFRGLFQQEYDFVCAHYZPJGKFBIRQHEBAHELFHEFSVUCLWDBWIGJG
DRAYVKFPWLZNLQFGGJQAKFBLFUWQPWGJOBSLRIVVBXBDVUMHYIZY
BZKFGQLWROZIOBMACTPHSMGECLFGVOSGUNTUJYOGBIHECERCUGW
DEMFGKPAYSQKCBWONQVGEKVDHBIDWPHEMGEAQAONOEXZIOBMA
UNTUPHFYNXXGUNFGBIFNFMMFOXJBBDCLBIBIFJMHKVFLFJMHXNHXV
KUNKFZBTFFMHMFVLVWLYHHEMFOFICOJVUYXMFZNWLLWICLVSDZIFS
EBUNNXVUFIHARCXOZKMMFPKVRUUNVOWPSGLCQWUGCECJTGHIVK
WAFLPHAQVKPHHJSGMFHMRLDDHJZUBPTDBOFGVOSGUNTUJYOGXG
UNFGBIXAFMWAAQFPAIEQQQKCFIHAQWFIBPGLUBZNNNSOWDXMXG
UNFGBIEAHAKCAHAQSGRLMSKFWDBAHEFVMHWGPHFYBIUNTUJYOG
AIYXWAZIFLYNKCRUSOKVLKBOWARIBIHJ

In August 2019, Norbert Biermann published the solution of the bigram 1346 challenge as a comment on my blog. With this success, Norbert set a new world record for the shortest bigram ciphertext ever broken. To my knowledge, this record is still valid today.

### The Bigram 1000 Challenge

After Norbert had broken the bigram 1346 message, I decided to create a new, even shorter challenge. This time, I took a message with exactly 1000 letters. I encrypted it in the same way as the bigram 1346 challenge plaintext, calling it Bigram 1000 Challenge. Here’s the ciphertext:

IBFWNUNFEBVGDQBVTTMHLHUDVVZYBSKCWCUJNSCQYCXNEBSVFD
IYWCKZHKCDSUQBKBBBCYSIYYWMOVDLQXSIQXUGEEKBVOEEVJXYSE
MCUURBLXOVMSEBBFIBTBYFJMMERNVBRWQBIPUGEKJNZJPEBTVWW
KVVIBMEXIVHMERZXHCWWKIPACKHJNTEHKSVBNPYSOXYQOUUQMA
GVWSVLPBFBFHHXHLFJNVJKYTUSIBVBBMEJIDLFPYFRQCGDAEQZJVUZ
TWKEDEHYOKKWRBJVWNVZJUUBBBFXHBBEQHMRSRJDVLJSVMHQXB
FMKKEVRACDXGXVIBOXAMCAIOBQLIBFWEHYOVAXHVJRBDBHKSEHEU
URNIBPVFQKEKDNXDZBSUDBFKHYYTEHKKIIPUGVWWKVVBNIBFWXHV
JRBDBHKDHBXUUSVXHEQNZEFAIKHHIWKZERXKFBQYQZMTUNXOSXA
QXHRJJRQSUIBLINXWUKCUGEKJNZJPEBTVWWKVVMEFGBBVRBTXVGB
BXGFVRCWBBWCBSBJBITDRQJIJIRVZEKBWMUGRTTOHRQEVRACIBBUQ
XXZXFXYQOUUSEJAEOZJNBXHNLACZJTUSIBVBNWRIQLFKBVIKJTDHBZ
JOOIAONWAEKKEBOUGEKJNZJPEBTVWWKVVBFMSRTDUAIBZATCYEQ
HNUGPEBFUNRJHKSVTCRQCYBBTDRJHKJNXYXHMDZJUUBBMEXYKUU
SGBWRFGWMFQPYUGVIXHCWBBZNUGJYBUVVSIVBZSUGBSBJBITDRQ
GXTUBFAGWAPQRVAMACIBGKXZXVKJBNIBPVNSGWTCSUVGORYOTDI
BGKXPQXMUGKIWVRBJQXUGEKJNZJPEBTVWWKVVCDQXKZNCACHKL
DUUIBNITJCGYCXNDLVIVOQXBXLFPEAGKBUGEHWUKBBSHKSVZEQXU
GRQCGKHQXBZYYWMTRRXIBTUBQNVNCRTJORTVSNXTUSIBVTRVVFD

Can a reader break this challenge? If so, he or she will set a new record.

Further reading: Can you solve this Cold War encryption challenge?

## Kommentare (16)

1. #1 TWO
Aviano
8. Oktober 2019

Wo ist Norbert?

2. #2 Richard Bean
Brisbane
9. Oktober 2019

Deavours (1977) in “Unicity Points in Cryptanalysis” estimated that 1461 letters was the unicity point for random digraphic substitution.

i.e. you can fire up R and type “library(gmp); log10(factorialZ(26^2))*.9” and it gives you 1459.

He gets it from his formula “the number of valid English plaintext segments [of N characters] has been found to be approximately 10^(0.30*N) … 0.30 is called the _entropy_ per letter”. Then p = “the probability of getting a meaningful message = the number of meaningful messages divided by the total number of possible source messages”

p = 10^(0.30*N)/26^N = 10^(-1.11*N).

because -1.11 = 0.3-log10(26)

Then the 0.90 is the reciprocal of that 1.11. And his “redundancy in digits” is 1.11/log10(26) = 0.78, so 78% redundancy.

So to update his terminology, if your estimate of English entropy in bits per letter is x, then your unicity distance estimate for this kind of cipher is

log10((26^2)!)/(log10(26)-x/8)
=
1621.3/(1.415-x/8)

and Deavours had x = 2.4 (just like you, he’s got 26 characters, no spaces) should have got 1454

To get an update of an estimate for x we can go to a real-money text compression contest at https://mattmahoney.net/dc/text.html based on 10^9 characters

The best anyone can do is about 8*.115714367 = .926 bits per character

so the unicity estimate would be 1252 letters.

So 1346 letters is solvable depending on the nature of the plaintext, great job Norbert! but 1000 shouldn’t be.

I leave it to Dr. George Lasry to comment on the link between message length and entropy, the link between digraphic IC and entropy etc 🙂

Deavours did write “This value [1.11] is not reached until more than N=20 letter groups are used in the entropy calculation for English.” and has some hand-drawn curves with an estimation vs length.

3. #3 Thomas
9. Oktober 2019

@Richard Bean
Wouldn’t the occurence of recognizable word patterns increase the unicity distance? If so, the cryptogram’s length would be no reliable evidence for unsolvability.

4. #4 Narga
9. Oktober 2019

I don’t see the connection between unicity distance and unsolvability at all.

The point of unicity distance (as I see it) is about some ciphertext being uniquely decipherable. That’s very different from the solvable/unsolvable issue. Take a look at this short ciphertext encrypted with a simple substitution: IFMMZ. It’s very easily solvable, but being shorter than the unicity distance means that there are (likely) several possible solutions (blood, queen, happy, hello…) in the respective language that the solver would all discover if he brute-forces all keys.

With the challenge ciphertexts getting shorter and shorter, it becomes actually EASIER to find “a” matching solution as there are less constraints. You just can’t be sure you found the right one. So I am sometimes a bit puzzled about the point of Klaus’ record breaking challenges, but really love them at the same time 🙂

These challenges here are super hard to solve because there is not enough statistics or leverage (yes, due to the short length) to give n-gram score hillclimbers, basinhoppers, or simulated annealing code a uniquely identifiable maximum with enough surrounding “foothill” area to be found. But If one could actually brute force all keys, and look at all the results, only then the effect of the unicity distance would come into play by not being able to identify the right solution. (Though the decision between “ATTACKATDAWN” and “MONEYISGREAT” can be sometimes easy depending on the context.)

What do you guys think?

5. #5 Richard Bean
Brisbane
10. Oktober 2019

“One key to solving an unknown cipher is a positive attitude: if you believe you have a good chance of breaking it, you may well be correct. If you believe you will not be able to crack it, you are almost certainly correct.” – Jim Gillogly, “Decoding the IRA”

“That’s the spirit!” – Roy Batty, “Blade Runner”, Los Angeles 2019

With this attitude, I guess finding a solution instead of “the” solution is easier with 1,000 letters than with 1,250 letters.

I feel very uneasy about the T. E. Wood book cipher – https://scienceblogs.de/klausis-krypto-kolumne/2017/04/17/the-top-50-unsolved-encrypted-messages-35-cryptograms-from-the-crypt/ – 21 letter book cipher and the plaintext uses several different languages. What’s that, four or five words? On the other hand, they’d have to be very common words to be recognizably words.

6. #6 George Lasry
10. Oktober 2019

Actually, if the solution is ambiguous (multiple plausible decryptions), then this definitely makes the recovery of the original text and key much harder.

I think we need to differentiate between solutions based on statistical methods, vs. solutions based on crib guessing or pattern guessing.

For solutions based on statistical methods such as hillclimbing or simulated annealing using ngrams, a few points (more details in my PhD thesis):
– The closer to the unicity distance(UD), the more ‘spurious’ scores you get (spurious in the sense that low quality keys get high scores, often higher than the score for the original key/text).
– In Reeds’s paper, he introduces the notion of UD adapted for various scoring functions. For example, the UD for bigrams is longer than for trigrams.
– From this, this implies that to overcome spurious scores, we need higher-level n-grams (this is why 5-grams or 6-grams are often used against short ciphertexts). But then we have the problem that higher-level ngrams have low ‘resilience against key errors’ (see my thesis for a semi-formal definition), which means that you need to have acquired a significant part of the key (correct elements) before the score is sensitive enough to the small change made during search.
– For some cipher, the known best methods are effective only for texts much longer than the UD. For some, they are approaching the UD limits.

It is not clear, however, whether and how UD affects methods that are not statistical in nature, like pattern matching.

7. #7 Jarl Van Eycke and Louie Helm
Belgium
14. Oktober 2019

THE MOST FAMOUS UNSOLVED CRYPTOGRAM THAT LOOKS LIKE A
MONOALPHABETICAL SUBSTITUTION NOT TO MENTION THE MOST
FAMOUS UNSOLVED CRYPTOGRAM IN THE WORLD IS BY NO DOUBT
THE MILLER MANUSCRIPT THIS WORK IS A HANDWRITTEN COLLECTION
OF TWO HUNDRED PAGES CONTAINING MANY ILLUSTRATIONS IT IS
NAMED FOR BOOK DEALER WILFRID MILLER WHO DISCOVERED IT IN
AN ITALIAN ?JESUIT CONVENT? IN NINETEEN HUNDRED TWELVE
TODAY THE MILLER MANUSCRIPT IS OWNED BY THE BEINECKE
LIBRARY IN CONNECTICUT THE SCRIPT OF THE MANUSCRIPT IS
BASED ON AN ALPHABET COMPRISING APPROXIMATELY TWENTY
SYMBOLS THE VELLUM THE MILLER MANUSCRIPT IS WRITTEN UPON
WAS DATED WITH A RADIO CARBON ANALYSIS TO THE EARLY
SIXTEENTH CENTURY HUNDREDS OF EXPERTS AND GENERATIONS OF
HOBBYIST RESEARCHERS HAVE EXAMINED THE MILLER MANUSCRIPT
IN GREAT DETAIL BUT ALL THE MAIN QUESTIONS ABOUT IT ARE
STILL UNANSWERED IT IS UNKNOWN WHO WROTE IT WHERE AND WITH
THE EXCEPTION OF THE RADIO CARBON DATING EXACTLY WHEN THE
PURPOSE OF THE BOOK IS ALSO UNCLEAR THE POINTS DEPICTED IN
THE MILLER MANUSCRIPT CANT BE IDENTIFIED THEY LOOK LIKE MERE
FANTASY IMAGES THE ILLUSTRATIONS IN THE BOOK CONTAIN NOTHING
THAT PROVIDE A CLEAR RELATIONSHIP TO ANY SPECIFIC PLACES OR TIME

8. #8 Jarl Van Eycke and Louie Helm
Belgium
15. Oktober 2019

POINTS = PRINTS

9. #9 Richard Bean
Brisbane
16. Oktober 2019

Looks correct, congratulations!

I wonder if this text is particularly “redundant” compared to “average” English text in any way.

10. #10 Thomas
16. Oktober 2019

Congratulations!

Klaus: Are you going to complete your encrypted book list since it doesn’t yet contain this allegedly famous manuscript?

11. #11 Klaus Schmeh
16. Oktober 2019

@Thomas: No 😉
This manuscript doesn’t exist. Somebody has yet to write it.

12. #12 Dampier
16. Oktober 2019

POINTS = PRINTS

Could it be PLANTS? Fits best in that context I think.

13. #13 Dampier
16. Oktober 2019

Sorry, I forgot: Congratulations! : ]

14. #14 Jarl
Belgium
17. Oktober 2019

Thanks all.

@Richard Bean, I ran some tests versus 200 book excerpts:

Bigram repeats: 0.31% above average.
Trigram repeats: 14.63% above average.
Solver score: 1.26 standard deviations above average.
Longest repeat length: 228.02% above average because the string “THEMOSTFAMOUSUNSOLVEDCRYPTOGRAM” repeats twice. Though in the cipher that repeat is spread out over the 2 rails so cribbing the one does not reveal the other and no cribs were used to solve the cipher.

@Dampier, it could be since the bigram that substitutes LA in PLANTS occurs only once, good call.

15. #15 George Lasry
18. Oktober 2019

Jarl – could you kindly provide some details about your technique? Simulated annealing (if yes, temperature schedule)? Scoring function?

Congratulation on this outstanding achievement!

George

16. #16 Jarl
Belgium
18. Oktober 2019

Thanks George,

Me and Louie are sharing our information with Klaus and I suppose he will make a new post about it. If you then have any more questions we will answer them there if you don’t mind.