Blog reader Norbert Biermann has recently solved a bigram substitution ciphertext consisting of 1346 letters – the shortest one ever broken. Here’s a 1000-letter ciphertext of the same kind.

The bigram substitution is a manual encryption method with a history of over 500 years. A bigram (also known as a digraph) is a pair of letters, such as CG, HE, JS or QW. The number of bigrams in the Latin alphabet is 26×26=676, ranging from AA to ZZ. A bigram substitution replaces each letter pair with another one (or with a symbol or with a number between 1 and 676). In order to use a bigram substitution, we need a substitution table with 676 entries.

 

Porta’s bigram substitution

The oldest bigram substitution I am aware of is described in the book De Furtivis Literarum Notis written by 16th century cryptologist Giambattista della Porta. Porta uses a 20 letter alphabet. He therefore needs a substitution table with 400 entries. Here it is:

Bigram-Porta

As can be seen, Porta substituted each letter pair with a symbol. He had to be quite inventive to come up with 400 different symbols. For instance, the bigram IA is replaced with a symbol that looks like an X. The bigram VO is substituted with something resembling an O.

 

Vigenère’s bigram substitution

Blaise de Vigenère invented a bigram substitution, too. Here’s his table:

Bigram-Vigenere

Vigenère replaces each bigram with a single letter or a letter followed by a dot, colon or semicolon. E.g., LM is substituted with “r.”.

 

RSHA bigram substitution

The following bigram substitution, which is described in David Kahn’s book The Codebreakers, was used by the Nazi authority Reichssicherheitshauptamt (RSHA):

Bigram-RSHA

 

Two challenges

As far as I can tell, hill cimbing is the best approach to attack a bigram substitution. The best way to implement the fitness function of a bigram hill climber is probably to use hexagram (letter six-tuple) frequencies or some similar means.

However, hill climbing will only be successful if there is enough material to analyze, i.e., if the ciphertext is long enough. But how much ciphertext is necessary to break a bigram substitution? Not much has been published about this question in the literature. Three years ago, I decided to go a first step in finding the answer. For this purpose, I took two messages – one with 2500 and one with 5000 letters – and encrypted them. Subsequently, I published the two ciphertexts as challenges on my blog.

As usual, my readers solved both challenges within a few days. However, this time things proved a little more difficult than in most cases. Blog reader Norbert Biermann found the solution of the 5000 letters version – still with a few mistakes – using hill climbing. Thomas Ernst published a few interesting word pattern considerations. Then Norbert provided a second, more sophisticated hillclimbing result, which was almost error-free. Finally, Armin Krauß published the correct solution.

After the solution of the 5000 letter challenge had proven quite difficult, I expected that the 2500 letter ciphertext would not be solved so soon. However, I was wrong. Only a few days later, Norbert Biermann published the correct solution of the 2500 letter challenge, which he again had found with his hill climber. To my knowledge, this success represented the world record in breaking bigram substitutions.

 

The bigram 1346 challenge

Two years after Norbert’s record, I published another bigram challenge on this blog. This time, I took an English text constisting of 1346 letters as plaintext, calling it Bigram 1346 challenge. Contrary to last time, I didn’t replace bigrams with numbers but with other bigrams. For this reason, the ciphertext consisted of letters, which had to be read pair-wise. Here’s the challenge:

UNGOZIHIJGSLGVWPIVGJSOKEFMAHSDBDGLUBUNZIWPIEBIUNKFVOUN
BDSLPPHELVAQBAHEBIFJMHKVFLHXQQEFSLQQBDAQRIBVBIBYGJMOSOZ
BSDUXZINXUNEQVKUGYHUNVOWPSGSMGEFLFKRUHELFPHGVUXFGHRJ
YFUHIPBMHUNVOWPSGHXVKRSSGPHPWQXPLKCXGUNFGBICJFGJGCLLC
FPNXUNTUUKKIZBKFABEQNHRFWLKCYHDJHJOPBZRLAHQVFTHETGRQRJ
TDAYDTXTVDBDKFEFZKSDHETUFVIQBIYABDEXZIKCHXRUKQRLGECJAQA
OBKZIOBTEFMFRZNZACLWDWAUNEBBISLMREQKWRJRCUGHERGJMXON
WGJHIPBEYGDZOHXIXKFOXFLKVRUDWAOBIDLSRRICSICJGKFZBBUMRFM
GQKFYBXOHETGHEOAMUEMWLAYRJWPKIGXUOSKZIHIJGSLHIGIBLFUXX
UKUQPHGEGWHIOPZIBDLVBUKQRJSMUGFPWLWPSGPHFIKVYXCJLVULK
VQSZIBDLVBUUISRRUGJAIYXXGLKQXFRPBUOJIBTGHGDTGRCICUNVOWP
SGDTLIEMTAVOUNVUQQKCRTLHBDAIUXFGOAKPBTKVFLAHVWRHWAUG
KCXGUNFGBIXAGJHEHXGLBIRNGDOEBPUQSGBDIKACVORUBLKVVLZIHIJ
GSLWPIEUNCEPHFOMKFVMHUGNOPKBGLKCGCLHEXFAYUOMKTAGDZO
HEKFLVBWKVPLGXBPHEGIXOTJWURUUNCLSOFMKVWGFMFPIKGJLLJYO
GDWFRGLFQQEYDFVCAHYZPJGKFBIRQHEBAHELFHEFSVUCLWDBWIGJG
DRAYVKFPWLZNLQFGGJQAKFBLFUWQPWGJOBSLRIVVBXBDVUMHYIZY
BZKFGQLWROZIOBMACTPHSMGECLFGVOSGUNTUJYOGBIHECERCUGW
DEMFGKPAYSQKCBWONQVGEKVDHBIDWPHEMGEAQAONOEXZIOBMA
UNTUPHFYNXXGUNFGBIFNFMMFOXJBBDCLBIBIFJMHKVFLFJMHXNHXV
KUNKFZBTFFMHMFVLVWLYHHEMFOFICOJVUYXMFZNWLLWICLVSDZIFS
EBUNNXVUFIHARCXOZKMMFPKVRUUNVOWPSGLCQWUGCECJTGHIVK
WAFLPHAQVKPHHJSGMFHMRLDDHJZUBPTDBOFGVOSGUNTUJYOGXG
UNFGBIXAFMWAAQFPAIEQQQKCFIHAQWFIBPGLUBZNNNSOWDXMXG
UNFGBIEAHAKCAHAQSGRLMSKFWDBAHEFVMHWGPHFYBIUNTUJYOG
AIYXWAZIFLYNKCRUSOKVLKBOWARIBIHJ

In August 2019, Norbert Biermann published the solution of the bigram 1346 challenge as a comment on my blog. With this success, Norbert set a new world record for the shortest bigram ciphertext ever broken. To my knowledge, this record is still valid today.

 

The Bigram 1000 Challenge

After Norbert had broken the bigram 1346 message, I decided to create a new, even shorter challenge. This time, I took a message with exactly 1000 letters. I encrypted it in the same way as the bigram 1346 challenge plaintext, calling it Bigram 1000 Challenge. Here’s the ciphertext:

IBFWNUNFEBVGDQBVTTMHLHUDVVZYBSKCWCUJNSCQYCXNEBSVFD
IYWCKZHKCDSUQBKBBBCYSIYYWMOVDLQXSIQXUGEEKBVOEEVJXYSE
MCUURBLXOVMSEBBFIBTBYFJMMERNVBRWQBIPUGEKJNZJPEBTVWW
KVVIBMEXIVHMERZXHCWWKIPACKHJNTEHKSVBNPYSOXYQOUUQMA
GVWSVLPBFBFHHXHLFJNVJKYTUSIBVBBMEJIDLFPYFRQCGDAEQZJVUZ
TWKEDEHYOKKWRBJVWNVZJUUBBBFXHBBEQHMRSRJDVLJSVMHQXB
FMKKEVRACDXGXVIBOXAMCAIOBQLIBFWEHYOVAXHVJRBDBHKSEHEU
URNIBPVFQKEKDNXDZBSUDBFKHYYTEHKKIIPUGVWWKVVBNIBFWXHV
JRBDBHKDHBXUUSVXHEQNZEFAIKHHIWKZERXKFBQYQZMTUNXOSXA
QXHRJJRQSUIBLINXWUKCUGEKJNZJPEBTVWWKVVMEFGBBVRBTXVGB
BXGFVRCWBBWCBSBJBITDRQJIJIRVZEKBWMUGRTTOHRQEVRACIBBUQ
XXZXFXYQOUUSEJAEOZJNBXHNLACZJTUSIBVBNWRIQLFKBVIKJTDHBZ
JOOIAONWAEKKEBOUGEKJNZJPEBTVWWKVVBFMSRTDUAIBZATCYEQ
HNUGPEBFUNRJHKSVTCRQCYBBTDRJHKJNXYXHMDZJUUBBMEXYKUU
SGBWRFGWMFQPYUGVIXHCWBBZNUGJYBUVVSIVBZSUGBSBJBITDRQ
GXTUBFAGWAPQRVAMACIBGKXZXVKJBNIBPVNSGWTCSUVGORYOTDI
BGKXPQXMUGKIWVRBJQXUGEKJNZJPEBTVWWKVVCDQXKZNCACHKL
DUUIBNITJCGYCXNDLVIVOQXBXLFPEAGKBUGEHWUKBBSHKSVZEQXU
GRQCGKHQXBZYYWMTRRXIBTUBQNVNCRTJORTVSNXTUSIBVTRVVFD
BHADTEUWIWAWCERJYFHKDL

Can a reader break this challenge? If so, he or she will set a new record.


Further reading: Can you solve this Cold War encryption challenge?

Linkedin: https://www.linkedin.com/groups/13501820
Facebook: https://www.facebook.com/groups/763282653806483/

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Kommentare (16)

  1. #1 TWO
    Aviano
    8. Oktober 2019

    Wo ist Norbert?

  2. #2 Richard Bean
    Brisbane
    9. Oktober 2019

    Deavours (1977) in “Unicity Points in Cryptanalysis” estimated that 1461 letters was the unicity point for random digraphic substitution.

    i.e. you can fire up R and type “library(gmp); log10(factorialZ(26^2))*.9” and it gives you 1459.

    He gets it from his formula “the number of valid English plaintext segments [of N characters] has been found to be approximately 10^(0.30*N) … 0.30 is called the _entropy_ per letter”. Then p = “the probability of getting a meaningful message = the number of meaningful messages divided by the total number of possible source messages”

    p = 10^(0.30*N)/26^N = 10^(-1.11*N).

    because -1.11 = 0.3-log10(26)

    Then the 0.90 is the reciprocal of that 1.11. And his “redundancy in digits” is 1.11/log10(26) = 0.78, so 78% redundancy.

    So to update his terminology, if your estimate of English entropy in bits per letter is x, then your unicity distance estimate for this kind of cipher is

    log10((26^2)!)/(log10(26)-x/8)
    =
    1621.3/(1.415-x/8)

    and Deavours had x = 2.4 (just like you, he’s got 26 characters, no spaces) should have got 1454

    To get an update of an estimate for x we can go to a real-money text compression contest at http://mattmahoney.net/dc/text.html based on 10^9 characters

    The best anyone can do is about 8*.115714367 = .926 bits per character

    so the unicity estimate would be 1252 letters.

    So 1346 letters is solvable depending on the nature of the plaintext, great job Norbert! but 1000 shouldn’t be.

    I leave it to Dr. George Lasry to comment on the link between message length and entropy, the link between digraphic IC and entropy etc 🙂

    Deavours did write “This value [1.11] is not reached until more than N=20 letter groups are used in the entropy calculation for English.” and has some hand-drawn curves with an estimation vs length.

  3. #3 Thomas
    9. Oktober 2019

    @Richard Bean
    Wouldn’t the occurence of recognizable word patterns increase the unicity distance? If so, the cryptogram’s length would be no reliable evidence for unsolvability.

  4. #4 Narga
    9. Oktober 2019

    I don’t see the connection between unicity distance and unsolvability at all.

    The point of unicity distance (as I see it) is about some ciphertext being uniquely decipherable. That’s very different from the solvable/unsolvable issue. Take a look at this short ciphertext encrypted with a simple substitution: IFMMZ. It’s very easily solvable, but being shorter than the unicity distance means that there are (likely) several possible solutions (blood, queen, happy, hello…) in the respective language that the solver would all discover if he brute-forces all keys.

    With the challenge ciphertexts getting shorter and shorter, it becomes actually EASIER to find “a” matching solution as there are less constraints. You just can’t be sure you found the right one. So I am sometimes a bit puzzled about the point of Klaus’ record breaking challenges, but really love them at the same time 🙂

    These challenges here are super hard to solve because there is not enough statistics or leverage (yes, due to the short length) to give n-gram score hillclimbers, basinhoppers, or simulated annealing code a uniquely identifiable maximum with enough surrounding “foothill” area to be found. But If one could actually brute force all keys, and look at all the results, only then the effect of the unicity distance would come into play by not being able to identify the right solution. (Though the decision between “ATTACKATDAWN” and “MONEYISGREAT” can be sometimes easy depending on the context.)

    What do you guys think?

  5. #5 Richard Bean
    Brisbane
    10. Oktober 2019

    Oh it’s good. I admire your positive attitude!

    “One key to solving an unknown cipher is a positive attitude: if you believe you have a good chance of breaking it, you may well be correct. If you believe you will not be able to crack it, you are almost certainly correct.” – Jim Gillogly, “Decoding the IRA”

    “That’s the spirit!” – Roy Batty, “Blade Runner”, Los Angeles 2019

    With this attitude, I guess finding a solution instead of “the” solution is easier with 1,000 letters than with 1,250 letters.

    I feel very uneasy about the T. E. Wood book cipher – http://scienceblogs.de/klausis-krypto-kolumne/2017/04/17/the-top-50-unsolved-encrypted-messages-35-cryptograms-from-the-crypt/ – 21 letter book cipher and the plaintext uses several different languages. What’s that, four or five words? On the other hand, they’d have to be very common words to be recognizably words.

  6. #6 George Lasry
    10. Oktober 2019

    Actually, if the solution is ambiguous (multiple plausible decryptions), then this definitely makes the recovery of the original text and key much harder.

    I think we need to differentiate between solutions based on statistical methods, vs. solutions based on crib guessing or pattern guessing.

    For solutions based on statistical methods such as hillclimbing or simulated annealing using ngrams, a few points (more details in my PhD thesis):
    – The closer to the unicity distance(UD), the more ‘spurious’ scores you get (spurious in the sense that low quality keys get high scores, often higher than the score for the original key/text).
    – In Reeds’s paper, he introduces the notion of UD adapted for various scoring functions. For example, the UD for bigrams is longer than for trigrams.
    – From this, this implies that to overcome spurious scores, we need higher-level n-grams (this is why 5-grams or 6-grams are often used against short ciphertexts). But then we have the problem that higher-level ngrams have low ‘resilience against key errors’ (see my thesis for a semi-formal definition), which means that you need to have acquired a significant part of the key (correct elements) before the score is sensitive enough to the small change made during search.
    – For some cipher, the known best methods are effective only for texts much longer than the UD. For some, they are approaching the UD limits.

    It is not clear, however, whether and how UD affects methods that are not statistical in nature, like pattern matching.

  7. #7 Jarl Van Eycke and Louie Helm
    Belgium
    14. Oktober 2019

    THE MOST FAMOUS UNSOLVED CRYPTOGRAM THAT LOOKS LIKE A
    MONOALPHABETICAL SUBSTITUTION NOT TO MENTION THE MOST
    FAMOUS UNSOLVED CRYPTOGRAM IN THE WORLD IS BY NO DOUBT
    THE MILLER MANUSCRIPT THIS WORK IS A HANDWRITTEN COLLECTION
    OF TWO HUNDRED PAGES CONTAINING MANY ILLUSTRATIONS IT IS
    NAMED FOR BOOK DEALER WILFRID MILLER WHO DISCOVERED IT IN
    AN ITALIAN ?JESUIT CONVENT? IN NINETEEN HUNDRED TWELVE
    TODAY THE MILLER MANUSCRIPT IS OWNED BY THE BEINECKE
    LIBRARY IN CONNECTICUT THE SCRIPT OF THE MANUSCRIPT IS
    BASED ON AN ALPHABET COMPRISING APPROXIMATELY TWENTY
    SYMBOLS THE VELLUM THE MILLER MANUSCRIPT IS WRITTEN UPON
    WAS DATED WITH A RADIO CARBON ANALYSIS TO THE EARLY
    SIXTEENTH CENTURY HUNDREDS OF EXPERTS AND GENERATIONS OF
    HOBBYIST RESEARCHERS HAVE EXAMINED THE MILLER MANUSCRIPT
    IN GREAT DETAIL BUT ALL THE MAIN QUESTIONS ABOUT IT ARE
    STILL UNANSWERED IT IS UNKNOWN WHO WROTE IT WHERE AND WITH
    THE EXCEPTION OF THE RADIO CARBON DATING EXACTLY WHEN THE
    PURPOSE OF THE BOOK IS ALSO UNCLEAR THE POINTS DEPICTED IN
    THE MILLER MANUSCRIPT CANT BE IDENTIFIED THEY LOOK LIKE MERE
    FANTASY IMAGES THE ILLUSTRATIONS IN THE BOOK CONTAIN NOTHING
    THAT PROVIDE A CLEAR RELATIONSHIP TO ANY SPECIFIC PLACES OR TIME

  8. #8 Jarl Van Eycke and Louie Helm
    Belgium
    15. Oktober 2019

    Small addendum:

    POINTS = PRINTS

  9. #9 Richard Bean
    Brisbane
    16. Oktober 2019

    Looks correct, congratulations!

    I wonder if this text is particularly “redundant” compared to “average” English text in any way.

  10. #10 Thomas
    16. Oktober 2019

    Congratulations!

    Klaus: Are you going to complete your encrypted book list since it doesn’t yet contain this allegedly famous manuscript?

  11. #11 Klaus Schmeh
    16. Oktober 2019

    @Thomas: No 😉
    This manuscript doesn’t exist. Somebody has yet to write it.

  12. #12 Dampier
    16. Oktober 2019

    Small addendum:
    POINTS = PRINTS

    Could it be PLANTS? Fits best in that context I think.

  13. #13 Dampier
    16. Oktober 2019

    Sorry, I forgot: Congratulations! : ]

  14. #14 Jarl
    Belgium
    17. Oktober 2019

    Thanks all.

    @Richard Bean, I ran some tests versus 200 book excerpts:

    Bigram repeats: 0.31% above average.
    Trigram repeats: 14.63% above average.
    Solver score: 1.26 standard deviations above average.
    Longest repeat length: 228.02% above average because the string “THEMOSTFAMOUSUNSOLVEDCRYPTOGRAM” repeats twice. Though in the cipher that repeat is spread out over the 2 rails so cribbing the one does not reveal the other and no cribs were used to solve the cipher.

    @Dampier, it could be since the bigram that substitutes LA in PLANTS occurs only once, good call.

  15. #15 George Lasry
    18. Oktober 2019

    Jarl – could you kindly provide some details about your technique? Simulated annealing (if yes, temperature schedule)? Scoring function?

    Congratulation on this outstanding achievement!

    George

  16. #16 Jarl
    Belgium
    18. Oktober 2019

    Thanks George,

    Me and Louie are sharing our information with Klaus and I suppose he will make a new post about it. If you then have any more questions we will answer them there if you don’t mind.