A few weeks ago, I introduced a 1000-letter ciphertext created with a bigram substitution. Jarl Van Eycke and Louie Helm have now solved this challenge and set a new world record.

The bigram substitution is a manual encryption method with a history of over five centuries. A bigram (also known as a digraph) is a pair of letters, such as CG, HE, JS or QW. The number of bigrams in the Latin alphabet is 26×26=676, from AA to ZZ. A bigram substitution replaces each letter pair with another one (or with a symbol or with a number between 1 and 676). In order to use a bigram substitution, we need a substitution table with 676 entries.
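To make this concrete, here is a minimal Python sketch of a bigram substitution. It is only an illustration: the function names, the random table, and the padding of odd-length messages with X are my assumptions, not the procedure used to create the challenge.

```python
import random
import string
from itertools import product

def make_bigram_table(seed=None):
    """Return a random one-to-one mapping over all 676 letter pairs."""
    bigrams = ["".join(p) for p in product(string.ascii_uppercase, repeat=2)]
    shuffled = bigrams[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(bigrams, shuffled))

def substitute(text, table):
    """Replace each consecutive letter pair of the text using the table."""
    letters = [c for c in text.upper() if c in string.ascii_uppercase]
    if len(letters) % 2:
        letters.append("X")  # assumption: pad odd-length messages with X
    pairs = ("".join(letters[i:i + 2]) for i in range(0, len(letters), 2))
    return "".join(table[p] for p in pairs)

table = make_bigram_table(seed=2019)          # the key: 676 table entries
inverse = {v: k for k, v in table.items()}    # decryption table
ciphertext = substitute("THE MOST FAMOUS UNSOLVED CRYPTOGRAM", table)
plaintext = substitute(ciphertext, inverse)
print(ciphertext)
print(plaintext)
```

Decrypting with the inverted table returns the original letters (plus the padding X), which shows that the whole key is nothing more than the 676-entry table.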

 

World record challenges

The best method known for breaking a bigram substitution is hill climbing. The fitness function of a bigram hill climber typically uses tetragram (four-letter block) or hexagram (six-letter block) frequencies.
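As a rough illustration of such a fitness function, the following Python sketch scores a candidate plaintext by summing log frequencies of its tetragrams. The tiny sample text, the function names, and the floor value for unseen tetragrams are assumptions chosen for the example; a real solver would use statistics compiled from a very large corpus.

```python
import math
import string
from collections import Counter

def ngram_counts(corpus, n=4):
    """Count all overlapping n-letter blocks in the corpus (letters only)."""
    letters = "".join(c for c in corpus.upper() if c in string.ascii_uppercase)
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))

def make_fitness(counts, n=4):
    """Build a scoring function from n-gram counts (higher means more English-like)."""
    total = sum(counts.values())
    floor = math.log10(0.01 / total)  # assumed penalty for unseen n-grams
    logp = {g: math.log10(c / total) for g, c in counts.items()}

    def fitness(text):
        return sum(logp.get(text[i:i + n], floor)
                   for i in range(len(text) - n + 1))
    return fitness

# Toy reference text; far too small for real use.
SAMPLE = ("THE SCRIPT OF THE MANUSCRIPT IS BASED ON AN ALPHABET "
          "COMPRISING APPROXIMATELY TWENTY SYMBOLS")
fitness = make_fitness(ngram_counts(SAMPLE))

english = fitness("THEMANUSCRIPT")
gibberish = fitness("QXZJKVWQXZJKV")
print(english > gibberish)
```

A hill climber changes the key, decrypts the ciphertext, and keeps the change if this score goes up.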

The literature I am aware of does not say how long a bigram substitution ciphertext needs to be for a hill climber to succeed. Two years ago, I decided to take a first step towards closing this gap. I took two messages (one with 2500 and one with 5000 letters) and encrypted them with a bigram substitution. Subsequently, I published the two ciphertexts as challenges (Bigram 5000 and Bigram 2500) on this blog. My readers Norbert Biermann and Armin Krauß solved both challenges within a few days.

In July, I published another bigram challenge ciphertext. This one consisted of 1346 letters. A few weeks later, it was again Norbert Biermann who found the solution. This 1346-letter cryptogram was the shortest bigram ciphertext ever publicly broken, so Norbert had set a new world record.

After Norbert had proven that a 1346-letter bigram ciphertext is solvable, I couldn’t help creating a new challenge. This time, I took a plaintext consisting of 1000 letters and encrypted it with a bigram substitution. On October 7th, I published the resulting cryptogram on my blog. Here it is:

IBFWNUNFEBVGDQBVTTMHLHUDVVZYBSKCWCUJNSCQYCXNEBSVFD
IYWCKZHKCDSUQBKBBBCYSIYYWMOVDLQXSIQXUGEEKBVOEEVJXYSE
MCUURBLXOVMSEBBFIBTBYFJMMERNVBRWQBIPUGEKJNZJPEBTVWW
KVVIBMEXIVHMERZXHCWWKIPACKHJNTEHKSVBNPYSOXYQOUUQMA
GVWSVLPBFBFHHXHLFJNVJKYTUSIBVBBMEJIDLFPYFRQCGDAEQZJVUZ
TWKEDEHYOKKWRBJVWNVZJUUBBBFXHBBEQHMRSRJDVLJSVMHQXB
FMKKEVRACDXGXVIBOXAMCAIOBQLIBFWEHYOVAXHVJRBDBHKSEHEU
URNIBPVFQKEKDNXDZBSUDBFKHYYTEHKKIIPUGVWWKVVBNIBFWXHV
JRBDBHKDHBXUUSVXHEQNZEFAIKHHIWKZERXKFBQYQZMTUNXOSXA
QXHRJJRQSUIBLINXWUKCUGEKJNZJPEBTVWWKVVMEFGBBVRBTXVGB
BXGFVRCWBBWCBSBJBITDRQJIJIRVZEKBWMUGRTTOHRQEVRACIBBUQ
XXZXFXYQOUUSEJAEOZJNBXHNLACZJTUSIBVBNWRIQLFKBVIKJTDHBZ
JOOIAONWAEKKEBOUGEKJNZJPEBTVWWKVVBFMSRTDUAIBZATCYEQ
HNUGPEBFUNRJHKSVTCRQCYBBTDRJHKJNXYXHMDZJUUBBMEXYKUU
SGBWRFGWMFQPYUGVIXHCWBBZNUGJYBUVVSIVBZSUGBSBJBITDRQ
GXTUBFAGWAPQRVAMACIBGKXZXVKJBNIBPVNSGWTCSUVGORYOTDI
BGKXPQXMUGKIWVRBJQXUGEKJNZJPEBTVWWKVVCDQXKZNCACHKL
DUUIBNITJCGYCXNDLVIVOQXBXLFPEAGKBUGEHWUKBBSHKSVZEQXU
GRQCGKHQXBZYYWMTRRXIBTUBQNVNCRTJORTVSNXTUSIBVTRVVFD
BHADTEUWIWAWCERJYFHKDL

 

A solution setting a new world record

Again, my bigram challenge was broken, which meant that a new world record was set. This time, the solution came from Belgium. Jarl Van Eycke and Louie Helm solved the challenge with sophisticated hill climbing techniques. Here’s the solution the two posted:

THE MOST FAMOUS UNSOLVED CRYPTOGRAM THAT LOOKS LIKE A
MONOALPHABETICAL SUBSTITUTION NOT TO MENTION THE MOST
FAMOUS UNSOLVED CRYPTOGRAM IN THE WORLD IS BY NO DOUBT
THE MILLER MANUSCRIPT THIS WORK IS A HANDWRITTEN COLLECTION
OF TWO HUNDRED PAGES CONTAINING MANY ILLUSTRATIONS IT IS
NAMED FOR BOOK DEALER WILFRID MILLER WHO DISCOVERED IT IN
AN ITALIAN JESUIT CONVENT IN NINETEEN HUNDRED TWELVE
TODAY THE MILLER MANUSCRIPT IS OWNED BY THE BEINECKE
LIBRARY IN CONNECTICUT THE SCRIPT OF THE MANUSCRIPT IS
BASED ON AN ALPHABET COMPRISING APPROXIMATELY TWENTY
SYMBOLS THE VELLUM THE MILLER MANUSCRIPT IS WRITTEN UPON
WAS DATED WITH A RADIO CARBON ANALYSIS TO THE EARLY
SIXTEENTH CENTURY HUNDREDS OF EXPERTS AND GENERATIONS OF
HOBBYIST RESEARCHERS HAVE EXAMINED THE MILLER MANUSCRIPT
IN GREAT DETAIL BUT ALL THE MAIN QUESTIONS ABOUT IT ARE
STILL UNANSWERED IT IS UNKNOWN WHO WROTE IT WHERE AND WITH
THE EXCEPTION OF THE RADIO CARBON DATING EXACTLY WHEN THE
PURPOSE OF THE BOOK IS ALSO UNCLEAR THE POINTS DEPICTED IN
THE MILLER MANUSCRIPT CANT BE IDENTIFIED THEY LOOK LIKE MERE
FANTASY IMAGES THE ILLUSTRATIONS IN THE BOOK CONTAIN NOTHING
THAT PROVIDE A CLEAR RELATIONSHIP TO ANY SPECIFIC PLACES OR TIME

If I’m not wrong, this solution is a hundred percent correct. Sometimes, there are bigrams in such a ciphertext that can’t be decrypted unambiguously, but this appears not to be the case here.

 

How Jarl and Louie solved the challenge

The hill climber that solved the bigram 1000 challenge was written by Jarl Van Eycke. It is a lightly modified version of AZdecrypt’s substitution solver (see below). The hill climbing algorithm Jarl used includes simulated annealing and is highly optimized for deciphering bigram substitutions.
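In the comments below, Jarl describes his search as simulated annealing "without acceptance function": a better key is accepted, and after a failure the score to beat is lowered by the current temperature, which cools linearly. The following Python sketch shows that rule on a toy problem. It is not AZdecrypt's code: the toy score, the starting temperature, the iteration count, and the decision to undo a rejected change are all assumptions made for the illustration.

```python
import random
import string

def anneal(score, key, iterations=20000, start_temp=1.0, rng=None):
    """Toy search loop following the rule Jarl describes in the comments."""
    rng = rng or random.Random()
    key = list(key)
    old_score = score(key)
    best_key, best_score = key[:], old_score
    for step in range(iterations):
        temp = start_temp * (1 - step / iterations)    # linear cooling
        pos = rng.randrange(len(key))
        previous = key[pos]
        key[pos] = rng.choice(string.ascii_uppercase)  # change one symbol
        new_score = score(key)
        if new_score > old_score:
            old_score = new_score                      # success: keep the new key
            if new_score > best_score:
                best_key, best_score = key[:], new_score
        else:
            key[pos] = previous    # assumption: a rejected change is undone
            old_score -= temp      # failure: lower the score to beat
    return best_key, best_score

# Toy stand-in for an n-gram fitness function: count matching positions.
TARGET = "OCTAGRAM"

def toy_score(key):
    return sum(a == b for a, b in zip(key, TARGET))

best_key, best_score = anneal(toy_score, "AAAAAAAA", rng=random.Random(1))
print("".join(best_key), best_score)
```

Because the score to beat drifts downwards after every failure, the search can accept a key that is slightly worse than the current one while the temperature is high, and it turns into plain hill climbing as the temperature approaches zero.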

Interestingly, the fitness function Jarl used is not based on tetragrams or hexagrams but on octagrams (eight-letter blocks). The octagram statistics required were provided by Louie Helm. Dealing with octagrams is far from trivial, as there are so many. Their total number in a 26-letter alphabet is 26^8, which is approximately 209,000,000,000.

Most naive approaches to handling octagrams require more memory than modern computers have. So, Louie developed a few interesting quirks in how octagrams are compiled, encoded, and loaded into memory.

As the number of octagrams is so high, a huge amount of text is required to compile meaningful reference statistics. Louie used about 2 terabytes of English text from the following sources:

  • 1.3 million public domain books
  • 4.5 million potentially copyrighted books
  • All reddit comments and submissions
  • All of Wikipedia
  • All Project Gutenberg books
  • All of the subtitles of every movie ever released
  • All lyrics of all songs ever produced
  • 7 billion words from usenet posts
  • 34 million sentences from online news stories
  • 135 million online reviews from Amazon, TripAdvisor, and Yelp
  • 4.4 million Yahoo Answers exchanges
  • Louie’s own re-creation of the OpenAI GPT-2 data model

After chi^2-filtering and tallying, the octagrams were restricted to encodings where both the initial and final four-letter sub-grams were valid English language combinations (~1/4 of all possible combinations). Then they were given eight-bit capped-log-frequency scores, stored in a compressed pointer table, and gzipped.
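Louie's exact scoring formula is not given here. As one plausible reading of "eight-bit capped-log-frequency score", the following Python sketch maps a raw count to a single byte on a logarithmic scale and caps it at 255. The function name, the scaling, and the example counts are assumptions.

```python
import math

def capped_log_score(count, max_count, cap=255):
    """Map a raw count to 0..cap on a log scale (one plausible scheme)."""
    if count <= 0:
        return 0
    scaled = math.log(count + 1) / math.log(max_count + 1)
    return min(cap, round(scaled * cap))

# Hypothetical counts, stored as one byte per entry.
counts = [0, 1, 10, 1000, 10**6, 10**9]
scores = bytes(capped_log_score(c, max_count=10**6) for c in counts)
print(list(scores))
```

Storing one byte per octagram instead of a full counter is what keeps a table of several billion entries within reach of ordinary memory.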

The text corpora Louie used provided 2,062,507,743,806 samples of 8,178,871,377 unique octagrams. The subgram structure reduced the final file to 3,631,818,052 “valid” English octagrams, which could be comfortably loaded into 14 GB of memory (instead of 195 GB for a naive implementation).
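The figures above can be checked with a few lines of Python. The only assumption is that the naive implementation stores one byte (one eight-bit score) for every possible octagram.

```python
ALPHABET = 26
total_octagrams = ALPHABET ** 8           # every possible eight-letter block
naive_gib = total_octagrams / 2 ** 30     # one byte each, in GiB

samples = 2_062_507_743_806               # octagram samples in the corpora
unique_seen = 8_178_871_377               # distinct octagrams observed
valid_kept = 3_631_818_052                # octagrams kept in the final file

print(f"{total_octagrams:,}")             # 208,827,064,576
print(f"naive table: {naive_gib:.1f} GiB")
print(f"share of all octagrams observed: {unique_seen / total_octagrams:.1%}")
print(f"share of observed octagrams kept: {valid_kept / unique_seen:.1%}")
print(f"average samples per observed octagram: {samples / unique_seen:.0f}")
```

The naive table comes out at just under 195 GiB, which matches the figure quoted above, and only about four percent of all possible octagrams ever occur in the corpora.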

Louie is currently developing an even better version of his octagram file, with another trillion samples from about 10 billion tweets taken from Twitter. In addition, he uses an encoding improvement to reduce memory needs by another 75 percent (from 14 GB to 3.5 GB).

Jarl’s program first creates two new symbols for every unique bigram based on the following coding method: JOININGTHEJOINTS = 1 2 3 4 3 4 5 6 7 8 1 2 3 4 9 10. Then it changes the key (i.e., the substitution table) one symbol at a time. Bigram homophones (i.e., several ciphertext bigrams decrypting to the same plaintext bigram) are allowed but punished.
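The coding method in the example can be reproduced with a short Python sketch. The function name is hypothetical, and the code only mirrors the numbering shown above, not AZdecrypt's implementation.

```python
def number_bigrams(text):
    """Give every distinct bigram two fresh symbol numbers, in order of appearance."""
    ids = {}
    symbols = []
    for i in range(0, len(text) - 1, 2):
        bigram = text[i:i + 2]
        if bigram not in ids:
            first = 2 * len(ids) + 1
            ids[bigram] = (first, first + 1)
        symbols.extend(ids[bigram])
    return symbols

print(number_bigrams("JOININGTHEJOINTS"))
# [1, 2, 3, 4, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 9, 10]
```

With two symbols per bigram, a solver that assigns one plaintext letter to each symbol can treat the ciphertext like an ordinary substitution cipher and change the key one symbol at a time.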

After a few days of optimization, Jarl’s program was able to consistently solve the bigram 1346 challenge (the one already solved by Norbert Biermann) in less than five minutes without requiring a crib.

After this success, Jarl felt confident and applied his program to the bigram 1000 challenge. He started before going to work. When he came back, the software had returned a 70-80% accurate decryption in about 4 hours (using Louie’s octagram statistics). Jarl shared this result with Louie. They cleaned it up and submitted it to my blog. As mentioned above, it proved correct.

 

About Louie and Jarl

Source: Jarl Van Eycke

Jarl Van Eycke is a 38-year-old from Flanders, Belgium. Originally schooled as a graphic designer, he now works as a warehouse operator for Yusen Logistics (15 years there), mostly handling parts for the automotive industry. He started out with cryptography in 2014, trying to solve the unsolved Zodiac 340 cryptogram, and it has been a hobby ever since. Since late 2014, Jarl has been the author of AZdecrypt, a fast and powerful cipher solver.

Jarl has been assisted by Louie with AZdecrypt since early 2019. Louie mainly provided sophisticated n-gram information.

Source: Louie Helm

Louie Helm is an entrepreneur who works on genetics and machine learning day-to-day. Developing high-performance n-gram models is a hobby he picked up a few months ago.

Once again, I’m very proud that readers of this blog have broken a really tough challenge. I congratulate Jarl and Louie on this great success! I want to thank them very much for providing me with detailed information about how they broke the challenge. Keep up your great work!


Comments (17)

  1. #1 George Lasry
    27 October 2019

    Very impressive, and thanks for the fascinating details. A couple of questions, if I may:
    1) Did you try with lower n-grams (e.g. hexagrams), and if yes, what is your explanation for why 8-grams work better than 6-grams?
    2) Why did you allow homophones (would this attack have worked without homophones), and how did you penalize them?
    3) How did you run Simulated Annealing? Temperature cooling, number of rounds, and which transformations on candidate keys?

    Looking forward to learning more about your impressive achievement which pushes forward the boundaries of what can be done against challenging classical ciphers.

  2. #2 Jarl
    Belgium
    28 October 2019

    Thanks for your interest George.

    1) The 8-grams offered a safe amount of discriminative power for the cipher. I briefly tried with 7-grams and noticed that the scores of the non-solutions were close to that of a typical good English plaintext so I considered that the 7-grams possibly had not enough discriminative power for the cipher and went with 8-grams instead. After solving the cipher and realizing that the plaintext is about 1 standard deviation above average a decryption with 6 or 7-grams may be possible.

    2) Originally out of laziness but after thinking about it I convinced myself that it was not bad as it allows more freedom to the solver to move stuff around. Not sure if it would work without it. The score is penalized by dividing by 1 + homophone_ic_count / magic_number.

    3) Simulated annealing without acceptance function. Linear cooling. Number of rounds/iterations are progressively increased after each attempt/restart and start at 500,000. The key is changed one symbol at a time assigning a new random letter to it.

  3. #3 George Lasry
    28 October 2019

    Many thanks for the details. One more question if I may: what do you mean by ‘without acceptance function’?

  4. #4 Jarl
    Belgium
    28 October 2019

    Like this:

    If new_score > old_score Then
        ' success: accept new key, old_score = new_score
    Else
        ' failure: reduce old_score by current temperature value
    End If

  5. #5 David Oranchak
    http://zodiackillerciphers.com
    28 October 2019

    Fantastic job, Jarl and Louie! Congratulations!

  6. #6 Thomas
    28 October 2019

    Congratulations, outstanding!

    Jarl, if I may ask (since the Azd is linked above, I hope not off-topic):
    Does the Azd homophone cipher solver accept numbers of different length as input?
    What does ‘substitution + polyphones’ mean? Actually a polyphonic cipher in which one cipher character represents different plaintext letters, i.e. a non-deterministic cipher?

  7. #7 Dampier
    28 October 2019

    Congrats for the record!

    Louie used about 2 Terabyte of English text from the following sources:

    What an impressive list! Where can one get all that stuff? Did you have to program an own webcrawler?

    If there is a legal way (an archive on the web or so) I would appreciate a hint. thanx 🙂

  8. #8 TWO
    Gran Canaria
    28 October 2019

    Congratulations!

    So for the 1000 bigram challenge we need decagrams?

  9. #9 Jarl
    Belgium
    28 October 2019

    Thanks everyone.

    @Thomas. 1) Not sure what you mean but you can look into the Ciphers directory that comes with AZdecrypt for example ciphers that it accepts. These ciphers are either numeric or symbolic. 2) In general Substitution allows homophones by default for every solver. Yes, a polyphonic cipher in which one cipher symbol can represent multiple letters. If you sign up to Zodiackillersite I can provide a detailed explanation on how to best use the polyphone solver. All of these solvers were designed to test various hypotheses on the Zodiac 340 cipher.

    @Dampier. You can get the Reddit stuff here: https://files.pushshift.io/reddit/comments/ It is about 550 GB of compressed JSON objects.

  10. #10 TWO
    Flying across the pond
    28 October 2019

    Ehm the 750 challenge I mean.

    @Jarl
    Don’t give away too much, your work has a monetary value.

  11. #11 Jarl
    Belgium
    29 October 2019

    @TWO. I haven’t tested but it’s possible that 8-grams may be enough for 750.

  12. #12 TWO
    Gran Canaria
    29 October 2019

    I am sure Klaus will give you the opportunity to test this.

  13. #13 George Lasry
    30 October 2019

    You seem to have made important progress in state-of-the-art techniques for classical ciphers, with the innovative implementation and use of 8-grams and the special simulated annealing technique. This definitely deserves a paper in Cryptologia and/or presenting at a conference like Histocrypt 2019 (there is already a call for papers). Happy to assist in the process if you are interested.

  14. #14 Jarl
    Belgium
    3 November 2019

    George, can we contact you by e-mail?

  15. #15 bE
    6 November 2019

    @Jarl Van Eycke

    Congratulations also from me to you and Louie for breaking Klaus’ “bigram 1000 challenge”.

    In http://scienceblogs.de/klausis-krypto-kolumne/2019/10/31/celebrating-20-years-of-cryptool/ we had a question about “8grams”.

    Would you be willing to support integrating your algorithm and your 8-gram statistics into CT2?
    If so, please don’t hesitate to contact me via email:
    bernhard.esslinger@gmail.com

  16. #16 Jarl
    Belgium
    8 November 2019

    George, Bernhard and anyone else that may be interested,

    Thank you for the honor of considering our work. I’ve discussed it with Louie and we want to keep things light and fun.

    That said feel free to use the AZdecrypt code as you see fit as long as proper attributions are made. We are willing to support by answering questions at jarlve@yahoo.com.

    The latest version can be found at the project page: http://www.zodiackillersite.com/viewtopic.php?f=81&t=3198

    And the latest source can be found at: https://drive.google.com/drive/folders/0B5r0rDAOuzIQd0p1NmljRWJvYkU

    I plan to release a quick update to AZdecrypt this month that includes the bigram solver and a video of how to use it.

  17. #17 Jarl
    Belgium
    9 November 2019

    I’ve made the solver and its source available. Instruction video: https://drive.google.com/open?id=14eBoHPq-7Zqh8Od0i7kJIgpJJXMJ3l21