A few weeks ago, I introduced a 1000-letter ciphertext created with a bigram substitution. Jarl Van Eycke and Louie Helm have now solved this challenge and set a new world record.
The bigram substitution is a manual encryption method with a history of over five centuries. A bigram (also known as a digraph) is a pair of letters, such as CG, HE, JS or QW. The number of bigrams in the Latin alphabet is 26×26=676, from AA to ZZ. A bigram substitution replaces each letter pair with another one (or with a symbol or with a number between 1 and 676). In order to use a bigram substitution, we need a substitution table with 676 entries.
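To make the idea concrete, here is a minimal sketch of a bigram substitution in Python. The function names, the random table generation, and the padding of odd-length messages with an X are my own assumptions for illustration; they are not taken from any historical cipher or from the challenge itself.

```python
import random
import string
from itertools import product

def make_bigram_table(seed=None):
    """Build a random one-to-one table mapping each of the 676 bigrams to another bigram."""
    rng = random.Random(seed)
    bigrams = ["".join(p) for p in product(string.ascii_uppercase, repeat=2)]
    shuffled = bigrams[:]
    rng.shuffle(shuffled)
    return dict(zip(bigrams, shuffled))

def encrypt(plaintext, table, pad="X"):
    """Replace each letter pair using the table (assumption: pad odd-length input with X)."""
    letters = [c for c in plaintext.upper() if c in string.ascii_uppercase]
    if len(letters) % 2:
        letters.append(pad)
    pairs = ("".join(letters[i:i + 2]) for i in range(0, len(letters), 2))
    return "".join(table[p] for p in pairs)

def decrypt(ciphertext, table):
    """Invert the table and map every ciphertext pair back to its plaintext pair."""
    inverse = {v: k for k, v in table.items()}
    return "".join(inverse[ciphertext[i:i + 2]] for i in range(0, len(ciphertext), 2))

table = make_bigram_table(seed=1)
ciphertext = encrypt("HELLO WORLD", table)
print(ciphertext, decrypt(ciphertext, table))
```

Spaces and punctuation are dropped before encryption, so the decrypted text comes back as one unbroken string of letters.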
World record challenges
The best method known for breaking a bigram substitution is hill climbing. The fitness function of a bigram hill climber typically uses tetragram (four-letter block) or hexagram (six-letter block) frequencies.
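The following is a minimal sketch of such a hill climber, assuming a toy tetragram table built from a short reference text. All names (`build_ngram_scores`, `fitness`, `hill_climb`), the floor value for unseen tetragrams, and the mutation step are hypothetical choices for illustration. A real solver needs far larger reference statistics and many more iterations.

```python
import math
import random
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def build_ngram_scores(corpus, n=4):
    """Turn a reference text into log10 relative frequencies of its n-grams."""
    letters = "".join(c for c in corpus.upper() if c in ALPHABET)
    counts = Counter(letters[i:i + n] for i in range(len(letters) - n + 1))
    total = sum(counts.values())
    scores = {gram: math.log10(count / total) for gram, count in counts.items()}
    floor = math.log10(0.01 / total)  # assumed penalty for unseen n-grams
    return scores, floor

def fitness(text, scores, floor, n=4):
    """Sum the log scores of all overlapping n-grams; higher means more English-like."""
    return sum(scores.get(text[i:i + n], floor) for i in range(len(text) - n + 1))

def hill_climb(cipher_bigrams, scores, floor, steps=2000, seed=0):
    """Change one key entry at a time and keep the change if the fitness does not drop."""
    rng = random.Random(seed)
    symbols = sorted(set(cipher_bigrams))
    key = {s: rng.choice(ALPHABET) + rng.choice(ALPHABET) for s in symbols}

    def decode(k):
        return "".join(k[b] for b in cipher_bigrams)

    best = fitness(decode(key), scores, floor)
    for _ in range(steps):
        symbol = rng.choice(symbols)
        old = key[symbol]
        key[symbol] = rng.choice(ALPHABET) + rng.choice(ALPHABET)
        candidate = fitness(decode(key), scores, floor)
        if candidate >= best:
            best = candidate
        else:
            key[symbol] = old  # revert the change
    return key, best
```

Note that this sketch lets several ciphertext bigrams decrypt to the same plaintext bigram and recomputes the whole fitness after every change. Both are simplifications that keep the example short.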
The literature I am aware of does not say how long a bigram substitution ciphertext needs to be for a hill climber to succeed. Two years ago, I decided to take a first step towards closing this gap. I took two messages – one with 2500 and one with 5000 letters – and encrypted them with a bigram substitution. Subsequently, I published the two ciphertexts as challenges (Bigram 5000 and Bigram 2500) on this blog. My readers Norbert Biermann and Armin Krauß solved both challenges within a few days.
In July, I published another bigram challenge ciphertext. This one consisted of 1346 letters. A few weeks later, it was again Norbert Biermann who found the solution. This 1346-letter cryptogram was the shortest bigram substitution ciphertext ever publicly broken, so Norbert had set a new world record.
After Norbert had proven that a 1346-letter bigram ciphertext is solvable, I couldn’t help creating a new challenge. This time, I took a plaintext consisting of 1,000 letters and encrypted it with a bigram substitution. On October 7th, I published the resulting cryptogram on my blog. Here it is:
A solution setting a new world record
Again, my bigram challenge was broken, which meant that a new world record was set. This time, the solution came from Belgium. Jarl Van Eycke and Louie Helm solved the challenge with sophisticated hill climbing techniques. Here’s the solution the two posted:
THE MOST FAMOUS UNSOLVED CRYPTOGRAM THAT LOOKS LIKE A
MONOALPHABETICAL SUBSTITUTION NOT TO MENTION THE MOST
FAMOUS UNSOLVED CRYPTOGRAM IN THE WORLD IS BY NO DOUBT
THE MILLER MANUSCRIPT THIS WORK IS A HANDWRITTEN COLLECTION
OF TWO HUNDRED PAGES CONTAINING MANY ILLUSTRATIONS IT IS
NAMED FOR BOOK DEALER WILFRID MILLER WHO DISCOVERED IT IN
AN ITALIAN JESUIT CONVENT IN NINETEEN HUNDRED TWELVE
TODAY THE MILLER MANUSCRIPT IS OWNED BY THE BEINECKE
LIBRARY IN CONNECTICUT THE SCRIPT OF THE MANUSCRIPT IS
BASED ON AN ALPHABET COMPRISING APPROXIMATELY TWENTY
SYMBOLS THE VELLUM THE MILLER MANUSCRIPT IS WRITTEN UPON
WAS DATED WITH A RADIO CARBON ANALYSIS TO THE EARLY
SIXTEENTH CENTURY HUNDREDS OF EXPERTS AND GENERATIONS OF
HOBBYIST RESEARCHERS HAVE EXAMINED THE MILLER MANUSCRIPT
IN GREAT DETAIL BUT ALL THE MAIN QUESTIONS ABOUT IT ARE
STILL UNANSWERED IT IS UNKNOWN WHO WROTE IT WHERE AND WITH
THE EXCEPTION OF THE RADIO CARBON DATING EXACTLY WHEN THE
PURPOSE OF THE BOOK IS ALSO UNCLEAR THE POINTS DEPICTED IN
THE MILLER MANUSCRIPT CANT BE IDENTIFIED THEY LOOK LIKE MERE
FANTASY IMAGES THE ILLUSTRATIONS IN THE BOOK CONTAIN NOTHING
THAT PROVIDE A CLEAR RELATIONSHIP TO ANY SPECIFIC PLACES OR TIME
If I’m not wrong, this solution is a hundred percent correct. Sometimes, there are bigrams in such a ciphertext that can’t be decrypted unambiguously, but this appears not to be the case here.
How Jarl and Louie solved the challenge
The hill climber that solved the bigram 1000 challenge was written by Jarl Van Eycke. It is a light modification of AZdecrypt’s substitution solver (see below). The hill climbing algorithm Jarl used includes simulated annealing and is highly optimized for deciphering bigram substitutions.
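For readers unfamiliar with simulated annealing, here is a minimal sketch of the usual acceptance rule and a simple cooling schedule. This is the textbook Metropolis criterion, not AZdecrypt's actual code. The function names, the geometric schedule, and the start and end temperatures are assumptions for illustration.

```python
import math
import random

def accept(delta, temperature, rng=random):
    """Metropolis rule: always accept improvements, sometimes accept setbacks.

    delta is the new fitness minus the old fitness; temperature must be positive.
    """
    if delta >= 0:
        return True
    return rng.random() < math.exp(delta / temperature)

def temperature_at(step, steps, start=10.0, end=0.1):
    """Geometric cooling from `start` down to `end` over `steps` iterations (assumed schedule)."""
    return start * (end / start) ** (step / max(1, steps - 1))
```

Unlike a plain hill climber, a solver using this rule occasionally keeps a key change that lowers the fitness, which helps it escape local maxima while the temperature is still high.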
Interestingly, the fitness function Jarl used is not based on tetragrams or hexagrams but on octagrams (eight-letter blocks). The octagram statistics required were provided by Louie Helm. Dealing with octagrams is far from trivial, as there are so many. Their total number in a 26-letter alphabet is 26^8 = 208,827,064,576, which is roughly 209 billion.
Most naive approaches to handling octagrams require more memory than modern computers have. So, Louie developed a few interesting tricks for how octagrams are compiled, encoded, and loaded into memory.
As the number of octagrams is so high, a huge amount of text is required to compile meaningful reference statistics. Louie used about 2 terabytes of English text from the following sources:
- 1.3 million public domain books
- 4.5 million potentially copyrighted books
- All reddit comments and submissions
- All of Wikipedia
- All Project Gutenberg books
- All of the subtitles of every movie ever released
- All lyrics of all songs ever produced
- 7 billion words from usenet posts
- 34 million sentences from online news stories
- 135 million online reviews from Amazon, TripAdvisor, and Yelp
- 4.4 million Yahoo Answers exchanges
- Louie’s own re-creation of the OpenAI GPT-2 data model
After chi^2-filtering and tallying, the octagrams were restricted to encodings where both the initial and final four-letter sub-grams were valid English language combinations (~1/4 of all possible combinations). Then they were given eight-bit capped-log-frequency scores, stored in a compressed pointer table, and gzipped.
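The exact encoding Louie used is not described here, so the following is only a sketch of what a sub-gram filter and an eight-bit capped log-frequency score could look like. The scale factor, the `+ 1` inside the logarithm, and both function names are assumptions of mine, not details of Louie's file format.

```python
import math

def keep_octagram(octagram, valid_tetragrams):
    """Keep an octagram only if its first and last four letters are both known tetragrams."""
    return octagram[:4] in valid_tetragrams and octagram[4:] in valid_tetragrams

def capped_log_score(count, scale=10.0, cap=255):
    """Map a raw count to a log-scaled score that fits into one byte (0..255)."""
    if count <= 0:
        return 0
    return min(cap, round(scale * math.log(count + 1)))

# Hypothetical toy data: octagram counts and a set of "valid" tetragrams.
counts = {"THEMOSTF": 1200, "QZXJKVWP": 3, "MANUSCRI": 54000}
valid_tetragrams = {"THEM", "OSTF", "MANU", "SCRI"}

kept = {g: c for g, c in counts.items() if keep_octagram(g, valid_tetragrams)}
packed = bytes(capped_log_score(c) for c in kept.values())
print(sorted(kept), list(packed))
```

Storing one byte per octagram instead of a full counter is what makes a table of this size manageable at all. The logarithm keeps the ordering of frequencies while compressing their range.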
The text corpora Louie used provided 2,062,507,743,806 samples of 8,178,871,377 unique octagrams. The subgram structure reduced the final file to 3,631,818,052 “valid” English octagrams, which could be comfortably loaded into 14 GB of memory (instead of 195 GB for a naive implementation).
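The numbers above can be checked with a few lines of arithmetic. The sketch below assumes that the "naive implementation" means one eight-bit score for each of the 26^8 possible octagrams. That is my reading, not something Louie stated.

```python
ALPHABET_SIZE = 26
total_octagrams = ALPHABET_SIZE ** 8

# Assumption: the naive table stores one byte per possible octagram.
naive_gib = total_octagrams / 2**30

samples = 2_062_507_743_806
unique_octagrams = 8_178_871_377
valid_octagrams = 3_631_818_052

print(f"possible octagrams:        {total_octagrams:,}")
print(f"naive table size:          {naive_gib:.1f} GiB")
print(f"samples per unique gram:   {samples / unique_octagrams:.1f}")
print(f"share kept after filter:   {valid_octagrams / unique_octagrams:.1%}")
```

Under this assumption, the naive table comes out at about 194.5 GiB, which matches the 195 GB figure quoted above.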
Louie is currently developing an even better version of his octagram file, with another trillion samples from about 10 billion tweets taken from Twitter. In addition, he uses an encoding improvement to reduce memory needs by another 75 percent (from 14 GB to 3.5 GB).
Jarl’s program first creates two new symbols for every unique bigram based on the following coding method: JOININGTHEJOINTS = 1 2 3 4 3 4 5 6 7 8 1 2 3 4 9 10. Then it changes the key (i.e., the substitution table) one symbol at a time. Bigram homophones (i.e., several ciphertext bigrams decrypting to the same plaintext bigram) are allowed but punished.
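The coding method from the example can be reproduced with a short sketch. The function name is hypothetical, and dropping a trailing single letter in odd-length input is my assumption. The numbering itself follows the JOININGTHEJOINTS example above.

```python
def bigrams_to_symbols(text):
    """Assign two consecutive new symbol numbers to every unique bigram, in order of appearance."""
    ids = {}
    symbols = []
    # Step through the text two letters at a time; a trailing single letter is ignored.
    for i in range(0, len(text) - 1, 2):
        bigram = text[i:i + 2]
        if bigram not in ids:
            n = len(ids)
            ids[bigram] = (2 * n + 1, 2 * n + 2)
        symbols.extend(ids[bigram])
    return symbols

print(bigrams_to_symbols("JOININGTHEJOINTS"))
# prints [1, 2, 3, 4, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 9, 10]
```

With this representation, the bigram ciphertext looks like an ordinary substitution cipher over a larger symbol set, so a solver that changes one symbol at a time can work on it.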
After a few days of optimization, Jarl’s program was able to consistently solve the bigram 1346 challenge (the one already solved by Norbert Biermann) in less than five minutes without requiring a crib.
After this success, Jarl felt confident and applied his program to the bigram 1000 challenge. He started before going to work. When he came back, the software had returned a 70-80% accurate decryption in about 4 hours (using Louie’s octagram statistics). Jarl shared this result with Louie. They cleaned it up and submitted it to my blog. As mentioned above, it proved correct.
About Louie and Jarl
Jarl Van Eycke is a 38-year-old from Flanders, Belgium. Originally schooled as a graphic designer, he now works as a warehouse operator for a logistics provider (15 years there), mostly handling parts for the automotive industry. He started out with cryptography in 2014, trying to solve the unsolved Zodiac 340 cryptogram, and it has been a hobby ever since. Jarl is the author of AZdecrypt, a fast and powerful cipher solver he has been developing since late 2014.
Louie has been assisting Jarl with AZdecrypt since early 2019, mainly by providing sophisticated n-gram information.
Louie Helm is an entrepreneur who works on genetics and machine learning day-to-day. Developing high-performance n-gram models is a hobby he picked up a few months ago.
Once again, I’m very proud that readers of this blog have broken a really tough challenge. I congratulate Jarl and Louie on this great success! I want to thank them very much for providing me detailed information about how they broke the challenge. Keep up your great work!
Further reading: The Top 50 unsolved encrypted messages: 7. The cigaret case cryptogram