Last week, I introduced a 750-letter ciphertext created with a bigram substitution. Jarl Van Eycke and Louie Helm have now solved this challenge. As far as I know, this is the shortest bigram cipher challenge ever broken.
A bigram (also known as a digraph) is a pair of letters, such as CG, HE, JS or QW. The number of bigrams in the Latin alphabet is 26×26=676. A bigram substitution is an encryption method that replaces each bigram with another one (or with a symbol or with a number between 1 and 676) based on a fixed table.
World record challenges
The best method known for breaking a bigram substitution is hill climbing. In the literature I am aware of, it is not mentioned how long a bigram substitution ciphertext needs to be in order for a successful cryptanalysis (for instance, with hill climbing) to be possible.
Two years ago, I decided to take a first step towards answering this question. I took two messages – one with 2500 and one with 5000 letters – and encrypted them with a bigram substitution. Subsequently, I published the two ciphertexts as challenges (Bigram 5000 and Bigram 2500) on this blog. My readers Norbert Biermann and Armin Krauß solved both challenges within a few days.
Earlier this year, I published another bigram challenge ciphertext. It consisted of 1346 letters. Again, it was broken by Norbert Biermann. This 1346-letters cryptogram was the shortest of its kind ever publicly broken, so Norbert had reached a new world record.
After Norbert had proven that a 1346-letters bigram ciphertext is solvable, I created my next challenge. This time, I took a plaintext consisting of exactly 1000 letters and encrypted it with a bigram substitution. On October 7th, I published the resulting cryptogram on my blog.
Again, the challenge was broken, which meant that a new world record was set. This time, the solution came from Jarl Van Eycke and Louie Helm. They had applied sophisticated hill climbing techniques, including simulated annealing and some optimizations for deciphering bigram substitutions.
Interestingly, the fitness function Jarl and Louie used was based on octagrams (eight-letter blocks). Dealing with octagrams is far from trivial, as there are so many. Their total number in a 26-letters alphabet is 26^8, which is approximately 200,000,000,000. For this reason, a huge amount of text is required to compile meaningful reference statistics. Louie used about 2 Terabyte of English text from the following sources:
- 1.3 million public domain books
- 4.5 million potentially copyrighted books
- All reddit comments and submissions
- All of Wikipedia
- All Project Gutenberg books
- All of the subtitles of every movie ever released
- All lyrics of all songs ever produced
- 7 billion words from usenet posts
- 34 million sentences from online news stories
- 135 million online reviews from Amazon, TripAdvisor, and Yelp
- 4.4 million Yahoo Answers exchanges
The Bigram 750 challenge
The next bigram challenge I created was based on a plaintext consisting of 750 letters. Here’s the ciphertext (Bigram 750 Challenge) I published on December 12th:
YYXFTVUJKXMYWODAWFZPSAPPVDWNEXAJXFPPRXKCMFBZIXDLTC VIBSKLZOXIUKPEMUXFEMDUOGPCRRMWZSVBNMYYSHLWCIAJJWOR CFCHKYRXYYJVUPAGJHBZAJZPCJSEWZSEWZCJLFWOFHSAEMXZZU JHLNGNNMYYIXUVNMYYIXBWAOKYJRYCHUBMNOQTXAPCRRMWPPWZ AMLLPCXFEMWFITKYPGISZEKJMOMUXAEREKWGQTEOXILBUGGNTC YOYAHUUQZNKYBJADXIAFICRWCRFPPGZIEEBZHUIWKRKERRLZWF GQNAJRJQNTPYKBPEKBDLNGDYXPVAZSSKUVHUBBDLXAWFZUPNHZ CCRXGOLFZUHUGNVWDYRRSAJHTRZUXAXPKMYYYCHRXZDUQSLFDY KJIAZIDLGQNAQXBVWRSWGXPPAJMPDUPPVWAVNAORHUUWNBLNFM BSSAPPDVGCGCWFWYDYZEWOPETHDLMUZURXKJHMKJYUBVOJWYDY UGCYZPZIDLXFLWPCFSEXZRWFERWFIXDTYYWUVJPNAJZURXTFHZ OAXALZXITHDLBSKLZOXIUKJPYYSHLWCIAJXFZURLVWUNPCHUPT XZHCAJANBWLPKMHUVCWRKXKMBVCXCTHUHMNCQXVBTCNGADRHPC KWUGRRKBRQXFPGWAMUDYIXDLKJJSDUOGQTRRKBXILBUGBBIPDL XZZUWAOSDLYYZPYAZSVBKBGCJPUJXLLHDYIAKBVBZENMVCRWFA
Again, Jarl Van Eycke and Louie Helm found the solution. Last Saturday, they published it as a comment to my blog post about the challenge. Here is the plaintext they found:
IN THE FOLLOWING YEAR FOSTER ACQUIRED THE RIGHT TO USE THE
NAME MARLBOROUGH AND THE MODEL DESIGNATION NINE HUNDRED
WHICH HAD ORIGINALLY BEEN USED FOR AN OPEN OPERATING SYSTEM
PRESENTED IN NINETEEN NINETY BY NORWEGIAN SOFTWARE DESIGNER
PETER IDE THE MARLBOROUGH NINE HUNDRED WHICH WAS PRODUCED IN
A GERMAN FACTORY IN PROBABLY OVER FOUR THOUSAND STYLES IS
FAR LESS WELL KNOWN THAN THE LION GAMMA THREE AND THERE ARE SOME
AMBIGUITIES AND INCONSISTENCIES REGARDING ITS PRODUCTION
NEVERTHELESS IT IS CONSIDERED A MODERN CLASSIC AND STATE OF
MASTERPIECE CARS CONNINGHAM THEN DESIGNED A NEW DISCUS
CONCEPT FOR THE THUNDERBIRD HARDWARE TRYING TO SOME DESIGN
FEATURES FROM THE MARLBOROUGH NINE HUNDRED THESE INCLUDE AN
IMPROVED KEYBOARD A NEW COLOR DISPLAY AND ADDITIONAL INPUT
DEVICES IN TWO THOUSAND ONE THE NEW MODEL WAS INTRODUCED TO THE
PRESS AT THE INFORMATION TECHNOLOGY CONVENTION IN NEW YORK
Congratulations to Jarl and Louie!
In order to create the plaintext, I took a German Wikipedia article, translated it to English using Deepl and changed all names, places, technical expressions, and numbers. To my regret, one word got lost during this procedure (“TRYING TO [INCLUDE] SOME DESIGN FEATURES”). I hope this mistake didn’t cause too much confusion.
As far as I see, there is only one difference between the original plaintext and Jarl and Louie’s version: PETER IDE instead of PER HILDE.
How Jarl and Louie solved the challenge
Louie Helm provided me the following explanation of how he and Jarl found the solution:
Jarl pointed out the new challenge [the Bigram 750 Challenge] to me. I put a new single-source 8-gram file to work in AZdecrypt. It produced some pretty convincing “phantom solves”, including one that began: “AT THE FESTIVAL OF THE BULLS…” which scored nearly as well as the eventual solution! I spent a day fruitlessly trying to solve the 750 using various strategies along with this false crib.
After a couple days of no progress, I switched to a development version of my 8-gram file that now includes all the twitter data and also some other new extremely large data sources. This model is generally a few percent better in solve accuracy given enough computing time.
After six hours, it found a solve that was 55.93% letter accurate (but I only noticed it after it had been running for ~14 hours total). It was pretty obvious that the underlying message AZdecrypt found was fairly single-topic, so I emailed it to Jarl, believing it was possible that at least some portion of it was correct. Then two hours later, AZDecrypt refined the solve to (what I would later be able to calculate was) a 69.49% letter-accurate solution.
Jarl helped me crib several portions that were completely unconstrained (with no repeated symbols anywhere in the ciphertext). After a few messages back and forth, we were able to clean up the solution to a satisfactory solve that we felt confident in. There are still a few portions we can not be completely sure of. The “Peter Ide” portion could easily be “Peak Code” or something similar. There’s not enough symbol reuse to know for sure.
The main differences between our approach to the 750 versus the 1000 bigram challenge was a slightly more powerful 8-gram model, using more aggressive annealing parameters, some generalized convergence improvements Jarl added to AZdecrypt, and being assisted by a new version of Peter Norvig’s word-gram model for automated word division that I recently re-wrote.
I’m planning to do some more testing to see how much shorter of a message like this we could have possibly solved. We’ve got to be very close to that limit. I mentioned to Jarl before we solved it that I was somewhat pessimistic about our chances. This was almost entirely due to the 750’s exceedingly high multiplicity.
Bigram challenge 1346 = 0.32392
Bigram challenge 1000 = 0.39400
Bigram challenge 750 = 0.45333 [!]
I’m sure there’s a few examples of substitution ciphers with high multiplicity that AZdecrypt can solve (or that have been solved in general). But I’m not aware of any similar ciphers off the top of my head with multiplicity >0.45 that have been solved. Do you know of any?
For reference, here’s the multiplicity of some solved ciphers:
z408 = Multiplicity: 0.13235
Beale 2 = Multiplicity: 0.23884
Jarlve Sagan = Multiplicity: 0.29411
169 = Multiplicity: 0.31952
four-line-king-bahler = Multiplicity: 0.3875
forty feet = Multiplicity: 0.43589
The last two are basically toy ciphers designed to be solved by hand. Perhaps you know more about this? Are there any non-trivial substitution ciphers with multiplicity >0.45333 that have been solved?
And for reference, when I first saw your 750 bigram challenge, I was immediately discouraged because it most reminded me of the unsolved alanbenji cipher (multiplicity: 0.4375).
Can a reader answer Louie’s question about multiplicity? The multiplicity of a ciphertext is defined as the number of letters in the alphabet used (for instance 26) devided by the number of letters in the text.
About Louie and Jarl
Jarl Van Eycke is a 38 year old cipher expert from Flanders, Belgium. Originally schooled as a graphic designer, he now works as a warehouse operator for a logistics provider, mostly handling parts for the automotive industry. He started out with cryptography in 2014 trying to solve the unsolved Zodiac 340 cryptogram and it has been a hobby ever since. Jarl is the author of AZdecrypt, a fast and powerful cipher solver, since late 2014.
Jarl has been assisted by Louie with AZdecrypt since early 2019. Louie mainly provided sophisticated n-gram information.
Louie Helm is an entrepreneur who works on genetics and machine learning day-to-day. Developing high-performance n-gram models is a hobby he picked up a few months ago.
Again, I’m very proud that some of my readers have broken this really tough challenge. I congratulate Jarl and Louie on this success! Keep up your great work!
Further reading: The Top 50 unsolved encrypted messages: 7. The cigaret case cryptogram